Group similar items together without labels! Discover hidden patterns in your data with unsupervised learning.
Clustering is unsupervised learning - you don't have labels telling you which group each item belongs to. The algorithm discovers groups on its own!
Clustering means: "Put things that are similar close together, and things that are different far apart, without anyone telling you what the groups are." You only have a list of items (e.g. customers, products); the algorithm finds natural groups by similarity.
Imagine you have a basket of fruits and a child who has never seen fruits before.
Classification (Supervised): You tell the child "this is an apple, this is an orange" → the child learns to identify new fruits.
Clustering (Unsupervised): You say "sort these into groups" → the child naturally groups by color, size, or shape WITHOUT knowing the names!
Example: 2D data (X and Y are two features). The same lesson in graph form, with clear X and Y axes: the points fall into three visible groups, Cluster A, Cluster B, and Cluster C.
| Industry | Clustering Use Case |
|---|---|
| Marketing | Customer segmentation (high-value, occasional, bargain hunters) |
| Retail | Product categorization, store grouping |
| Healthcare | Patient grouping by symptoms, disease subtypes |
| Social Media | Community detection, similar user grouping |
| Image Processing | Color quantization, image segmentation |
K-Means is the most popular clustering algorithm. It divides data into K clusters by finding K "centers" (centroids).
You need to place K pizza stores to serve a city. Where do you put them?
K-Means does exactly this! "Stores" = Centroids, "Customers" = Data points
Left: points (blue) and centroid (red). Right: after assigning points to centroid and moving centroid to their mean. Watch the red circle move.
Decide how many groups you want
Place K random points as starting cluster centers
Each data point joins the cluster of its closest centroid
Move each centroid to the average position of its cluster members
Keep assigning and updating until centroids stop moving
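The five steps above map directly onto a short loop. Here is a minimal from-scratch sketch in NumPy (the toy data and variable names are made up for illustration, and it assumes no cluster goes empty; in practice you would use scikit-learn, as shown next):

# From-scratch K-Means loop (illustrative sketch, not the course's code)
import numpy as np

rng = np.random.default_rng(42)
# Toy data: three blobs of 20 points each
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [5, 5], [0, 5]], 20, axis=0)
K = 3

# Step 2: start from K random data points as centroids
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):                                        # Step 5: repeat
    # Step 3: assign each point to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its members
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):               # centroids stopped moving
        break
    centroids = new_centroids

print("Final centroids:\n", centroids)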
# K-Means Clustering in Python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Step 1: Scale the data (important for K-Means!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Step 3: Get cluster labels
clusters = kmeans.labels_
print(f"Cluster assignments: {clusters}")

# Step 4: Get cluster centers
centers = kmeans.cluster_centers_
print(f"Cluster centers:\n{centers}")

# Step 5: Visualize (for 2D data)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.title('K-Means Clustering')
plt.show()
What is it? For each cluster, we measure how far every point is from its centroid. Intra-cluster distance means "distance inside the cluster." K-Means tries to make this as small as possible: points in the same group should be close together.
Imagine you have 30 students and 3 classrooms (K=3). Intra-cluster distance = how spread out the students are within each room. Good clustering = students in the same room are similar (e.g. same grade); they sit close together. Bad clustering = mixed grades in one room, so some sit far from the "center" of that room. K-Means keeps moving the "center" (centroid) and reassigning students until the total "spread" inside each room is minimized.
Formula (idea): for each cluster, sum the squared distance of every point to its centroid, then add that up over all K clusters. That total is called inertia or WCSS (within-cluster sum of squares). Lower inertia = tighter clusters, with points sitting close to their centroids.
Concept: Points and their centroid in one cluster (intra-cluster distances)
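To make the idea concrete, here is a small sketch that recomputes WCSS by hand and compares it with scikit-learn's value. It assumes the X_scaled array and the fitted kmeans object from the code above:

# Recompute WCSS (inertia) by hand and compare with sklearn's value
import numpy as np

wcss = 0.0
for k in range(kmeans.n_clusters):
    members = X_scaled[kmeans.labels_ == k]      # points assigned to cluster k
    center = kmeans.cluster_centers_[k]          # centroid of cluster k
    wcss += ((members - center) ** 2).sum()      # squared distances, summed

print(wcss, kmeans.inertia_)  # the two numbers should agree up to floating-point error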
What is it? While intra-cluster distance measures how tight each group is, inter-cluster distance measures how far apart different clusters are from each other. A good clustering has small intra (points close to their centroid) and large inter (centroids far from each other).
Intra = students inside one room are close together. Inter = Room A and Room B are in different corridors. We want rooms that are clearly separated (high inter) and students in each room sitting near each other (low intra).
In K-Means we minimize WCSS (intra). We donโt directly maximize inter-cluster distance, but when clusters are well separated, both happen: tight groups and far-apart centroids.
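A quick way to look at inter-cluster distance is to measure the distances between the fitted centroids. A short sketch, again assuming the kmeans object fitted earlier:

# Inter-cluster distances = pairwise distances between centroids
from sklearn.metrics import pairwise_distances

center_dists = pairwise_distances(kmeans.cluster_centers_)
print(center_dists)  # symmetric K x K matrix; large off-diagonal values = well-separated clusters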
Convergence means the algorithm stops when centroids barely move between steps. We repeat "assign points → update centroids" until the change in centroid positions is below a threshold (or the maximum number of iterations is reached).
Random initialization can give different results each run. K-Means++ is a smarter way to choose initial centroids: pick the first at random, then choose each next one with probability proportional to its squared distance from the nearest already-chosen centroid. This usually leads to better and more stable clusters.
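In scikit-learn, initialization and convergence are controlled by a few constructor arguments. The values below are only illustrative, not settings from the course notebook:

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='k-means++',   # smarter seeding of the initial centroids
    n_init=10,          # run 10 different initializations, keep the best (lowest inertia)
    max_iter=300,       # at most 300 assign/update rounds...
    tol=1e-4,           # ...or stop earlier once centroids barely move (convergence)
    random_state=42,
)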
K-Means uses Euclidean distance (straight-line, "as the crow flies"). For high-dimensional or grid-like data, Manhattan distance (sum of absolute differences along axes) is sometimes preferred, but scikit-learn's KMeans does not accept a distance metric argument; that idea corresponds to the K-Medians / K-Medoids family, available in separate packages (for example, KMedoids in scikit-learn-extra). For most tabular data, Euclidean K-Means is standard.
Animation concept: red = centroid, blue = data points. In real K-Means, points "snap" to the nearest centroid, then the centroid moves to the mean of its points; repeat until convergence.
K-Means is sensitive to outliers. One point far away can pull a centroid toward it and distort the whole cluster. Here's how to deal with it:
Always check for outliers before K-Means. Either clean them, robust-scale, or switch to an algorithm that handles outliers (e.g. DBSCAN).
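A tiny illustration of the problem, with made-up numbers (this is a sketch, not data from the course):

# One extreme point can distort a K-Means solution
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],    # tight group near (1.5, 1.5)
              [8, 8], [8, 9], [9, 8], [9, 9],    # tight group near (8.5, 8.5)
              [30, 30]])                         # a single far-away outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)
# The outlier typically grabs a centroid for itself, so the two real groups
# get merged into one cluster - exactly the kind of distortion described above.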
The biggest challenge in K-Means: How many clusters should you use?
Plot the "inertia" (within-cluster sum of squares, WCSS) for different K values. Look for the elbow: the point where the curve bends and adding more clusters gives diminishing returns.
Elbow method: X-axis = Number of clusters (K), Y-axis = Inertia (WCSS)
Drag the slider to change the number of clusters and see how the data gets grouped differently. The marked points are the centroids.
K=3 is the sweet spot here (matches the 3 natural groups). Too few merges groups, too many splits them!
# Elbow Method
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method - Finding Optimal K')
plt.show()
# Look for the "elbow" - where the curve bends
Measures how similar points are to their own cluster vs other clusters. Higher = better!
from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):  # Start from 2 (need at least 2 clusters)
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, 'go-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method - Finding Optimal K')
plt.show()
# Choose K with highest silhouette score
With K-Means, you have to pick the number of groups (K) before running the algorithm. But what if you don't know how many groups there are? What if you want to see all possible groupings, from "every item is its own group" down to "everything is one big group", and then pick the best level?
That's what Hierarchical Clustering does. It builds a tree of merges (called a dendrogram) that shows you exactly how items combined, step by step. You look at the tree and decide where to "cut" it.
Imagine a school with 30 students. At first, each student is their own "group of 1." Then the two most similar students merge into a pair, pairs merge into small friend groups, and the merging continues until the whole school is one big group.
The dendrogram is like a family tree that records every merge. You can "cut" the tree at any height to get 2, 3, 4, or any number of groups.
If you have 100 data points, you start with 100 clusters (each containing 1 point).
Measure distance between every pair. The closest pair merges into one cluster. Now you have 99 clusters.
Keep merging the two closest clusters. 99 → 98 → 97 → … → 1. Every merge is recorded.
The tree shows all merges. You "cut" at a height that gives you the number of clusters you want. A big vertical gap in the dendrogram = a natural place to cut.
A dendrogram is a tree diagram. The X-axis has the data points (or their indices). The Y-axis is the distance at which clusters merge. The higher two branches connect, the more different those clusters are.
Problem: An online store has 500 customers. You want to group them by spending behavior, but you don't know how many segments exist. Should it be 2? 3? 5?
Solution: Run hierarchical clustering. Look at the dendrogram. You see a big gap between 3 and 4 clusters, so you cut at 3. The 3 groups turn out to be: "Big Spenders," "Occasional Buyers," and "Window Shoppers." Now marketing can tailor campaigns for each group!
When two clusters have multiple points, how do you measure the distance between them? There are several strategies:
| Method | How It Measures Distance | When to Use |
|---|---|---|
| Ward (most common) | Minimizes the total variance within clusters when merging | Default choice; produces compact, evenly sized clusters |
| Single | Distance between the two closest points of each cluster | Can find chain-like, elongated clusters |
| Complete | Distance between the two farthest points of each cluster | Produces compact clusters; sensitive to outliers |
| Average | Average distance between all pairs of points | Compromise between single and complete |
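If you're unsure which linkage to use, a quick sketch is to fit each one on the same scaled data and compare silhouette scores (this assumes an X_scaled array like the one built in the example below; the number of clusters is just an example):

# Compare linkage strategies via silhouette score
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

for link in ['ward', 'single', 'complete', 'average']:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X_scaled)
    print(f"{link:>8}: silhouette = {silhouette_score(X_scaled, labels):.3f}")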
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# --- Step 1: Prepare data (e.g. customer data) ---
# Columns: Annual Spending ($), Visit Frequency (per month)
X = np.array([[120, 15], [130, 14], [22, 3], [28, 4],
              [300, 25], [280, 22], [25, 2], [135, 16]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 2: Build the dendrogram ---
linkage_matrix = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix, labels=['C1','C2','C3','C4','C5','C6','C7','C8'])
plt.title('Customer Dendrogram')
plt.xlabel('Customer')
plt.ylabel('Distance (Ward)')
plt.axhline(y=4, color='red', linestyle='--', label='Cut at 3 clusters')
plt.legend()
plt.show()

# --- Step 3: Cut the dendrogram to get cluster labels ---
labels_from_dendro = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels_from_dendro)

# --- OR: Use sklearn (same result, easier for pipelines) ---
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = model.fit_predict(X_scaled)
print("Sklearn labels:", labels_sklearn)
K-Means finds round blobs. But what if your data has weird shapes? Like a crescent moon and a circle, or clusters with outliers (noise points that don't belong anywhere)? K-Means will force every point into a cluster, even the outliers. DBSCAN solves both problems: it finds clusters of any shape and says "this point is noise" for outliers.
Imagine looking at a city from above at night. You see dense clusters of lights: those are neighborhoods. Between them, there are dark, empty areas. And a few lone houses in the middle of nowhere.
DBSCAN does exactly this: it finds dense regions of points and calls sparse points "noise."
| Parameter | What It Means | Analogy |
|---|---|---|
| eps (epsilon) | The radius of the neighborhood around each point. "How close do two points need to be to be considered neighbors?" | If you shine a flashlight with radius eps from any point, which other points are lit up? |
| min_samples | The minimum number of points within eps distance to form a dense region (a "core point"). | A neighborhood needs at least min_samples houses within flashlight range to count as a real neighborhood. |
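The "flashlight" idea can be checked directly with a neighbor query. A small sketch (the eps value and the X_scaled array are placeholders matching the full DBSCAN example further below):

# How many points does the eps "flashlight" reach from point 0?
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(radius=0.3).fit(X_scaled)            # radius plays the role of eps
distances, indices = nn.radius_neighbors(X_scaled[[0]])    # neighbors of the first point
print(len(indices[0]), "points within eps of point 0")
# If this count >= min_samples, point 0 would be a core point.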
Core point: has at least min_samples neighbors within eps. It is in the heart of a cluster.
Border point: within eps of a core point, but doesn't have enough neighbors to be a core itself. It's on the edge.
Noise point: not within eps of any core point. It's an outlier; it doesn't belong to any cluster and is labeled -1.
Start with a random point.
Count how many points are within eps distance. If the count ≥ min_samples → it's a core point. Start a new cluster!
Add all neighbors to this cluster. Check their neighbors too โ if they are also core points, keep expanding. The cluster grows like a chain through dense regions.
Move to the next unvisited point. If it's a core point → start another cluster. If it doesn't have enough neighbors → mark it as noise (-1).
Every point is either in a cluster or labeled noise.
Problem: A bank has 1 million transactions. Most are normal, but some are fraudulent. Fraud transactions form small, dense clusters (e.g. same ATM, same time window), while isolated odd transactions are just noise.
Why DBSCAN? K-Means would force every transaction into a cluster, even the isolated weird ones. DBSCAN says "these 15 transactions form a suspicious cluster, and these 3 lonely transactions are just noise." The bank investigates only the real clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

# --- Step 1: Generate data with two crescent-moon shapes ---
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
# Add some outliers
outliers = np.array([[-1.5, 0.8], [2.5, -0.5], [0.5, 1.5]])
X = np.vstack([X, outliers])

# --- Step 2: Scale the data ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 3: Run DBSCAN ---
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# --- Step 4: Analyze results ---
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

# --- Step 5: Visualize ---
plt.figure(figsize=(10, 6))
# Cluster points
plt.scatter(X_scaled[labels != -1, 0], X_scaled[labels != -1, 1],
            c=labels[labels != -1], cmap='viridis', s=40, label='Clusters')
# Noise points (red X)
plt.scatter(X_scaled[labels == -1, 0], X_scaled[labels == -1, 1],
            c='red', marker='x', s=100, linewidths=2, label='Noise')
plt.title('DBSCAN: Two Moon Shapes + Outliers Detected')
plt.legend()
plt.show()
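Once DBSCAN is fitted, the core / border / noise distinction described earlier can be read straight off the model. A short sketch using the dbscan object and X_scaled from the code above:

# Separate core, border, and noise points after fitting
import numpy as np

core_mask = np.zeros(len(X_scaled), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True    # core points found by DBSCAN
noise_mask = dbscan.labels_ == -1                # points labeled as noise
border_mask = ~core_mask & ~noise_mask           # in a cluster, but not dense enough to be core

print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")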
This is the hardest part. Here are practical tips:
For min_samples, a common rule of thumb is 2 * number_of_features; for 2D data, 4 or 5. For eps, plot the k-distance graph and look for the elbow:

from sklearn.neighbors import NearestNeighbors

# k-distance graph to find eps
k = 5  # same as min_samples
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Sort the k-th nearest neighbor distances
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'{k}-th nearest neighbor distance')
plt.title('k-Distance Graph (find the elbow for eps)')
plt.axhline(y=0.3, color='red', linestyle='--', label='eps = 0.3')
plt.legend()
plt.show()
Here's a decision guide. Think about your data and pick the right tool:
| Algorithm | Cluster Shape | Handles Outliers | Need K? | Speed | Best For |
|---|---|---|---|---|---|
| K-Means | Spherical / round blobs | No (forces all points into clusters) | Yes | Very fast | Large data, well-separated round clusters |
| Hierarchical | Any (depends on linkage) | No | Optional (cut dendrogram) | Slow for large data | Small-medium data, exploring hierarchy |
| DBSCAN | Any shape! | Yes! (labels them -1) | No | Medium | Irregular shapes, outlier detection |
Do you know how many clusters? → Yes → K-Means (fast and simple).
Do you have outliers? → Yes → DBSCAN (labels noise).
Want to explore the cluster hierarchy? → Yes → Hierarchical (dendrogram).
Data is huge (millions of rows)? → K-Means (fastest).
Clusters are weird shapes? → DBSCAN.
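To see the differences concretely, here is a sketch that runs all three algorithms on the moon-shaped data from the DBSCAN example (it reuses X_scaled from above; the parameter values are only illustrative):

# The three algorithms side by side on the same data
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

models = {
    'K-Means': KMeans(n_clusters=2, n_init=10, random_state=42),
    'Hierarchical': AgglomerativeClustering(n_clusters=2, linkage='ward'),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
}

for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()
    print(f"{name:>12}: {n_clusters} clusters, {n_noise} noise points")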
Check your understanding. Click an answer and you'll get instant feedback.
# Customer Segmentation Example
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sample customer data
customers = pd.DataFrame({
    'CustomerID': range(1, 101),
    'Annual_Income': np.random.randint(20000, 150000, 100),
    'Spending_Score': np.random.randint(1, 100, 100),
    'Purchase_Frequency': np.random.randint(1, 50, 100)
})

# Prepare features
X = customers[['Annual_Income', 'Spending_Score', 'Purchase_Frequency']]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal K
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K={k}: Silhouette = {score:.3f}")

# Apply final clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['Segment'] = kmeans.fit_predict(X_scaled)

# Analyze segments
segment_analysis = customers.groupby('Segment').agg({
    'Annual_Income': 'mean',
    'Spending_Score': 'mean',
    'Purchase_Frequency': 'mean',
    'CustomerID': 'count'
}).rename(columns={'CustomerID': 'Count'})

print("\nCustomer Segments:")
print(segment_analysis.round(0))
The course source uses Hotel Reservations.csv: data = pd.read_csv("Hotel Reservations.csv"). Use it to cluster customer bookings (e.g. select numeric columns like lead_time, avg_price_per_room, no_of_weekend_nights). Scale with StandardScaler, then KMeans(n_clusters=k).fit(X); try elbow plot or silhouette to pick k. Download Hotel Reservations.csv from the datasets page. See Clustering.pdf in the course source for slides.
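A minimal sketch of that recipe (the column names come from the course notebook; the chosen columns and the range of k are only examples):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = pd.read_csv("Hotel Reservations.csv")
X = data[['lead_time', 'avg_price_per_room', 'no_of_weekend_nights']]
X_scaled = StandardScaler().fit_transform(X)

# Try a few values of k and compare silhouette scores
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: silhouette = {silhouette_score(X_scaled, labels):.3f}")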
Every line of code from the course notebook is below (verbatim). Comments may be explained elsewhere; the code is unchanged.
# --- Code cell 1 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")
# --- Code cell 2 ---
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# --- Code cell 3 ---
data = pd.read_csv("Hotel Reservations.csv")
# --- Code cell 4 ---
# We have a data of hotel reservations
# Use it for clustering customer bookings to identify patterns
# --- Code cell 7 ---
data.head(100)
# --- Code cell 8 ---
data.info()
# --- Code cell 9 ---
data.describe(include ='all')
# --- Code cell 10 ---
print(data['type_of_meal_plan'].value_counts())
# --- Code cell 11 ---
print(data['room_type_reserved'].value_counts())
# --- Code cell 12 ---
print(data['market_segment_type'].value_counts())
# --- Code cell 13 ---
print(data['booking_status'].value_counts())
# --- Code cell 14 ---
print(data['required_car_parking_space'].value_counts())
# --- Code cell 15 ---
print(data['repeated_guest'].value_counts())
# --- Code cell 16 ---
print(data['no_of_previous_bookings_not_canceled'].value_counts().head(10))
# As there are only ~3% repeat customers using previous booking data is not significant
# --- Code cell 17 ---
print(data['no_of_adults'].value_counts())
# --- Code cell 18 ---
print(data['no_of_children'].value_counts())
#Number of children 9 and 10 looks like outlier
# --- Code cell 19 ---
#oulier removal
data = data[data['no_of_children']<=3]
data.reset_index(inplace=True)
# --- Code cell 20 ---
print(data.columns)
# --- Code cell 21 ---
numerical_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'required_car_parking_space', 'lead_time','repeated_guest',
'avg_price_per_room', 'no_of_special_requests']
# --- Code cell 22 ---
categorical_features = ['type_of_meal_plan','room_type_reserved','market_segment_type']
# --- Code cell 23 ---
data_features = data[numerical_features + categorical_features + ['booking_status'] ]
data_features.head(10)
# --- Code cell 24 ---
data.repeated_guest.value_counts()
# --- Code cell 27 ---
x_train = data_features[numerical_features + categorical_features]
# There is nothing to predict in clustering
# We are just storing booking status flag in another variable to
# check later if the clusters have some pattern w.r.t booking cancellation
y_label = data_features[['booking_status']]
# --- Code cell 28 ---
x_train = pd.get_dummies(x_train, columns =categorical_features, drop_first= False)
print(x_train.columns)
# --- Code cell 29 ---
x_train.head(10)
# --- Code cell 32 ---
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_train.describe()
x_train_copy = x_train.copy()
# --- Code cell 33 ---
x_train
# --- Code cell 34 ---
x_train_copy.to_csv("x_train.csv",index= False)
# --- Code cell 36 ---
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto").fit(x_train)
x_train['cluster_labels'] = kmeans.labels_
x_train['booking_status'] = y_label['booking_status']
# --- Code cell 37 ---
from sklearn.metrics import silhouette_score
silhouette_score(x_train_copy, kmeans.labels_)
# --- Code cell 38 ---
x_train['cluster_labels'].value_counts()
# --- Code cell 41 ---
x_train['booking_status'].value_counts()
# --- Code cell 42 ---
print("cancellation rate in data:", 100*11884/(11884 + 24388))
# --- Code cell 43 ---
cluster_number = []
cancellation_rate = []
for z in range(len(list(x_train['cluster_labels'].unique()))):
cluster_number.append(z)
temp = x_train[x_train['cluster_labels']==z]
temp_cancelled = temp[temp['booking_status']=='Canceled']
temp_not_cancelled = temp[temp['booking_status']=='Not_Canceled']
cancel = (len(temp_cancelled)/len(temp))*100
cancellation_rate.append(cancel)
# --- Code cell 44 ---
temp = pd.DataFrame({'cluster':cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x = 'cluster',y = 'cancellation', data = temp)
# --- Code cell 46 ---
data['cluster'] = kmeans.labels_
data.to_csv('clustering_results.csv')
# --- Code cell 48 ---
# Check average value of numerical features across clusters
# --- Code cell 49 ---
plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
temp = pd.DataFrame(data.groupby('cluster')['no_of_adults'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_adults', data = temp)
plt.subplot(3,3,2)
temp = pd.DataFrame(data.groupby('cluster')['no_of_children'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_children', data = temp)
plt.subplot(3,3,3)
temp = pd.DataFrame(data.groupby('cluster')['no_of_weekend_nights'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_weekend_nights',data = temp)
plt.subplot(3,3,4)
temp = pd.DataFrame(data.groupby('cluster')['no_of_week_nights'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_week_nights', data = temp)
plt.subplot(3,3,5)
temp = pd.DataFrame(data.groupby('cluster')['required_car_parking_space'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'required_car_parking_space', data = temp)
plt.subplot(3,3,6)
temp = pd.DataFrame(data.groupby('cluster')['lead_time'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'lead_time',data = temp)
plt.subplot(3,3,7)
temp = pd.DataFrame(data.groupby('cluster')['avg_price_per_room'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'avg_price_per_room', data = temp)
plt.subplot(3,3,8)
temp = pd.DataFrame(data.groupby('cluster')['no_of_special_requests'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_special_requests', data = temp)
plt.show()
# --- Code cell 51 ---
# Check frequency of categorical features across clusters
# --- Code cell 52 ---
plt.figure(figsize=(20, 12))
plt.subplot(2,2,1)
sns.countplot(x='cluster', hue='type_of_meal_plan', data=data)
plt.subplot(2,2,2)
sns.countplot(x='cluster', hue='room_type_reserved', data=data)
plt.subplot(2,2,3)
sns.countplot(x='cluster', hue='market_segment_type', data=data)
plt.subplot(2,2,4)
sns.countplot(x='cluster', hue='repeated_guest', data=data)
plt.show()
# --- Code cell 55 ---
from sklearn.cluster import KMeans
wcss = []
for i in range(2, 30):
kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init= 'auto', random_state = 42)
kmeans.fit(x_train_copy)
wcss.append(kmeans.inertia_)
# --- Code cell 57 ---
import matplotlib.pyplot as plt
K = range(2, 30)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Within cluster Sum of Squared distances')
plt.title('The Elbow Method')
plt.show()
Every line of code from the course notebook (verbatim).
# --- Code cell 1 ---
import pandas as pd
# --- Code cell 2 ---
# read original data and feature matrix
# --- Code cell 3 ---
x_train = pd.read_csv("x_train.csv")
data = pd.read_csv("Hotel Reservations.csv")
# --- Code cell 4 ---
# reduce the number of rows in data if you face memory issues
# This is just to see end to end execution - not a recommended step
x_train = x_train[0:10000]
data = data[0:10000]
# --- Code cell 5 ---
x_train_copy = x_train.copy()
# --- Code cell 7 ---
# Try agglomerative clustering with cosine distance metric and distance threshold as input
# You can also specify n_clusters and set distance_threshold to None
# You can try different distance metrics and linkage criterias
# --- Code cell 8 ---
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering( n_clusters = None,
linkage = 'complete',
distance_threshold = 0.5, # if n_clusters is number then this should be None
metric = 'cosine')
clustering.fit(x_train)
x_train['cluster_labels'] = clustering.labels_
x_train['booking_status'] = data['booking_status']
print(x_train['cluster_labels'].value_counts())
# --- Code cell 10 ---
# get cancellation rate in each cluster
# --- Code cell 11 ---
cluster_number = []
cancellation_rate = []
for z in range(len(list(x_train['cluster_labels'].unique()))):
cluster_number.append(z)
temp = x_train[x_train['cluster_labels']==z]
temp_cancelled = temp[temp['booking_status']=='Canceled']
temp_not_cancelled = temp[temp['booking_status']=='Not_Canceled']
cancel = (len(temp_cancelled)/len(temp))*100
cancellation_rate.append(cancel)
# --- Code cell 13 ---
import seaborn as sns
temp = pd.DataFrame({'cluster':cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x = 'cluster',y = 'cancellation', data = temp)
# --- Code cell 15 ---
# get silehoutee score
# --- Code cell 16 ---
from sklearn.metrics import silhouette_score
silhouette_score(x_train_copy, clustering.labels_)
In one sentence: why would you choose DBSCAN over K-Means when your dataset has many outliers or oddly shaped clusters? (Think: density, noise, and not having to pick K.)
To master clustering, you need to know every core concept; the non-core details are what make you stand out in interviews and real projects.
DBSCAN's key parameters: eps (max distance for neighbors) and min_samples (min points to form a core point).

| Concept | Key Points |
|---|---|
| Clustering | Unsupervised learning - finds natural groups without labels |
| K-Means | Fast, simple; finds K spherical clusters using centroids |
| Choosing K | Elbow method (inertia) or Silhouette score |
| DBSCAN | Density-based; finds any shape, detects outliers |
| Scaling | ALWAYS scale your data before clustering! |
Every single line of the K-Means notebook, explained like you're five years old.
K-Means Code Walkthrough