🎯 Clustering

Group similar items together without labels! Discover hidden patterns in your data with unsupervised learning.

Part 1: What is Clustering?

Clustering is unsupervised learning - you don't have labels telling you which group each item belongs to. The algorithm discovers groups on its own!

👶 In One Sentence (Like You're 5)

Clustering means: "Put things that are similar close together, and things that are different far apart, without anyone telling you what the groups are." You only have a list of items (e.g. customers, products); the algorithm finds natural groups by similarity.

๐ŸŽ๐ŸŠ๐Ÿ‡ The Fruit Sorting Analogy

Imagine you have a basket of fruits and a child who has never seen fruits before.

Classification (Supervised): You tell the child "this is an apple, this is an orange" → the child learns to identify new fruits.

Clustering (Unsupervised): You say "sort these into groups" → the child naturally groups by color, size, or shape WITHOUT knowing the names!

Clustering Discovers Natural Groups

Example: 2D data (X and Y are two features). Same lesson in graph form, with clear X and Y axes.

Figure: Scatter plot. X-axis = one feature, Y-axis = another. Colors mark the three clusters found by the algorithm (Cluster A in red, Cluster B in blue, Cluster C in green).

Real-World Applications

Industry Clustering Use Case
Marketing Customer segmentation (high-value, occasional, bargain hunters)
Retail Product categorization, store grouping
Healthcare Patient grouping by symptoms, disease subtypes
Social Media Community detection, similar user grouping
Image Processing Color quantization, image segmentation

Part 2: K-Means Clustering

K-Means is the most popular clustering algorithm. It divides data into K clusters by finding K "centers" (centroids).

๐Ÿ“ The Pizza Delivery Analogy

You need to place K pizza stores to serve a city. Where do you put them?

  1. Start by randomly placing K stores
  2. Each customer goes to the nearest store
  3. Move each store to the center of its customer area
  4. Repeat until stores stop moving!

K-Means does exactly this! "Stores" = Centroids, "Customers" = Data points
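
To make the loop concrete, here is a minimal NumPy sketch of one assign/update round. The toy data and variable names are made up for illustration; this is not the full algorithm, just the two moves it repeats:

import numpy as np

rng = np.random.default_rng(42)
points = rng.normal(size=(100, 2))                             # the "customers"
centroids = points[rng.choice(len(points), 3, replace=False)]  # 3 random "stores"

# Assign: each point goes to its nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Update: move each centroid to the mean of its assigned points
centroids = np.array([points[assignments == k].mean(axis=0) for k in range(3)])
# Real K-Means repeats assign + update until the centroids stop moving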

🎬 Animated: One K-Means Iteration (Assign → Update)

Left: points (blue) and centroid (red). Right: after assigning the points to the centroid and moving the centroid to their mean. Watch the red circle move.

Animation: before vs after the update; the centroid moves to the mean of its points.

K-Means Algorithm Steps

  1. Choose K (number of clusters): decide how many groups you want.
  2. Initialize K random centroids: place K random points as starting cluster centers.
  3. Assign each point to its nearest centroid: each data point joins the cluster of its closest centroid.
  4. Update centroids: move each centroid to the average position of its cluster members.
  5. Repeat until convergence: keep assigning and updating until the centroids stop moving.

# K-Means Clustering in Python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example data: X is the feature matrix (swap in your own data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Step 1: Scale the data (important for K-Means!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Step 3: Get cluster labels
clusters = kmeans.labels_
print(f"Cluster assignments: {clusters}")

# Step 4: Get cluster centers
centers = kmeans.cluster_centers_
print(f"Cluster centers:\n{centers}")

# Step 5: Visualize (for 2D data)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.title('K-Means Clustering')
plt.show()

๐Ÿ“ Intra-Cluster Distance (Within-Cluster Sum of Squares โ€“ WCSS)

What is it? For each cluster, we measure how far every point is from its centroid. Intra-cluster distance means "distance inside the cluster." K-Means tries to make this as small as possible: points in the same group should be close together.

๐Ÿ  Layman Example: Students in Classrooms

Imagine you have 30 students and 3 classrooms (K=3). Intra-cluster distance = how spread out the students are within each room. Good clustering = students in the same room are similar (e.g. same grade); they sit close together. Bad clustering = mixed grades in one room, so some sit far from the "center" of that room. K-Means keeps moving the "center" (centroid) and reassigning students until the total "spread" inside each room is minimized.

Formula (idea): For each cluster, sum the squared distance of every point to its centroid, then add that up over all K clusters: WCSS = Σ_k Σ_{x in cluster k} ||x - centroid_k||². That total is called Inertia or WCSS. Lower inertia = tighter clusters. (How far apart the clusters are from each other is a separate measure, covered next.)
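
As a sanity check, here is a small sketch that computes WCSS by hand and compares it with scikit-learn's inertia_. It reuses X_scaled from the K-Means snippet above:

# Sketch: WCSS/inertia computed manually
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_scaled)

wcss = 0.0
for k in range(kmeans.n_clusters):
    members = X_scaled[kmeans.labels_ == k]      # points in cluster k
    centroid = kmeans.cluster_centers_[k]        # their centroid
    wcss += ((members - centroid) ** 2).sum()    # squared distances, summed

print(wcss, kmeans.inertia_)  # should match up to floating-point error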

Figure: One cluster and its intra-cluster distances. X-axis = Feature 1, Y-axis = Feature 2. Red = centroid (C), purple = data points, orange dashed lines = the distances counted in WCSS.

๐Ÿ“ Inter-Cluster Distance (Between Clusters)

What is it? While intra-cluster distance measures how tight each group is, inter-cluster distance measures how far apart different clusters are from each other. A good clustering has small intra (points close to their centroid) and large inter (centroids far from each other).

๐Ÿ  Layman Example: Classrooms Again

Intra = students inside one room are close together. Inter = Room A and Room B are in different corridors. We want rooms that are clearly separated (high inter) and students in each room sitting near each other (low intra).

In K-Means we minimize WCSS (intra). We don't directly maximize inter-cluster distance, but when clusters are well separated, both happen: tight groups and far-apart centroids.
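
scikit-learn has no single built-in "inter-cluster score," but a quick sketch of centroid separation (using the fitted kmeans from the WCSS sketch above) looks like this:

from scipy.spatial.distance import pdist, squareform

# Pairwise distances between cluster centers: larger = better separated
centroid_distances = squareform(pdist(kmeans.cluster_centers_))
print(centroid_distances)  # entry [i, j] = distance between centroids i and j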

🔄 Convergence and Random Initialization

Convergence means the algorithm stops when centroids barely move between steps. We repeat "assign points → update centroids" until the change in centroid positions is below a threshold (or a maximum number of iterations is reached).

Random initialization can give different results each run. K-Means++ is a smarter way to choose initial centroids: pick the first at random, then choose the next ones with probability proportional to how far they are from already chosen centroids. This usually leads to better and more stable clusters. The sketch below shows where both ideas appear as parameters.
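
In scikit-learn these knobs are KMeans constructor parameters; a minimal sketch:

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # smarter initialization (scikit-learn's default)
    n_init=10,         # 10 random restarts; keep the run with lowest inertia
    max_iter=300,      # upper bound on assign/update iterations
    tol=1e-4,          # converged when centroid movement falls below this
    random_state=42,   # fix the seed for reproducible results
)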

๐Ÿ“ Distance Metrics (Euclidean vs Manhattan)

K-Means uses Euclidean distance by default (straight-line โ€œas the crow fliesโ€). For high-dimensional or grid-like data, Manhattan distance (sum of absolute differences along axes) is sometimes used; in scikit-learn you can use KMeans(metric='manhattan') with the K-Medians idea. For most tabular data, Euclidean is standard.
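
A tiny NumPy sketch of the two metrics, just to make the difference concrete:

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(((a - b) ** 2).sum())  # straight line: 5.0
manhattan = np.abs(a - b).sum()            # along the axes: 7.0
print(euclidean, manhattan)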

🎬 K-Means in Action (Animated)

Watch: red = centroid (pulses); blue = data points (slight float). In real K-Means, points "snap" to the nearest centroid, then the centroid moves to the center of its points. Repeat until stable.

  1. Assign: points → nearest centroid.
  2. Update: centroid → mean of its points.
  3. Repeat until convergence.

โš ๏ธ Dealing with Outliers in K-Means

K-Means is sensitive to outliers. One point far away can pull a centroid toward it and distort the whole cluster. Hereโ€™s how to deal with it:

๐Ÿ’ก Takeaway

Always check for outliers before K-Means. Either clean them, robust-scale, or switch to an algorithm that handles outliers (e.g. DBSCAN).
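
For the robust-scaling option, a minimal sketch (X is your raw feature matrix from the earlier snippet):

from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the IQR, so a few
# extreme points do not dominate the way they can with StandardScaler
X_robust = RobustScaler().fit_transform(X)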

Part 3: Choosing the Right K

The biggest challenge in K-Means: How many clusters should you use?

Method 1: Elbow Method

Plot the "inertia" (within-cluster sum of squares = WCSS) for different K values. Look for the elbow: where the curve bends and adding more clusters gives diminishing returns.

Figure: Elbow plot. X-axis = number of clusters (K), Y-axis = inertia (WCSS). As K increases, inertia drops; the "elbow" (red dot, here around K=5) suggests a good K.

When the data has 3 natural groups (as in the scatter example from Part 1), K=3 is the sweet spot. Too few clusters merges groups; too many splits them!

# Elbow Method
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method - Finding Optimal K')
plt.show()

# Look for the "elbow" - where the curve bends

Method 2: Silhouette Score

Measures how similar points are to their own cluster vs other clusters. Higher = better!

from sklearn.metrics import silhouette_score

silhouette_scores = []

for k in range(2, 11):  # Start from 2 (need at least 2 clusters)
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, 'go-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method - Finding Optimal K')
plt.show()

# Choose K with highest silhouette score

💡 Silhouette Score Interpretation

  • +1: Perfect! Points are very well matched to their cluster
  • 0: Points are on the boundary between clusters
  • -1: Points might be in the wrong cluster
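
silhouette_score reports the average over all points; scikit-learn's silhouette_samples gives the per-point values behind this interpretation. A short sketch, reusing X_scaled and the labels from the last fit above:

from sklearn.metrics import silhouette_samples

per_point = silhouette_samples(X_scaled, labels)
print((per_point < 0).sum(), "points have negative scores (possibly misclustered)")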

Part 4: Hierarchical Clustering

👶 The Problem in Plain English

With K-Means, you have to pick the number of groups (K) before running the algorithm. But what if you don't know how many groups there are? What if you want to see all possible groupings, from "every item is its own group" down to "everything is one big group", and then pick the best level?

That's what Hierarchical Clustering does. It builds a tree of merges (called a dendrogram) that shows you exactly how items combined, step by step. You look at the tree and decide where to "cut" it.

๐Ÿซ The School Analogy

Imagine a school with 30 students. At first, each student is their own "group of 1."

  1. The two most similar students (say, best friends Alice and Bob) merge into a group of 2.
  2. Next, the two closest things merge; maybe Carol joins Alice+Bob, or two other students merge.
  3. This keeps going: small groups merge into bigger groups.
  4. Eventually, everyone is in one giant group.

The dendrogram is like a family tree that records every merge. You can "cut" the tree at any height to get 2, 3, 4, or any number of groups.

How It Works - Step by Step

  1. Start: every point is its own cluster. If you have 100 data points, you start with 100 clusters (each containing 1 point).
  2. Find the two closest clusters. Measure the distance between every pair; the closest pair merges into one cluster. Now you have 99 clusters.
  3. Repeat. Keep merging the two closest clusters: 99 → 98 → 97 → … → 1. Every merge is recorded.
  4. Read the dendrogram and cut. The tree shows all merges. You "cut" at a height that gives the number of clusters you want. A big vertical gap in the dendrogram = a natural place to cut.

What Is a Dendrogram?

A dendrogram is a tree diagram. The X-axis has the data points (or their indices). The Y-axis is the distance at which clusters merge. The higher two branches connect, the more different those clusters are.

Dendrogram - How to Read It

Figure: Dendrogram with data points A-E on the X-axis and merge distance on the Y-axis. A & B merge first (closest), C & D merge next, then AB joins CD, and finally E joins all. Cutting at the dashed line ✂️ yields 3 clusters: {A,B}, {C,D}, {E}.

🛒 Real-World Example: Customer Segmentation

Problem: An online store has 500 customers. You want to group them by spending behavior, but you don't know how many segments exist. Should it be 2? 3? 5?

Solution: Run hierarchical clustering and look at the dendrogram. You see a big gap between 3 and 4 clusters, so you cut at 3. The 3 groups turn out to be: "Big Spenders," "Occasional Buyers," and "Window Shoppers." Now marketing can tailor campaigns for each group!

Linkage Methods - How to Measure "Distance Between Clusters"

When two clusters have multiple points, how do you measure the distance between them? There are several strategies (compared in the sketch after this table):

Method | How It Measures Distance | When to Use
Ward (most common) | Minimizes the total variance within clusters when merging | Default choice; produces compact, evenly sized clusters
Single | Distance between the two closest points of each cluster | Can find chain-like, elongated clusters
Complete | Distance between the two farthest points of each cluster | Produces compact clusters; sensitive to outliers
Average | Average distance between all pairs of points | Compromise between single and complete
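
A small sketch comparing the four linkage strategies on the same toy data (make_blobs here is only illustrative):

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=50, centers=3, random_state=42)
for method in ['ward', 'single', 'complete', 'average']:
    Z = linkage(X_demo, method=method)               # merge tree
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut into 3 clusters
    print(method, 'final merge distance:', round(Z[-1, 2], 2))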

Python Code - Full Example

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# --- Step 1: Prepare data (e.g. customer data) ---
# Columns: Annual Spending ($), Visit Frequency (per month)
X = np.array([[120, 15], [130, 14], [22, 3], [28, 4],
              [300, 25], [280, 22], [25, 2], [135, 16]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 2: Build the dendrogram ---
linkage_matrix = linkage(X_scaled, method='ward')

plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix, labels=['C1','C2','C3','C4','C5','C6','C7','C8'])
plt.title('Customer Dendrogram')
plt.xlabel('Customer')
plt.ylabel('Distance (Ward)')
plt.axhline(y=4, color='red', linestyle='--', label='Cut at 3 clusters')
plt.legend()
plt.show()

# --- Step 3: Cut the dendrogram to get cluster labels ---
labels_from_dendro = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels_from_dendro)

# --- OR: Use sklearn (same result, easier for pipelines) ---
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = model.fit_predict(X_scaled)
print("Sklearn labels:", labels_sklearn)

🤔 When to Use Hierarchical Clustering Instead of K-Means

  • You don't know K and want to explore different numbers of clusters visually.
  • You want to see the hierarchy: which items are most similar, which merge early vs late.
  • Your dataset is small to medium (under ~10,000 points). For large data, K-Means is faster.
  • You need non-spherical clusters (single/complete linkage can find chains or bands).

Part 5: DBSCAN - Density-Based Clustering

👶 The Problem in Plain English

K-Means finds round blobs. But what if your data has weird shapes? Like a crescent moon and a circle, or clusters with outliers (noise points that don't belong anywhere)? K-Means will force every point into a cluster, even the outliers. DBSCAN solves both problems: it finds clusters of any shape and says "this point is noise" for outliers.

๐Ÿ™๏ธ The City Neighborhoods Analogy

Imagine looking at a city from above at night. You see dense clusters of lights โ€” those are neighborhoods. Between them, there are dark, empty areas. And a few lone houses in the middle of nowhere.

  • Dense areas (many lights close together) = Clusters
  • Dark gaps = Cluster boundaries
  • Lone houses far from everything = Outliers / Noise

DBSCAN does exactly this: it finds dense regions of points and calls sparse points "noise."

Two Parameters You Need to Know

Parameter | What It Means | Analogy
eps (epsilon) | The radius of the neighborhood around each point: "How close do two points need to be to count as neighbors?" | If you shine a flashlight with radius eps from any point, which other points are lit up?
min_samples | The minimum number of points within eps distance to form a dense region (a "core point") | A neighborhood needs at least min_samples houses within flashlight range to count as a real neighborhood

Three Types of Points

🟢 Core Point

Has at least min_samples neighbors within eps. It is in the heart of a cluster.

🟡 Border Point

Within eps of a core point, but doesn't have enough neighbors to be a core itself. It's on the edge.

🔴 Noise Point

Not within eps of any core point. It's an outlier that doesn't belong to any cluster. Labeled -1.
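
After fitting, scikit-learn's DBSCAN exposes which samples were core points via core_sample_indices_; everything labeled -1 is noise, and the rest are border points. A minimal sketch for counting the three types (assumes a scaled feature matrix X_scaled, as in the earlier snippets):

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)

core_mask = np.zeros(len(X_scaled), dtype=bool)
core_mask[db.core_sample_indices_] = True  # core points
noise_mask = db.labels_ == -1              # noise points
border_mask = ~core_mask & ~noise_mask     # everything else is border
print(core_mask.sum(), border_mask.sum(), noise_mask.sum())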

How DBSCAN Works - Step by Step

  1. Pick any unvisited point. Start with a random point.
  2. Check its neighborhood. Count how many points are within eps distance. If count ≥ min_samples → it's a core point. Start a new cluster!
  3. Expand the cluster. Add all neighbors to this cluster. Check their neighbors too: if they are also core points, keep expanding. The cluster grows like a chain through dense regions.
  4. Move to the next unvisited point. If it's a core point → start another cluster. If it doesn't have enough neighbors → mark it as noise (-1).
  5. Done when all points are visited. Every point is either in a cluster or labeled noise.

DBSCAN Visualization - Core, Border, and Noise Points

Figure: Green = core points (enough neighbors). Yellow = border points (near a core). Red = noise/outliers (too far from everything, labeled -1).

๐Ÿ—บ๏ธ Real-World Example: Finding Fraud Clusters

Problem: A bank has 1 million transactions. Most are normal, but some are fraudulent. Fraud transactions form small, dense clusters (e.g. same ATM, same time window), while isolated odd transactions are just noise.

Why DBSCAN? K-Means would force every transaction into a cluster, even the isolated weird ones. DBSCAN says "these 15 transactions form a suspicious cluster, and these 3 lonely transactions are just noise." The bank investigates only the real clusters.

Python Code - Full Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

# --- Step 1: Generate data with two crescent-moon shapes ---
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Add some outliers
outliers = np.array([[-1.5, 0.8], [2.5, -0.5], [0.5, 1.5]])
X = np.vstack([X, outliers])

# --- Step 2: Scale the data ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 3: Run DBSCAN ---
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# --- Step 4: Analyze results ---
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

# --- Step 5: Visualize ---
plt.figure(figsize=(10, 6))
# Cluster points
plt.scatter(X_scaled[labels != -1, 0], X_scaled[labels != -1, 1],
           c=labels[labels != -1], cmap='viridis', s=40, label='Clusters')
# Noise points (red X)
plt.scatter(X_scaled[labels == -1, 0], X_scaled[labels == -1, 1],
           c='red', marker='x', s=100, linewidths=2, label='Noise')
plt.title('DBSCAN: Two Moon Shapes + Outliers Detected')
plt.legend()
plt.show()

How to Choose eps and min_samples

This is the hardest part. Two practical tips: a common rule of thumb is min_samples ≈ 2 × the number of features (with a floor of about 4-5 for 2D data), and eps is usually read off a k-distance graph: sort every point's distance to its k-th nearest neighbor and look for the elbow, as in the sketch below.

from sklearn.neighbors import NearestNeighbors

# k-distance graph to find eps
k = 5  # same as min_samples
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Sort the k-th nearest neighbor distances
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'{k}-th nearest neighbor distance')
plt.title('k-Distance Graph (find the elbow for eps)')
plt.axhline(y=0.3, color='red', linestyle='--', label='eps = 0.3')
plt.legend()
plt.show()

🤔 When to Use DBSCAN Instead of K-Means

  • Your clusters are non-spherical (crescent moons, rings, irregular blobs).
  • You have outliers/noise that shouldn't be forced into a cluster.
  • You don't know K and don't want to guess.
  • Cluster sizes are very different (one big cluster, one tiny one).

Part 6: Which Algorithm Should I Use?

Here's a decision guide. Think about your data and pick the right tool:

Algorithm Cluster Shape Handles Outliers Need K? Speed Best For
K-Means Spherical / round blobs No (forces all points into clusters) Yes Very fast Large data, well-separated round clusters
Hierarchical Any (depends on linkage) No Optional (cut dendrogram) Slow for large data Small-medium data, exploring hierarchy
DBSCAN Any shape! Yes! (labels them -1) No Medium Irregular shapes, outlier detection

🎯 Quick Decision Tree

Do you know how many clusters? → Yes → K-Means (fast and simple).

Do you have outliers? → Yes → DBSCAN (labels noise).

Want to explore cluster hierarchy? → Yes → Hierarchical (dendrogram).

Data is huge (millions of rows)? → K-Means (fastest).

Clusters are weird shapes? → DBSCAN.

Test Yourself

Check your understanding with the quick questions below.

1. What does "intra-cluster distance" mean?

2. K-Means is sensitive to outliers. What can you do?

3. In the Elbow method, what is on the Y-axis?

4. Good clustering has _____ intra-cluster distance and _____ inter-cluster distance.

Part 7: Real Example - Customer Segmentation

# Customer Segmentation Example
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Sample customer data
customers = pd.DataFrame({
    'CustomerID': range(1, 101),
    'Annual_Income': np.random.randint(20000, 150000, 100),
    'Spending_Score': np.random.randint(1, 100, 100),
    'Purchase_Frequency': np.random.randint(1, 50, 100)
})

# Prepare features
X = customers[['Annual_Income', 'Spending_Score', 'Purchase_Frequency']]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal K
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K={k}: Silhouette = {score:.3f}")

# Apply final clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['Segment'] = kmeans.fit_predict(X_scaled)

# Analyze segments
segment_analysis = customers.groupby('Segment').agg({
    'Annual_Income': 'mean',
    'Spending_Score': 'mean',
    'Purchase_Frequency': 'mean',
    'CustomerID': 'count'
}).rename(columns={'CustomerID': 'Count'})

print("\n๐Ÿ“Š Customer Segments:")
print(segment_analysis.round(0))

📊 Interpreting the Segments

Cluster numbers are arbitrary (and the demo data above is random, so your profiles will differ), but a typical real-world segmentation looks like:

  • Segment 0 (High Income, High Spending): VIP customers - target for premium products
  • Segment 1 (Low Income, High Spending): At risk - may need financial products
  • Segment 2 (High Income, Low Spending): Potential - target with engagement campaigns
  • Segment 3 (Low Income, Low Spending): Budget conscious - target with discounts

🚫 Common Mistakes in Clustering

  • Not scaling features: if one feature is 0-100 and another is 0-1, distance is dominated by the first. Always scale (e.g. StandardScaler) before K-Means.
  • Assuming K is obvious: there is no "correct" K; use the elbow plot, silhouette score, and business sense together.
  • Using K-Means when clusters aren't round: K-Means assumes roughly spherical clusters; for odd shapes or lots of noise, consider DBSCAN or hierarchical.
  • Forgetting that clusters are unlabeled: the algorithm gives you group 0, 1, 2, …; you have to interpret what each group means.

📘 From the course notebook (Clustering)

The course source uses Hotel Reservations.csv: data = pd.read_csv("Hotel Reservations.csv"). Use it to cluster customer bookings (e.g. select numeric columns like lead_time, avg_price_per_room, no_of_weekend_nights). Scale with StandardScaler, then KMeans(n_clusters=k).fit(X); try elbow plot or silhouette to pick k. Download Hotel Reservations.csv from the datasets page. See Clustering.pdf in the course source for slides.
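
A condensed sketch of that flow (it assumes Hotel Reservations.csv sits next to your notebook; the column names below appear in the course dataset):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("Hotel Reservations.csv")
X = data[['lead_time', 'avg_price_per_room', 'no_of_weekend_nights']]
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_scaled)
print(pd.Series(labels).value_counts())  # cluster sizes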

Complete code from course notebook: kmeans_clustering.ipynb

Every line of code from the course notebook is below (verbatim). Comments may be explained elsewhere; the code is unchanged.

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>

h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# --- Code cell 3 ---
data =  pd.read_csv("Hotel Reservations.csv")

# --- Code cell 4 ---
# We have a data of hotel reservations
# Use it for clustering customer bookings to identify patterns

# --- Code cell 7 ---
data.head(100)

# --- Code cell 8 ---
data.info()

# --- Code cell 9 ---
data.describe(include ='all')

# --- Code cell 10 ---
print(data['type_of_meal_plan'].value_counts())

# --- Code cell 11 ---
print(data['room_type_reserved'].value_counts())

# --- Code cell 12 ---
print(data['market_segment_type'].value_counts())

# --- Code cell 13 ---
print(data['booking_status'].value_counts())

# --- Code cell 14 ---
print(data['required_car_parking_space'].value_counts())

# --- Code cell 15 ---
print(data['repeated_guest'].value_counts())

# --- Code cell 16 ---
print(data['no_of_previous_bookings_not_canceled'].value_counts().head(10))
# As there are only ~3% repeat customers using previous booking data is not significant

# --- Code cell 17 ---
print(data['no_of_adults'].value_counts())

# --- Code cell 18 ---
print(data['no_of_children'].value_counts())
#Number of children 9 and 10 looks like outlier

# --- Code cell 19 ---
#oulier removal
data = data[data['no_of_children']<=3]
data.reset_index(inplace=True)

# --- Code cell 20 ---
print(data.columns)

# --- Code cell 21 ---
numerical_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'required_car_parking_space', 'lead_time','repeated_guest',
       'avg_price_per_room', 'no_of_special_requests']

# --- Code cell 22 ---
categorical_features  = ['type_of_meal_plan','room_type_reserved','market_segment_type']

# --- Code cell 23 ---
data_features = data[numerical_features + categorical_features + ['booking_status']  ]
data_features.head(10)

# --- Code cell 24 ---
data.repeated_guest.value_counts()

# --- Code cell 27 ---
x_train  = data_features[numerical_features + categorical_features]

# There is nothing to predict in clustering
# We are just storing booking status flag in another variable to 
# check later if the clusters have some pattern w.r.t booking cancellation
y_label = data_features[['booking_status']]

# --- Code cell 28 ---
x_train = pd.get_dummies(x_train, columns =categorical_features, drop_first= False)
print(x_train.columns)

# --- Code cell 29 ---
x_train.head(10)

# --- Code cell 32 ---
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_train.describe()
x_train_copy = x_train.copy()

# --- Code cell 33 ---
x_train

# --- Code cell 34 ---
x_train_copy.to_csv("x_train.csv",index= False)

# --- Code cell 36 ---
from sklearn.cluster import KMeans


kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto").fit(x_train)
x_train['cluster_labels'] = kmeans.labels_
x_train['booking_status'] = y_label['booking_status']

# --- Code cell 37 ---
from sklearn.metrics import silhouette_score
silhouette_score(x_train_copy, kmeans.labels_)

# --- Code cell 38 ---
x_train['cluster_labels'].value_counts()

# --- Code cell 41 ---
x_train['booking_status'].value_counts()

# --- Code cell 42 ---
print("cancellation rate in data:", 100*11884/(11884 + 24388))

# --- Code cell 43 ---
cluster_number = []
cancellation_rate = []

for z in range(len(list(x_train['cluster_labels'].unique()))):
               cluster_number.append(z)
               temp = x_train[x_train['cluster_labels']==z]
               temp_cancelled = temp[temp['booking_status']=='Canceled']
               temp_not_cancelled = temp[temp['booking_status']=='Not_Canceled']
               cancel = (len(temp_cancelled)/len(temp))*100
               cancellation_rate.append(cancel)

# --- Code cell 44 ---
temp = pd.DataFrame({'cluster':cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x = 'cluster',y = 'cancellation', data = temp)

# --- Code cell 46 ---
data['cluster'] = kmeans.labels_
data.to_csv('clustering_results.csv')

# --- Code cell 48 ---
# Check average value of numerical features across clusters

# --- Code cell 49 ---

plt.figure(figsize=(20, 12))

plt.subplot(3,3,1)
temp = pd.DataFrame(data.groupby('cluster')['no_of_adults'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_adults', data = temp)

plt.subplot(3,3,2)
temp = pd.DataFrame(data.groupby('cluster')['no_of_children'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_children', data = temp)

plt.subplot(3,3,3)
temp = pd.DataFrame(data.groupby('cluster')['no_of_weekend_nights'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_weekend_nights',data = temp)

plt.subplot(3,3,4)
temp = pd.DataFrame(data.groupby('cluster')['no_of_week_nights'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_week_nights', data = temp)

plt.subplot(3,3,5)
temp = pd.DataFrame(data.groupby('cluster')['required_car_parking_space'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'required_car_parking_space', data = temp)

plt.subplot(3,3,6)
temp = pd.DataFrame(data.groupby('cluster')['lead_time'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'lead_time',data = temp)

plt.subplot(3,3,7)
temp = pd.DataFrame(data.groupby('cluster')['avg_price_per_room'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'avg_price_per_room', data = temp)

plt.subplot(3,3,8)
temp = pd.DataFrame(data.groupby('cluster')['no_of_special_requests'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_special_requests', data = temp)

plt.show()

# --- Code cell 51 ---
# Check frequency of categorical features across clusters

# --- Code cell 52 ---

plt.figure(figsize=(20, 12))

plt.subplot(2,2,1)
sns.countplot(x='cluster', hue='type_of_meal_plan', data=data)

plt.subplot(2,2,2)
sns.countplot(x='cluster', hue='room_type_reserved', data=data)

plt.subplot(2,2,3)
sns.countplot(x='cluster', hue='market_segment_type', data=data)

plt.subplot(2,2,4)
sns.countplot(x='cluster', hue='repeated_guest', data=data)

plt.show()

# --- Code cell 55 ---
from sklearn.cluster import KMeans
wcss = [] 

for i in range(2, 30): 
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init= 'auto', random_state = 42)
    kmeans.fit(x_train_copy) 
    wcss.append(kmeans.inertia_)

# --- Code cell 57 ---
import matplotlib.pyplot as plt
K = range(2, 30)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Within cluster Sum of Squared distances')
plt.title('The Elbow Method')
plt.show()

Complete code from course notebook: agglomerative_clustering.ipynb

Every line of code from the course notebook (verbatim).

# --- Code cell 1 ---
import pandas as pd

# --- Code cell 2 ---
# read original data and feature matrix

# --- Code cell 3 ---
x_train = pd.read_csv("x_train.csv")
data =  pd.read_csv("Hotel Reservations.csv")

# --- Code cell 4 ---
# reduce the number of rows in data if you face memory issues
# This is just to see end to end execution - not a recommended step

x_train = x_train[0:10000]
data = data[0:10000]

# --- Code cell 5 ---
x_train_copy = x_train.copy()

# --- Code cell 7 ---
# Try agglomerative clustering with cosine distance metric and distance threshold as input
# You can also specify n_clusters and set distance_threshold to None
# You can try different distance metrics and linkage criterias

# --- Code cell 8 ---
from sklearn.cluster import AgglomerativeClustering


clustering = AgglomerativeClustering( n_clusters = None,
                                      linkage = 'complete',
                                      distance_threshold = 0.5, # if n_clusters is number then this should be None
                                      metric = 'cosine')

clustering.fit(x_train)
x_train['cluster_labels'] = clustering.labels_
x_train['booking_status'] = data['booking_status']
print(x_train['cluster_labels'].value_counts())

# --- Code cell 10 ---
# get cancellation rate in each cluster

# --- Code cell 11 ---
cluster_number = []
cancellation_rate = []

for z in range(len(list(x_train['cluster_labels'].unique()))):
               cluster_number.append(z)
               temp = x_train[x_train['cluster_labels']==z]
               temp_cancelled = temp[temp['booking_status']=='Canceled']
               temp_not_cancelled = temp[temp['booking_status']=='Not_Canceled']
               cancel = (len(temp_cancelled)/len(temp))*100
               cancellation_rate.append(cancel)

# --- Code cell 13 ---
import seaborn as sns
temp = pd.DataFrame({'cluster':cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x = 'cluster',y = 'cancellation', data = temp)

# --- Code cell 15 ---
# get silehoutee score

# --- Code cell 16 ---
from sklearn.metrics import silhouette_score
silhouette_score(x_train_copy, clustering.labels_)

💭 Short reflection

In one sentence: why would you choose DBSCAN over K-Means when your dataset has many outliers or oddly shaped clusters? (Think: density, noise, and not having to pick K.)

Core & Non-Core Points โ€“ Mastery Checklist

To become a master of clustering, you must know every core point. Non-core points make you stand out in interviews and real projects.

✅ CORE (Must know - exam & job essentials)

  • Clustering is unsupervised โ€“ no labels; algorithm finds groups.
  • K-Means: choose K, initialize K centroids, assign each point to nearest centroid, update centroid to mean of its points, repeat until convergence.
  • Intra-cluster distance (WCSS/Inertia): sum of squared distances of points to their centroid; minimize it.
  • Inter-cluster distance: distance between clusters; good clustering = low intra, high inter.
  • Choosing K: Elbow method (plot inertia vs K; pick the elbow) and Silhouette score (higher = better).
  • Always scale features before K-Means (distance-based).
  • K-Means is sensitive to outliers; handle with removal, capping, RobustScaler, or DBSCAN.
  • DBSCAN: density-based; finds arbitrary shapes and labels outliers as -1; no need to set K.
  • Hierarchical clustering: dendrogram; can cut at desired K; good for small/medium data.
  • Compare algorithms: K-Means (spherical, needs K), Hierarchical (any shape, optional K), DBSCAN (any shape, outliers, no K).

📚 NON-CORE (Good to know - depth & interviews)

  • Convergence: stop when centroid movement is below threshold or max iterations reached.
  • K-Means++: smarter initialization; reduces bad local minima and improves stability.
  • Distance metrics: Euclidean (default), Manhattan for grid-like or high-dimensional data.
  • Silhouette interpretation: +1 (ideal), 0 (boundary), -1 (wrong cluster).
  • Multiple runs: run K-Means with different random seeds and pick the best inertia or silhouette.
  • Dendrogram: read height as "distance when clusters merge"; cut where the gap is large.
  • DBSCAN parameters: eps (max distance for neighbors), min_samples (min points to form core).
  • GMM (Gaussian Mixture): soft clustering (probabilities); can capture ellipsoidal clusters.
  • Use clustering for exploration (discover segments) and preprocessing (e.g. cluster then model per cluster).

Summary

Concept Key Points
Clustering Unsupervised learning - finds natural groups without labels
K-Means Fast, simple; finds K spherical clusters using centroids
Choosing K Elbow method (inertia) or Silhouette score
DBSCAN Density-based; finds any shape, detects outliers
Scaling ALWAYS scale your data before clustering!

🎯 Pro Tips

  • Always scale features - clustering is distance-based!
  • Try multiple K values - use both elbow and silhouette
  • Visualize results - even if just a 2D PCA plot (see the sketch after this list)
  • Interpret clusters - give them meaningful business names
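
For the visualization tip, a minimal sketch that projects scaled features to 2D with PCA, reusing X_scaled and clusters from the K-Means example in Part 2:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X_2d = PCA(n_components=2).fit_transform(X_scaled)  # project to 2 components
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap='viridis')
plt.title('Clusters projected to 2D with PCA')
plt.show()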

Want to see the code?

Every single line of the K-Means notebook explained like you are 5 years old.

K-Means Code Walkthrough