🎯 Clustering

Group similar items together without labels! Discover hidden patterns in your data with unsupervised learning.

Part 1: What is Clustering?

Clustering is unsupervised learning - you don't have labels telling you which group each item belongs to. The algorithm discovers groups on its own!

👶 In One Sentence (Like You're 5)

Clustering means: "Put things that are similar close together, and things that are different far apart, without anyone telling you what the groups are." You only have a list of items (e.g. customers, products); the algorithm finds natural groups by similarity.

๐ŸŽ๐ŸŠ๐Ÿ‡ The Fruit Sorting Analogy

Imagine you have a basket of fruits and a child who has never seen fruits before.

Classification (Supervised): You tell the child "this is an apple, this is an orange" → the child learns to identify new fruits.

Clustering (Unsupervised): You say "sort these into groups" → the child naturally groups by color, size, or shape WITHOUT knowing the names!

Clustering Discovers Natural Groups

Example: 2D data (X and Y are two features). Same lesson in graph form, with clear X and Y axes.

Figure: Scatter plot. X-axis = one feature, Y-axis = another. Colors mark the three clusters found by the algorithm (Cluster A in red, Cluster B in blue, Cluster C in green).

Real-World Applications

Industry Clustering Use Case
Marketing Customer segmentation (high-value, occasional, bargain hunters)
Retail Product categorization, store grouping
Healthcare Patient grouping by symptoms, disease subtypes
Social Media Community detection, similar user grouping
Image Processing Color quantization, image segmentation

Part 2: K-Means Clustering

K-Means is the most popular clustering algorithm. It divides data into K clusters by finding K "centers" (centroids).

๐Ÿ“ The Pizza Delivery Analogy

You need to place K pizza stores to serve a city. Where do you put them?

  1. Start by randomly placing K stores
  2. Each customer goes to the nearest store
  3. Move each store to the center of its customer area
  4. Repeat until stores stop moving!

K-Means does exactly this! "Stores" = Centroids, "Customers" = Data points
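
To make the loop concrete, here is a minimal NumPy sketch of one assign/update round. The toy data and variable names are made up for illustration; this is not the full algorithm, just the two moves it repeats:

import numpy as np

rng = np.random.default_rng(42)
points = rng.normal(size=(100, 2))                             # the "customers"
centroids = points[rng.choice(len(points), 3, replace=False)]  # 3 random "stores"

# Assign: each point goes to its nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Update: move each centroid to the mean of its assigned points
centroids = np.array([points[assignments == k].mean(axis=0) for k in range(3)])
# Real K-Means repeats assign + update until the centroids stop moving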

🎬 Animated: One K-Means Iteration (Assign → Update)

Left: points (blue) and centroid (red). Right: after assigning the points to the centroid and moving the centroid to their mean. Watch the red circle move.

Animation: before vs after the update; the centroid moves to the mean of its points.

K-Means Algorithm Steps

  1. Choose K (number of clusters): decide how many groups you want.
  2. Initialize K random centroids: place K random points as starting cluster centers.
  3. Assign each point to its nearest centroid: each data point joins the cluster of its closest centroid.
  4. Update centroids: move each centroid to the average position of its cluster members.
  5. Repeat until convergence: keep assigning and updating until the centroids stop moving.

# K-Means Clustering in Python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example data: X is the feature matrix (swap in your own data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Step 1: Scale the data (important for K-Means!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Step 3: Get cluster labels
clusters = kmeans.labels_
print(f"Cluster assignments: {clusters}")

# Step 4: Get cluster centers
centers = kmeans.cluster_centers_
print(f"Cluster centers:\n{centers}")

# Step 5: Visualize (for 2D data)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.title('K-Means Clustering')
plt.show()

๐Ÿ“ Intra-Cluster Distance (Within-Cluster Sum of Squares โ€“ WCSS)

What is it? For each cluster, we measure how far every point is from its centroid. Intra-cluster distance means "distance inside the cluster." K-Means tries to make this as small as possible: points in the same group should be close together.

๐Ÿ  Layman Example: Students in Classrooms

Imagine you have 30 students and 3 classrooms (K=3). Intra-cluster distance = how spread out the students are within each room. Good clustering = students in the same room are similar (e.g. same grade); they sit close together. Bad clustering = mixed grades in one room, so some sit far from the "center" of that room. K-Means keeps moving the "center" (centroid) and reassigning students until the total "spread" inside each room is minimized.

Formula (idea): For each cluster, sum the squared distance of every point to its centroid, then add that up over all K clusters: WCSS = Σ_k Σ_{x in cluster k} ||x - centroid_k||². That total is called Inertia or WCSS. Lower inertia = tighter clusters. (How far apart the clusters are from each other is a separate measure, covered next.)
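
As a sanity check, here is a small sketch that computes WCSS by hand and compares it with scikit-learn's inertia_. It reuses X_scaled from the K-Means snippet above:

# Sketch: WCSS/inertia computed manually
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_scaled)

wcss = 0.0
for k in range(kmeans.n_clusters):
    members = X_scaled[kmeans.labels_ == k]      # points in cluster k
    centroid = kmeans.cluster_centers_[k]        # their centroid
    wcss += ((members - centroid) ** 2).sum()    # squared distances, summed

print(wcss, kmeans.inertia_)  # should match up to floating-point error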

Figure: One cluster and its intra-cluster distances. X-axis = Feature 1, Y-axis = Feature 2. Red = centroid (C), purple = data points, orange dashed lines = the distances counted in WCSS.

๐Ÿ“ Inter-Cluster Distance (Between Clusters)

What is it? While intra-cluster distance measures how tight each group is, inter-cluster distance measures how far apart different clusters are from each other. A good clustering has small intra (points close to their centroid) and large inter (centroids far from each other).

๐Ÿ  Layman Example: Classrooms Again

Intra = students inside one room are close together. Inter = Room A and Room B are in different corridors. We want rooms that are clearly separated (high inter) and students in each room sitting near each other (low intra).

In K-Means we minimize WCSS (intra). We don't directly maximize inter-cluster distance, but when clusters are well separated, both happen: tight groups and far-apart centroids.
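
scikit-learn has no single built-in "inter-cluster score," but a quick sketch of centroid separation (using the fitted kmeans from the WCSS sketch above) looks like this:

from scipy.spatial.distance import pdist, squareform

# Pairwise distances between cluster centers: larger = better separated
centroid_distances = squareform(pdist(kmeans.cluster_centers_))
print(centroid_distances)  # entry [i, j] = distance between centroids i and j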

🔄 Convergence and Random Initialization

Convergence means the algorithm stops when centroids barely move between steps. We repeat "assign points → update centroids" until the change in centroid positions is below a threshold (or a maximum number of iterations is reached).

Random initialization can give different results each run. K-Means++ is a smarter way to choose initial centroids: pick the first at random, then choose the next ones with probability proportional to how far they are from already chosen centroids. This usually leads to better and more stable clusters. The sketch below shows where both ideas appear as parameters.
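
In scikit-learn these knobs are KMeans constructor parameters; a minimal sketch:

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # smarter initialization (scikit-learn's default)
    n_init=10,         # 10 random restarts; keep the run with lowest inertia
    max_iter=300,      # upper bound on assign/update iterations
    tol=1e-4,          # converged when centroid movement falls below this
    random_state=42,   # fix the seed for reproducible results
)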

๐Ÿ“ Distance Metrics (Euclidean vs Manhattan)

K-Means uses Euclidean distance by default (straight-line โ€œas the crow fliesโ€). For high-dimensional or grid-like data, Manhattan distance (sum of absolute differences along axes) is sometimes used; in scikit-learn you can use KMeans(metric='manhattan') with the K-Medians idea. For most tabular data, Euclidean is standard.
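
A tiny NumPy sketch of the two metrics, just to make the difference concrete:

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(((a - b) ** 2).sum())  # straight line: 5.0
manhattan = np.abs(a - b).sum()            # along the axes: 7.0
print(euclidean, manhattan)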

🎬 K-Means in Action (Animated)

Watch: red = centroid (pulses); blue = data points (slight float). In real K-Means, points "snap" to the nearest centroid, then the centroid moves to the center of its points. Repeat until stable.

  1. Assign: points → nearest centroid.
  2. Update: centroid → mean of its points.
  3. Repeat until convergence.

โš ๏ธ Dealing with Outliers in K-Means

K-Means is sensitive to outliers. One point far away can pull a centroid toward it and distort the whole cluster. Hereโ€™s how to deal with it:

๐Ÿ’ก Takeaway

Always check for outliers before K-Means. Either clean them, robust-scale, or switch to an algorithm that handles outliers (e.g. DBSCAN).
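
For the robust-scaling option, a minimal sketch (X is your raw feature matrix from the earlier snippet):

from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the IQR, so a few
# extreme points do not dominate the way they can with StandardScaler
X_robust = RobustScaler().fit_transform(X)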

Part 3: Choosing the Right K

The biggest challenge in K-Means: How many clusters should you use?

Method 1: Elbow Method

Plot the "inertia" (within-cluster sum of squares = WCSS) for different K values. Look for the elbow: where the curve bends and adding more clusters gives diminishing returns.

Figure: Elbow plot. X-axis = number of clusters (K), Y-axis = inertia (WCSS). As K increases, inertia drops; the "elbow" (red dot, here around K=5) suggests a good K.

When the data has 3 natural groups (as in the scatter example from Part 1), K=3 is the sweet spot. Too few clusters merges groups; too many splits them!

# Elbow Method
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method - Finding Optimal K')
plt.show()

# Look for the "elbow" - where the curve bends

Method 2: Silhouette Score

Measures how similar points are to their own cluster vs other clusters. Higher = better!

from sklearn.metrics import silhouette_score

silhouette_scores = []

for k in range(2, 11):  # Start from 2 (need at least 2 clusters)
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, 'go-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method - Finding Optimal K')
plt.show()

# Choose K with highest silhouette score

💡 Silhouette Score Interpretation

  • +1: Perfect! Points are very well matched to their cluster
  • 0: Points are on the boundary between clusters
  • -1: Points might be in the wrong cluster
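
silhouette_score reports the average over all points; scikit-learn's silhouette_samples gives the per-point values behind this interpretation. A short sketch, reusing X_scaled and the labels from the last fit above:

from sklearn.metrics import silhouette_samples

per_point = silhouette_samples(X_scaled, labels)
print((per_point < 0).sum(), "points have negative scores (possibly misclustered)")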

Part 4: Hierarchical Clustering

👶 The Problem in Plain English

With K-Means, you have to pick the number of groups (K) before running the algorithm. But what if you don't know how many groups there are? What if you want to see all possible groupings, from "every item is its own group" down to "everything is one big group", and then pick the best level?

That's what Hierarchical Clustering does. It builds a tree of merges (called a dendrogram) that shows you exactly how items combined, step by step. You look at the tree and decide where to "cut" it.

๐Ÿซ The School Analogy

Imagine a school with 30 students. At first, each student is their own "group of 1."

  1. The two most similar students (say, best friends Alice and Bob) merge into a group of 2.
  2. Next, the two closest things merge; maybe Carol joins Alice+Bob, or two other students merge.
  3. This keeps going: small groups merge into bigger groups.
  4. Eventually, everyone is in one giant group.

The dendrogram is like a family tree that records every merge. You can "cut" the tree at any height to get 2, 3, 4, or any number of groups.

How It Works - Step by Step

  1. Start: every point is its own cluster. If you have 100 data points, you start with 100 clusters (each containing 1 point).
  2. Find the two closest clusters. Measure the distance between every pair; the closest pair merges into one cluster. Now you have 99 clusters.
  3. Repeat. Keep merging the two closest clusters: 99 → 98 → 97 → … → 1. Every merge is recorded.
  4. Read the dendrogram and cut. The tree shows all merges. You "cut" at a height that gives the number of clusters you want. A big vertical gap in the dendrogram = a natural place to cut.

What Is a Dendrogram?

A dendrogram is a tree diagram. The X-axis has the data points (or their indices). The Y-axis is the distance at which clusters merge. The higher two branches connect, the more different those clusters are.

Dendrogram - How to Read It

Figure: Dendrogram with data points A-E on the X-axis and merge distance on the Y-axis. A & B merge first (closest), C & D merge next, then AB joins CD, and finally E joins all. Cutting at the dashed line ✂️ yields 3 clusters: {A,B}, {C,D}, {E}.

🛒 Real-World Example: Customer Segmentation

Problem: An online store has 500 customers. You want to group them by spending behavior, but you don't know how many segments exist. Should it be 2? 3? 5?

Solution: Run hierarchical clustering and look at the dendrogram. You see a big gap between 3 and 4 clusters, so you cut at 3. The 3 groups turn out to be: "Big Spenders," "Occasional Buyers," and "Window Shoppers." Now marketing can tailor campaigns for each group!

Linkage Methods - How to Measure "Distance Between Clusters"

When two clusters have multiple points, how do you measure the distance between them? There are several strategies (compared in the sketch after this table):

Method | How It Measures Distance | When to Use
Ward (most common) | Minimizes the total variance within clusters when merging | Default choice; produces compact, evenly sized clusters
Single | Distance between the two closest points of each cluster | Can find chain-like, elongated clusters
Complete | Distance between the two farthest points of each cluster | Produces compact clusters; sensitive to outliers
Average | Average distance between all pairs of points | Compromise between single and complete
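
A small sketch comparing the four linkage strategies on the same toy data (make_blobs here is only illustrative):

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=50, centers=3, random_state=42)
for method in ['ward', 'single', 'complete', 'average']:
    Z = linkage(X_demo, method=method)               # merge tree
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut into 3 clusters
    print(method, 'final merge distance:', round(Z[-1, 2], 2))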

Python Code - Full Example

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# --- Step 1: Prepare data (e.g. customer data) ---
# Columns: Annual Spending ($), Visit Frequency (per month)
X = np.array([[120, 15], [130, 14], [22, 3], [28, 4],
              [300, 25], [280, 22], [25, 2], [135, 16]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 2: Build the dendrogram ---
linkage_matrix = linkage(X_scaled, method='ward')

plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix, labels=['C1','C2','C3','C4','C5','C6','C7','C8'])
plt.title('Customer Dendrogram')
plt.xlabel('Customer')
plt.ylabel('Distance (Ward)')
plt.axhline(y=4, color='red', linestyle='--', label='Cut at 3 clusters')
plt.legend()
plt.show()

# --- Step 3: Cut the dendrogram to get cluster labels ---
labels_from_dendro = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels_from_dendro)

# --- OR: Use sklearn (same result, easier for pipelines) ---
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = model.fit_predict(X_scaled)
print("Sklearn labels:", labels_sklearn)

🤔 When to Use Hierarchical Clustering Instead of K-Means

  • You don't know K and want to explore different numbers of clusters visually.
  • You want to see the hierarchy: which items are most similar, which merge early vs late.
  • Your dataset is small to medium (under ~10,000 points). For large data, K-Means is faster.
  • You need non-spherical clusters (single/complete linkage can find chains or bands).

Part 5: DBSCAN - Density-Based Clustering

👶 The Problem in Plain English

K-Means finds round blobs. But what if your data has weird shapes? Like a crescent moon and a circle, or clusters with outliers (noise points that don't belong anywhere)? K-Means will force every point into a cluster, even the outliers. DBSCAN solves both problems: it finds clusters of any shape and says "this point is noise" for outliers.

๐Ÿ™๏ธ The City Neighborhoods Analogy

Imagine looking at a city from above at night. You see dense clusters of lights โ€” those are neighborhoods. Between them, there are dark, empty areas. And a few lone houses in the middle of nowhere.

  • Dense areas (many lights close together) = Clusters
  • Dark gaps = Cluster boundaries
  • Lone houses far from everything = Outliers / Noise

DBSCAN does exactly this: it finds dense regions of points and calls sparse points "noise."

Two Parameters You Need to Know

Parameter | What It Means | Analogy
eps (epsilon) | The radius of the neighborhood around each point: "How close do two points need to be to count as neighbors?" | If you shine a flashlight with radius eps from any point, which other points are lit up?
min_samples | The minimum number of points within eps distance to form a dense region (a "core point") | A neighborhood needs at least min_samples houses within flashlight range to count as a real neighborhood

Three Types of Points

🟢 Core Point

Has at least min_samples neighbors within eps. It is in the heart of a cluster.

🟡 Border Point

Within eps of a core point, but doesn't have enough neighbors to be a core itself. It's on the edge.

🔴 Noise Point

Not within eps of any core point. It's an outlier that doesn't belong to any cluster. Labeled -1.
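
After fitting, scikit-learn's DBSCAN exposes which samples were core points via core_sample_indices_; everything labeled -1 is noise, and the rest are border points. A minimal sketch for counting the three types (assumes a scaled feature matrix X_scaled, as in the earlier snippets):

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)

core_mask = np.zeros(len(X_scaled), dtype=bool)
core_mask[db.core_sample_indices_] = True  # core points
noise_mask = db.labels_ == -1              # noise points
border_mask = ~core_mask & ~noise_mask     # everything else is border
print(core_mask.sum(), border_mask.sum(), noise_mask.sum())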

How DBSCAN Works - Step by Step

  1. Pick any unvisited point. Start with a random point.
  2. Check its neighborhood. Count how many points are within eps distance. If count ≥ min_samples → it's a core point. Start a new cluster!
  3. Expand the cluster. Add all neighbors to this cluster. Check their neighbors too: if they are also core points, keep expanding. The cluster grows like a chain through dense regions.
  4. Move to the next unvisited point. If it's a core point → start another cluster. If it doesn't have enough neighbors → mark it as noise (-1).
  5. Done when all points are visited. Every point is either in a cluster or labeled noise.

DBSCAN Visualization - Core, Border, and Noise Points

Figure: Green = core points (enough neighbors). Yellow = border points (near a core). Red = noise/outliers (too far from everything, labeled -1).

๐Ÿ—บ๏ธ Real-World Example: Finding Fraud Clusters

Problem: A bank has 1 million transactions. Most are normal, but some are fraudulent. Fraud transactions form small, dense clusters (e.g. same ATM, same time window), while isolated odd transactions are just noise.

Why DBSCAN? K-Means would force every transaction into a cluster, even the isolated weird ones. DBSCAN says "these 15 transactions form a suspicious cluster, and these 3 lonely transactions are just noise." The bank investigates only the real clusters.

Python Code - Full Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

# --- Step 1: Generate data with two crescent-moon shapes ---
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Add some outliers
outliers = np.array([[-1.5, 0.8], [2.5, -0.5], [0.5, 1.5]])
X = np.vstack([X, outliers])

# --- Step 2: Scale the data ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 3: Run DBSCAN ---
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# --- Step 4: Analyze results ---
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

# --- Step 5: Visualize ---
plt.figure(figsize=(10, 6))
# Cluster points
plt.scatter(X_scaled[labels != -1, 0], X_scaled[labels != -1, 1],
           c=labels[labels != -1], cmap='viridis', s=40, label='Clusters')
# Noise points (red X)
plt.scatter(X_scaled[labels == -1, 0], X_scaled[labels == -1, 1],
           c='red', marker='x', s=100, linewidths=2, label='Noise')
plt.title('DBSCAN: Two Moon Shapes + Outliers Detected')
plt.legend()
plt.show()

How to Choose eps and min_samples

This is the hardest part. Two practical tips: a common rule of thumb is min_samples ≈ 2 × the number of features (with a floor of about 4-5 for 2D data), and eps is usually read off a k-distance graph: sort every point's distance to its k-th nearest neighbor and look for the elbow, as in the sketch below.

from sklearn.neighbors import NearestNeighbors

# k-distance graph to find eps
k = 5  # same as min_samples
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Sort the k-th nearest neighbor distances
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'{k}-th nearest neighbor distance')
plt.title('k-Distance Graph (find the elbow for eps)')
plt.axhline(y=0.3, color='red', linestyle='--', label='eps = 0.3')
plt.legend()
plt.show()

🤔 When to Use DBSCAN Instead of K-Means

  • Your clusters are non-spherical (crescent moons, rings, irregular blobs).
  • You have outliers/noise that shouldn't be forced into a cluster.
  • You don't know K and don't want to guess.
  • Cluster sizes are very different (one big cluster, one tiny one).

Part 6: Which Algorithm Should I Use?

Here's a decision guide. Think about your data and pick the right tool:

Algorithm Cluster Shape Handles Outliers Need K? Speed Best For
K-Means Spherical / round blobs No (forces all points into clusters) Yes Very fast Large data, well-separated round clusters
Hierarchical Any (depends on linkage) No Optional (cut dendrogram) Slow for large data Small-medium data, exploring hierarchy
DBSCAN Any shape! Yes! (labels them -1) No Medium Irregular shapes, outlier detection

🎯 Quick Decision Tree

Do you know how many clusters? → Yes → K-Means (fast and simple).

Do you have outliers? → Yes → DBSCAN (labels noise).

Want to explore cluster hierarchy? → Yes → Hierarchical (dendrogram).

Data is huge (millions of rows)? → K-Means (fastest).

Clusters are weird shapes? → DBSCAN.

Test Yourself

Check your understanding with the quick questions below.

1. What does "intra-cluster distance" mean?

2. K-Means is sensitive to outliers. What can you do?

3. In the Elbow method, what is on the Y-axis?

4. Good clustering has _____ intra-cluster distance and _____ inter-cluster distance.

Part 7: Real Example - Customer Segmentation

# Customer Segmentation Example
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Sample customer data
customers = pd.DataFrame({
    'CustomerID': range(1, 101),
    'Annual_Income': np.random.randint(20000, 150000, 100),
    'Spending_Score': np.random.randint(1, 100, 100),
    'Purchase_Frequency': np.random.randint(1, 50, 100)
})

# Prepare features
X = customers[['Annual_Income', 'Spending_Score', 'Purchase_Frequency']]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal K
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K={k}: Silhouette = {score:.3f}")

# Apply final clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['Segment'] = kmeans.fit_predict(X_scaled)

# Analyze segments
segment_analysis = customers.groupby('Segment').agg({
    'Annual_Income': 'mean',
    'Spending_Score': 'mean',
    'Purchase_Frequency': 'mean',
    'CustomerID': 'count'
}).rename(columns={'CustomerID': 'Count'})

print("\n๐Ÿ“Š Customer Segments:")
print(segment_analysis.round(0))

📊 Interpreting the Segments

Cluster numbers are arbitrary (and the demo data above is random, so your profiles will differ), but a typical real-world segmentation looks like:

  • Segment 0 (High Income, High Spending): VIP customers - target for premium products
  • Segment 1 (Low Income, High Spending): At risk - may need financial products
  • Segment 2 (High Income, Low Spending): Potential - target with engagement campaigns
  • Segment 3 (Low Income, Low Spending): Budget conscious - target with discounts

🚫 Common Mistakes in Clustering

  • Not scaling features: if one feature is 0-100 and another is 0-1, distance is dominated by the first. Always scale (e.g. StandardScaler) before K-Means.
  • Assuming K is obvious: there is no "correct" K; use the elbow plot, silhouette score, and business sense together.
  • Using K-Means when clusters aren't round: K-Means assumes roughly spherical clusters; for odd shapes or lots of noise, consider DBSCAN or hierarchical.
  • Forgetting that clusters are unlabeled: the algorithm gives you group 0, 1, 2, …; you have to interpret what each group means.

📘 From the course notebook (Clustering)

The course source uses Hotel Reservations.csv: data = pd.read_csv("Hotel Reservations.csv"). Use it to cluster customer bookings (e.g. select numeric columns like lead_time, avg_price_per_room, no_of_weekend_nights). Scale with StandardScaler, then KMeans(n_clusters=k).fit(X); try elbow plot or silhouette to pick k. Download Hotel Reservations.csv from the datasets page. See Clustering.pdf in the course source for slides.
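
A condensed sketch of that flow (it assumes Hotel Reservations.csv sits next to your notebook; the column names below appear in the course dataset):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("Hotel Reservations.csv")
X = data[['lead_time', 'avg_price_per_room', 'no_of_weekend_nights']]
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_scaled)
print(pd.Series(labels).value_counts())  # cluster sizes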

Complete code from course notebook: kmeans_clustering.ipynb

Every line of code from the course notebook is below (verbatim). Comments may be explained elsewhere; the code is unchanged.

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>

h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# --- Code cell 3 ---
data =  pd.read_csv("Hotel Reservations.csv")

# --- Code cell 4 ---
# We have a data of hotel reservations
# Use it for clustering customer bookings to identify patterns

# --- Code cell 7 ---
data.head(100)

# --- Code cell 8 ---
data.info()

# --- Code cell 9 ---
data.describe(include ='all')

# --- Code cell 10 ---
print(data['type_of_meal_plan'].value_counts())

# --- Code cell 11 ---
print(data['room_type_reserved'].value_counts())

# --- Code cell 12 ---
print(data['market_segment_type'].value_counts())

# --- Code cell 13 ---
print(data['booking_status'].value_counts())

# --- Code cell 14 ---
print(data['required_car_parking_space'].value_counts())

# --- Code cell 15 ---
print(data['repeated_guest'].value_counts())

# --- Code cell 16 ---
print(data['no_of_previous_bookings_not_canceled'].value_counts().head(10))
# As there are only ~3% repeat customers using previous booking data is not significant

# --- Code cell 17 ---
print(data['no_of_adults'].value_counts())

# --- Code cell 18 ---
print(data['no_of_children'].value_counts())
#Number of children 9 and 10 looks like outlier

# --- Code cell 19 ---
#oulier removal
data = data[data['no_of_children']<=3]
data.reset_index(inplace=True)

# --- Code cell 20 ---
print(data.columns)

# --- Code cell 21 ---
numerical_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'required_car_parking_space', 'lead_time','repeated_guest',
       'avg_price_per_room', 'no_of_special_requests']

# --- Code cell 22 ---
categorical_features  = ['type_of_meal_plan','room_type_reserved','market_segment_type']

# --- Code cell 23 ---
data_features = data[numerical_features + categorical_features + ['booking_status']  ]
data_features.head(10)

# --- Code cell 24 ---
data.repeated_guest.value_counts()

# --- Code cell 27 ---
x_train  = data_features[numerical_features + categorical_features]

# There is nothing to predict in clustering
# We are just storing booking status flag in another variable to 
# check later if the clusters have some pattern w.r.t booking cancellation
y_label = data_features[['booking_status']]

# --- Code cell 28 ---
x_train = pd.get_dummies(x_train, columns =categorical_features, drop_first= False)
print(x_train.columns)

# --- Code cell 29 ---
x_train.head(10)

# --- Code cell 32 ---
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_train.describe()
x_train_copy = x_train.copy()

# --- Code cell 33 ---
x_train

# --- Code cell 34 ---
x_train_copy.to_csv("x_train.csv",index= False)

# --- Code cell 36 ---
from sklearn.cluster import KMeans


kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto").fit(x_train)
x_train['cluster_labels'] = kmeans.labels_
x_train['booking_status'] = y_label['booking_status']

# --- Code cell 37 ---
from sklearn.metrics import silhouette_score
silhouette_score(x_train_copy, kmeans.labels_)

# --- Code cell 38 ---
x_train['cluster_labels'].value_counts()

# --- Code cell 41 ---
x_train['booking_status'].value_counts()

# --- Code cell 42 ---
print("cancellation rate in data:", 100*11884/(11884 + 24388))

# --- Code cell 43 ---
cluster_number = []
cancellation_rate = []

for z in range(len(list(x_train['cluster_labels'].unique()))):
               cluster_number.append(z)
               temp = x_train[x_train['cluster_labels']==z]
               temp_cancelled = temp[temp['booking_status']=='Canceled']
               temp_not_cancelled = temp[temp['booking_status']=='Not_Canceled']
               cancel = (len(temp_cancelled)/len(temp))*100
               cancellation_rate.append(cancel)

# --- Code cell 44 ---
temp = pd.DataFrame({'cluster':cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x = 'cluster',y = 'cancellation', data = temp)

# --- Code cell 46 ---
data['cluster'] = kmeans.labels_
data.to_csv('clustering_results.csv')

# --- Code cell 48 ---
# Check average value of numerical features across clusters

# --- Code cell 49 ---

plt.figure(figsize=(20, 12))

plt.subplot(3,3,1)
temp = pd.DataFrame(data.groupby('cluster')['no_of_adults'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_adults', data = temp)

plt.subplot(3,3,2)
temp = pd.DataFrame(data.groupby('cluster')['no_of_children'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_children', data = temp)

plt.subplot(3,3,3)
temp = pd.DataFrame(data.groupby('cluster')['no_of_weekend_nights'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_weekend_nights',data = temp)

plt.subplot(3,3,4)
temp = pd.DataFrame(data.groupby('cluster')['no_of_week_nights'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_week_nights', data = temp)

plt.subplot(3,3,5)
temp = pd.DataFrame(data.groupby('cluster')['required_car_parking_space'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'required_car_parking_space', data = temp)

plt.subplot(3,3,6)
temp = pd.DataFrame(data.groupby('cluster')['lead_time'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'lead_time',data = temp)

plt.subplot(3,3,7)
temp = pd.DataFrame(data.groupby('cluster')['avg_price_per_room'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'avg_price_per_room', data = temp)

plt.subplot(3,3,8)
temp = pd.DataFrame(data.groupby('cluster')['no_of_special_requests'].mean()).reset_index()
sns.barplot(x = 'cluster',y = 'no_of_special_requests', data = temp)

plt.show()

# --- Code cell 51 ---
# Check frequency of categorical features across clusters

# --- Code cell 52 ---

plt.figure(figsize=(20, 12))

plt.subplot(2,2,1)
sns.countplot(x='cluster', hue='type_of_meal_plan', data=data)

plt.subplot(2,2,2)
sns.countplot(x='cluster', hue='room_type_reserved', data=data)

plt.subplot(2,2,3)
sns.countplot(x='cluster', hue='market_segment_type', data=data)

plt.subplot(2,2,4)
sns.countplot(x='cluster', hue='repeated_guest', data=data)

plt.show()

# --- Code cell 55 ---
from sklearn.cluster import KMeans
wcss = [] 

for i in range(2, 30): 
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init= 'auto', random_state = 42)
    kmeans.fit(x_train_copy) 
    wcss.append(kmeans.inertia_)

# --- Code cell 57 ---
import matplotlib.pyplot as plt
K = range(2, 30)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Within cluster Sum of Squared distances')
plt.title('The Elbow Method')
plt.show()

Complete code from course notebook: agglomerative_clustering.ipynb

Every line of code from the course notebook (verbatim).

# --- Code cell 1 ---
import pandas as pd

# --- Code cell 2 ---
# read original data and feature matrix

# --- Code cell 3 ---
x_train = pd.read_csv("x_train.csv")
data =  pd.read_csv("Hotel Reservations.csv")

# --- Code cell 4 ---
# reduce the number of rows in data if you face memory issues
# This is just to see end to end execution - not a recommended step

x_train = x_train[0:10000]
data = data[0:10000]

# --- Code cell 5 ---
x_train_copy = x_train.copy()

# --- Code cell 7 ---
# Try agglomerative clustering with cosine distance metric and distance threshold as input
# You can also specify n_clusters and set distance_threshold to None
# You can try different distance metrics and linkage criterias

# --- Code cell 8 ---
from sklearn.cluster import AgglomerativeClustering


clustering = AgglomerativeClustering( n_clusters = None,
                                      linkage = 'complete',
                                      distance_threshold = 0.5, # if n_clusters is number then this should be None
                                      metric = 'cosine')

clustering.fit(x_train)
x_train['cluster_labels'] = clustering.labels_
x_train['booking_status'] = data['booking_status']
print(x_train['cluster_labels'].value_counts())

# --- Code cell 10 ---
# get cancellation rate in each cluster

# --- Code cell 11 ---
cluster_number = []
cancellation_rate = []

for z in range(len(list(x_train['cluster_labels'].unique()))):
               cluster_number.append(z)
               temp = x_train[x_train['cluster_labels']==z]
               temp_cancelled = temp[temp['booking_status']=='Canceled']
               temp_not_cancelled = temp[temp['booking_status']=='Not_Canceled']
               cancel = (len(temp_cancelled)/len(temp))*100
               cancellation_rate.append(cancel)

# --- Code cell 13 ---
import seaborn as sns
temp = pd.DataFrame({'cluster':cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x = 'cluster',y = 'cancellation', data = temp)

# --- Code cell 15 ---
# get silehoutee score

# --- Code cell 16 ---
from sklearn.metrics import silhouette_score
silhouette_score(x_train_copy, clustering.labels_)

💭 Short reflection

In one sentence: why would you choose DBSCAN over K-Means when your dataset has many outliers or oddly shaped clusters? (Think: density, noise, and not having to pick K.)

Core & Non-Core Points โ€“ Mastery Checklist

To become a master of clustering, you must know every core point. Non-core points make you stand out in interviews and real projects.

✅ CORE (Must know - exam & job essentials)

  • Clustering is unsupervised โ€“ no labels; algorithm finds groups.
  • K-Means: choose K, initialize K centroids, assign each point to nearest centroid, update centroid to mean of its points, repeat until convergence.
  • Intra-cluster distance (WCSS/Inertia): sum of squared distances of points to their centroid; minimize it.
  • Inter-cluster distance: distance between clusters; good clustering = low intra, high inter.
  • Choosing K: Elbow method (plot inertia vs K; pick the elbow) and Silhouette score (higher = better).
  • Always scale features before K-Means (distance-based).
  • K-Means is sensitive to outliers; handle with removal, capping, RobustScaler, or DBSCAN.
  • DBSCAN: density-based; finds arbitrary shapes and labels outliers as -1; no need to set K.
  • Hierarchical clustering: dendrogram; can cut at desired K; good for small/medium data.
  • Compare algorithms: K-Means (spherical, needs K), Hierarchical (any shape, optional K), DBSCAN (any shape, outliers, no K).

📚 NON-CORE (Good to know - depth & interviews)

  • Convergence: stop when centroid movement is below threshold or max iterations reached.
  • K-Means++: smarter initialization; reduces bad local minima and improves stability.
  • Distance metrics: Euclidean (default), Manhattan for grid-like or high-dimensional data.
  • Silhouette interpretation: +1 (ideal), 0 (boundary), -1 (wrong cluster).
  • Multiple runs: run K-Means with different random seeds and pick the best inertia or silhouette.
  • Dendrogram: read height as "distance when clusters merge"; cut where the gap is large.
  • DBSCAN parameters: eps (max distance for neighbors), min_samples (min points to form core).
  • GMM (Gaussian Mixture): soft clustering (probabilities); can capture ellipsoidal clusters.
  • Use clustering for exploration (discover segments) and preprocessing (e.g. cluster then model per cluster).

Summary

Concept Key Points
Clustering Unsupervised learning - finds natural groups without labels
K-Means Fast, simple; finds K spherical clusters using centroids
Choosing K Elbow method (inertia) or Silhouette score
DBSCAN Density-based; finds any shape, detects outliers
Scaling ALWAYS scale your data before clustering!

🎯 Pro Tips

  • Always scale features - clustering is distance-based!
  • Try multiple K values - use both elbow and silhouette
  • Visualize results - even if just a 2D PCA plot (see the sketch after this list)
  • Interpret clusters - give them meaningful business names
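
For the visualization tip, a minimal sketch that projects scaled features to 2D with PCA, reusing X_scaled and clusters from the K-Means example in Part 2:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X_2d = PCA(n_components=2).fit_transform(X_scaled)  # project to 2 components
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap='viridis')
plt.title('Clusters projected to 2D with PCA')
plt.show()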

Want to see the code?

Every single line of the K-Means notebook explained like you are 5 years old.

K-Means Code Walkthrough