K-Means Clustering Code Walkthrough

Every single line of the K-Means notebook explained like you are 5 years old. We cluster hotel bookings to find hidden patterns.

Download the dataset first: Hotel Reservations.csv — Save it in the same folder as your script so pd.read_csv("Hotel Reservations.csv") works.

What We'll Cover

  1. Step 1 – Import Libraries (the toolbox)
  2. Step 2 – Load & Explore the Data (look before you leap)
  3. Step 3 – Clean the Data (remove garbage)
  4. Step 4 – Pick Features (choose what matters)
  5. Step 5 – One-Hot Encode Categories (turn words into numbers)
  6. Step 6 – Scale the Data (make everything fair)
  7. Step 7 – Run K-Means (the magic grouping)
  8. Step 8 – Evaluate with Silhouette Score (how good are our groups?)
  9. Step 9 – Analyze Cluster Patterns (what did we find?)
  10. Step 10 – Visualize Clusters (charts!)
  11. Step 11 – The Elbow Method (pick the right K)

STEP 1

Import Libraries

Before we write any data science code, we need to load the tools (libraries). Think of it like opening a toolbox before fixing something.

from IPython.core.display import HTML

HTML("""
<style>
h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")

Line-by-line:

  • from IPython.core.display import HTML — This loads a special tool that lets you inject HTML/CSS into a Jupyter notebook. It's purely cosmetic.
  • HTML("""...""") — This injects CSS styles to make <h1> headings blue and <h2> headings green inside the notebook. It has zero effect on the data science. It just makes the notebook look prettier.

Why is this here?

The instructor likes pretty colors in the notebook. You can skip this cell entirely — it doesn't touch your data or your clustering. It's like painting your toolbox before opening it.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Line-by-line:

  • import pandas as pd — pandas is the #1 library for working with tables (rows & columns) in Python. as pd means "I'll call it pd for short instead of typing pandas every time."
  • import seaborn as sns — seaborn makes beautiful charts with one line of code. It sits on top of matplotlib. sns is its nickname (named after the character Samuel Norman Seaborn).
  • import matplotlib.pyplot as plt — matplotlib is the original charting library. pyplot is the part we use most. plt is its nickname.

Analogy

pandas = your Excel spreadsheet tool. seaborn = your fancy graph designer. matplotlib = the engine that actually draws the graphs on screen.

STEP 2

Load & Explore the Data

We read the hotel reservations CSV file and look at what's inside.

data = pd.read_csv("Hotel Reservations.csv")

What this does:

  • pd.read_csv("Hotel Reservations.csv") — Opens the CSV file (a spreadsheet saved as text, where commas separate each column) and reads every row and column into Python.
  • data = ... — We store the entire table in a variable called data. Now data is a DataFrame — pandas' word for "a table with rows and columns."

Common error: If you get FileNotFoundError, it means the CSV file is not in the same folder as your notebook. Move it there, or use the full path like pd.read_csv("/Users/you/Downloads/Hotel Reservations.csv").
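
If you want a friendlier failure, here is a tiny optional sketch (not part of the original notebook) that checks for the file before reading it:

from pathlib import Path

csv_path = Path("Hotel Reservations.csv")
if not csv_path.exists():
    # Fail early with a hint about where Python actually looked
    raise FileNotFoundError(f"CSV not found in {csv_path.resolve().parent}")
data = pd.read_csv(csv_path)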

Looking at the Data

data.head(100)

What this does:

  • .head(100) — Shows the first 100 rows of the table. If you write .head() with no number, it shows the first 5 rows. This is your first glance at the data — like flipping open a book to the first page.
  • You'll see columns like: Booking_ID, no_of_adults, no_of_children, lead_time, avg_price_per_room, booking_status, etc.

data.info()

What this does:

  • Prints a summary of the table: how many rows (36,275), how many columns (19), each column's name, how many non-null values, and the data type (int64 = whole number, float64 = decimal, object = text).
  • Why? To check for missing values (if Non-Null Count < 36275, data is missing) and to see which columns are numbers vs. text.
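
If you'd rather see the missing values directly instead of comparing counts by eye, one extra line does it (an optional check, not in the original notebook):

print(data.isnull().sum())   # missing values per column; all zeros = nothing missing
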
data.describe(include='all')

What this does:

  • Shows statistics for every column: count, mean, std (standard deviation), min, 25%, 50% (median), 75%, max for numbers; count, unique, top, freq for text columns.
  • include='all' — Without this, it only shows number columns. With 'all', it includes text columns too.
  • Why? To spot weirdness: Is the max value of no_of_children = 10? That might be an outlier (mistake).

Checking Each Column's Values

print(data['type_of_meal_plan'].value_counts())
print(data['room_type_reserved'].value_counts())
print(data['market_segment_type'].value_counts())
print(data['booking_status'].value_counts())
print(data['required_car_parking_space'].value_counts())
print(data['repeated_guest'].value_counts())
print(data['no_of_previous_bookings_not_canceled'].value_counts().head(10))
print(data['no_of_adults'].value_counts())
print(data['no_of_children'].value_counts())

What .value_counts() does:

  • For any column, it counts how many times each unique value appears and sorts from most common to least common.
  • Example: data['booking_status'].value_counts() might show: Not_Canceled: 24,390 and Canceled: 11,885.
  • Why run this for every column? To understand what values exist, spot outliers, and decide which columns are useful for clustering.
  • Key finding: no_of_children has values 9 and 10 — that looks like a data entry error (outlier). Only ~3% are repeat customers, so previous booking history won't help much.

Analogy

Imagine you have a giant bag of Skittles. value_counts() is like sorting them by color and counting: "42 red, 38 green, 35 yellow..." — You instantly see which color is most common and if there's a weird one you've never seen before.
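
Tip: if you want percentages instead of raw counts, value_counts can normalize for you. A small optional sketch:

print(data['booking_status'].value_counts(normalize=True) * 100)   # share of each status in %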

STEP 3

Clean the Data (Remove Outliers)

We spotted that no_of_children has values 9 and 10 — likely data entry errors. No hotel booking realistically has 9–10 children. Let's remove them.

data = data[data['no_of_children'] <= 3]
data.reset_index(drop=True, inplace=True)

Line-by-line:

  • data[data['no_of_children'] <= 3] — This is a filter. It says "keep only rows where the number of children is 3 or less." All rows with 9 or 10 children get thrown away.
  • How it works inside: data['no_of_children'] <= 3 creates a list of True/False for every row. Then data[...] keeps only the True rows.
  • data = ... — We overwrite the old data with the cleaned version.
  • .reset_index(drop=True, inplace=True) — After removing rows, the row numbers (index) have gaps (e.g., 0, 1, 2, 5, 8...). This re-numbers them cleanly as 0, 1, 2, 3, 4... drop=True discards the old numbering instead of keeping it as an extra "index" column, and inplace=True means "change the existing DataFrame, don't create a new copy."

Why remove outliers?

If one person says they booked a hotel for 10 children, that number is SO different from everyone else that K-Means will create a cluster JUST for that one weirdo. That's not useful. Removing outliers keeps the algorithm focused on real patterns.
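
A quick sanity check after filtering never hurts. Something like this (not in the original notebook) confirms the outliers are gone:

print(data['no_of_children'].max())   # should now be 3 at most
print(len(data))                      # a few rows fewer than the original 36,275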

STEP 4

Pick the Features (Columns We'll Use)

Not every column is useful. We pick the ones that describe the behavior of a booking.

numerical_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'required_car_parking_space', 'lead_time',
       'repeated_guest', 'avg_price_per_room', 'no_of_special_requests']

What this does:

  • Creates a Python list called numerical_features containing the names of 9 columns that are numbers.
  • Why these? They describe how the guest booked: how many people, how many nights, how early they booked (lead_time), the price, and how demanding they are (special_requests).
  • Why not Booking_ID? It's just a label ("INN00001") — it has no meaning. Including it would confuse the algorithm.
  • Why not arrival_year, arrival_month, arrival_date? The instructor chose to exclude calendar info and focus on booking behavior. You could include them — it's a design choice.

categorical_features = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']

What this does:

  • Creates a list of 3 columns that contain text (categories), not numbers.
  • type_of_meal_plan — e.g., "Meal Plan 1", "Not Selected"
  • room_type_reserved — e.g., "Room_Type 1", "Room_Type 6"
  • market_segment_type — e.g., "Online", "Offline"
  • Problem: K-Means only understands numbers. We'll need to convert these text values into numbers later (Step 5).

data_features = data[numerical_features + categorical_features + ['booking_status']]
data_features.head(10)

Line-by-line:

  • numerical_features + categorical_features + ['booking_status'] — In Python, adding lists glues them together. So this creates one big list of all 13 column names.
  • data[...] — Selects only those 13 columns from the full 19-column table. We store this smaller table in data_features.
  • Why include booking_status? We won't feed it to K-Means (clustering is unsupervised — no labels!). But we'll keep it nearby so we can check later: "Do certain clusters have higher cancellation rates?"

x_train = data_features[numerical_features + categorical_features]

# There is nothing to predict in clustering
# We are just storing booking status flag in another variable to
# check later if the clusters have some pattern w.r.t booking cancellation
y_label = data_features[['booking_status']]

Line-by-line:

  • x_train = ... — This is the data we will actually feed to K-Means. It has 12 columns (9 numerical + 3 categorical). No booking_status.
  • y_label = data_features[['booking_status']] — We save booking_status separately. Double brackets [[...]] means "give me a DataFrame (table), not a Series (single column)." This is purely for analysis later.
  • Key concept: In clustering, there is NO y (target/label). Unlike regression or classification where you predict something, clustering just groups similar things together. y_label here is only a convenience for post-hoc analysis.

Analogy

Imagine sorting laundry without reading the labels. You group clothes by color, size, and fabric — that's clustering. The "brand" tag (booking_status) is hidden in your pocket; you'll peek at it later to see if your groups accidentally separated Nike from Adidas.

STEP 5

One-Hot Encode Categorical Features

K-Means only understands numbers, but type_of_meal_plan has values like "Meal Plan 1", "Not Selected"... We need to convert these to numbers.

x_train = pd.get_dummies(x_train, columns=categorical_features, drop_first=False)
print(x_train.columns)

Line-by-line:

  • pd.get_dummies() — This is one-hot encoding. For each category value, it creates a new column that is either 0 or 1.
  • columns=categorical_features — Only convert these 3 columns. Leave the numerical ones alone.
  • drop_first=False — Keep all dummy columns (don't drop the first category). For clustering, we usually keep all; for regression, we'd drop one to avoid multicollinearity.

What does one-hot encoding look like?

Before: type_of_meal_plan = "Meal Plan 1"

After: type_of_meal_plan_Meal Plan 1 = 1, type_of_meal_plan_Meal Plan 2 = 0, type_of_meal_plan_Not Selected = 0

Each category gets its own column. If that row IS that category, it's 1. Otherwise 0. Now everything is a number!
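
To see this on a toy table, here is a tiny self-contained sketch (made-up rows, purely for illustration):

import pandas as pd

toy = pd.DataFrame({'type_of_meal_plan': ['Meal Plan 1', 'Not Selected', 'Meal Plan 2']})
print(pd.get_dummies(toy, columns=['type_of_meal_plan']))
# Each row gets a 1 in its own category's column and 0 everywhere else

Depending on your pandas version the dummy columns may print as True/False rather than 1/0; K-Means treats them the same either way.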

Analogy

Instead of writing "favorite color = blue", you write: "is_red = 0, is_blue = 1, is_green = 0". Computers love this format.

STEP 6

Scale the Data (MinMaxScaler)

This is critical for K-Means. Here's why:

Problem: lead_time ranges from 0 to 443 days. no_of_children ranges from 0 to 3. If we don't scale, K-Means will think lead_time is 100x more important just because its numbers are bigger. That's unfair!

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_train.describe()
x_train_copy = x_train.copy()

Line-by-line:

  • from sklearn.preprocessing import MinMaxScaler — Import the MinMaxScaler tool from scikit-learn (the go-to machine learning library).
  • scaler = MinMaxScaler() — Create a scaler object. It's like loading a calculator that knows how to rescale numbers.
  • scaler.fit_transform(x_train[numerical_features]) — Two things happen here:
    • fit: The scaler looks at each column and finds the min and max values.
    • transform: It rescales every value using the formula: new_value = (old_value - min) / (max - min). After this, every value is between 0 and 1.
  • x_train[numerical_features] = ... — Overwrite the original numbers with the scaled versions.
  • x_train.describe() — Print stats to confirm: min should be 0, max should be 1 for every numerical column.
  • x_train_copy = x_train.copy() — Save a copy of the scaled data before we add cluster labels. We'll need the "clean" version later for the elbow method and silhouette score.

Analogy

Imagine comparing heights (in cm) and weights (in kg). A person who is 180 cm and 80 kg looks "farther" from 160 cm / 70 kg mostly because of the height difference (20 vs 10). MinMaxScaler puts both on a 0-to-1 scale, so they're equally important. Now height 0.67 vs 0.33 and weight 0.8 vs 0.6 — fair comparison!
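
Here is the formula in action on a tiny made-up example (a sketch, assuming lead times between 0 and 443 like in our data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

lead_times = np.array([[0.0], [100.0], [443.0]])    # min = 0, max = 443
print(MinMaxScaler().fit_transform(lead_times))     # [[0.], [0.2257...], [1.]]
# Same as doing it by hand: (100 - 0) / (443 - 0) ≈ 0.2257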

x_train_copy.to_csv("x_train.csv", index=False)

What this does:

  • Saves the scaled feature data to a CSV file. index=False means don't include the row numbers as a column. This is optional — the instructor saved it for reference or reuse.

STEP 7

Run K-Means Clustering

This is the main event. We tell the computer: "Please split these 36,000+ bookings into 10 groups."

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto").fit(x_train)
x_train['cluster_labels'] = kmeans.labels_
x_train['booking_status'] = y_label['booking_status']

Line-by-line:

  • from sklearn.cluster import KMeans — Import the K-Means algorithm from scikit-learn.
  • KMeans(n_clusters=10, ...) — Create a K-Means model that will make 10 clusters (groups numbered 0–9).
  • random_state=0 — K-Means starts with random center points. Setting random_state=0 makes it use the same random starting point every time, so you get reproducible results. Run it today or tomorrow — same answer.
  • n_init="auto" — K-Means will automatically decide how many times to run with different starting points and pick the best result. (Older scikit-learn versions defaulted to 10 runs.)
  • .fit(x_train) — This is where the magic happens. The algorithm:
    1. Randomly places 10 center points in the data space.
    2. Assigns each booking to the nearest center (using Euclidean distance).
    3. Recalculates each center as the average of all bookings assigned to it.
    4. Repeats steps 2–3 until the centers stop moving (convergence). (A toy version of this loop is sketched in code right after this list.)
  • kmeans.labels_ — After fitting, this attribute contains a number (0–9) for each row, telling which cluster that booking belongs to.
  • x_train['cluster_labels'] = kmeans.labels_ — Adds a new column to our table showing each booking's cluster.
  • x_train['booking_status'] = y_label['booking_status'] — Adds back the booking status column so we can analyze cancellation patterns per cluster.
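
If you're curious what .fit() actually does, here is a bare-bones NumPy sketch of the loop described above. It's educational only; the real scikit-learn version is far more optimized and uses the smarter k-means++ initialization (see Step 11):

import numpy as np

def mini_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Start with k random data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each center to the average of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. Stop when the centers no longer move (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = mini_kmeans(np.random.default_rng(1).random((200, 2)), k=3)
print(np.bincount(labels))   # how many points landed in each cluster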

Analogy

Imagine you throw 36,000 balls on a football field. You place 10 flags randomly. Each ball rolls to its nearest flag. Then you move each flag to the center of its ball pile. Balls re-roll to the nearest flag. Repeat until flags stop moving. Now you have 10 neat piles. Each pile is a "cluster."

STEP 8

Evaluate with Silhouette Score

How do we know if our 10 clusters are any good?

from sklearn.metrics import silhouette_score

silhouette_score(x_train_copy, kmeans.labels_)

Line-by-line:

  • from sklearn.metrics import silhouette_score — Import the scoring function.
  • silhouette_score(x_train_copy, kmeans.labels_) — Computes a score from -1 to +1:
    • +1 = perfect clusters (each point is far from other clusters, close to its own)
    • 0 = overlapping clusters (points are on the boundary)
    • -1 = terrible (points are in the wrong clusters)
  • We pass x_train_copy (the clean data without cluster labels) and kmeans.labels_ (the cluster assignments).
  • A score of 0.15–0.3 is common for real-world data. Don't expect 0.9 — real data is messy.

Analogy

Silhouette score asks each student in class: "Are you sitting closer to your friends (same cluster) or closer to the other group?" If everyone says "I'm way closer to my friends," score is high. If people are confused about which group they belong to, score is low.
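
One practical note (an optional tweak, not in the original notebook): silhouette_score compares every point with every other point, which gets slow on 36,000+ rows. Scoring a random sample gives a close estimate much faster:

silhouette_score(x_train_copy, kmeans.labels_, sample_size=5000, random_state=0)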

x_train['cluster_labels'].value_counts()

What this does:

  • Counts how many bookings ended up in each cluster. You want clusters to be reasonably sized. If one cluster has 30,000 bookings and another has 5, that's suspicious — the algorithm barely split anything.

STEP 9

Analyze Cluster Patterns (Cancellation Rates)

Now the fun part: what did the clusters find? Let's check if some clusters have higher cancellation rates.

x_train['booking_status'].value_counts()

What this does:

  • Shows the overall count: how many bookings were canceled vs not canceled across ALL data. This is the baseline.
print("cancellation rate in data:", 100 * 11884 / (11884 + 24388))

What this does:

  • Calculates the overall cancellation rate: 11,884 canceled out of 36,272 total = about 32.8%.
  • This is our "baseline" — if a cluster has a much higher or lower cancellation rate, that's interesting!

Calculate Cancellation Rate Per Cluster

cluster_number = []
cancellation_rate = []

for z in range(len(list(x_train['cluster_labels'].unique()))):
    cluster_number.append(z)
    temp = x_train[x_train['cluster_labels'] == z]
    temp_cancelled = temp[temp['booking_status'] == 'Canceled']
    temp_not_cancelled = temp[temp['booking_status'] == 'Not_Canceled']
    cancel = (len(temp_cancelled) / len(temp)) * 100
    cancellation_rate.append(cancel)

Line-by-line:

  • cluster_number = [] — Create an empty list. We'll fill it with cluster numbers (0, 1, 2, ... 9).
  • cancellation_rate = [] — Another empty list for each cluster's cancellation percentage.
  • for z in range(len(list(x_train['cluster_labels'].unique()))): — This is a loop that says "for each unique cluster label, do the following." Let's break it:
    • .unique() → gets unique values: [0, 1, 2, ..., 9]
    • list(...) → converts to a Python list
    • len(...) → counts them: 10
    • range(10) → loop from 0 to 9
  • cluster_number.append(z) — Add the current cluster number (0, 1, 2...) to the list.
  • temp = x_train[x_train['cluster_labels'] == z] — Filter the data to only rows belonging to cluster z.
  • temp_cancelled = temp[temp['booking_status'] == 'Canceled'] — From that cluster, keep only canceled bookings.
  • temp_not_cancelled = temp[temp['booking_status'] == 'Not_Canceled'] — Same filter for the not-canceled rows. (This variable is never used afterwards, so the line is harmless but unnecessary.)
  • cancel = (len(temp_cancelled) / len(temp)) * 100 — Calculate: (number of cancellations / total bookings in cluster) × 100 = cancellation percentage.
  • cancellation_rate.append(cancel) — Save the percentage.
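
The loop works, but pandas can do the same calculation in one line with groupby (an equivalent sketch, shown for comparison):

# Percentage of 'Canceled' bookings within each cluster (same numbers as the loop)
print(x_train.groupby('cluster_labels')['booking_status']
             .apply(lambda s: (s == 'Canceled').mean() * 100))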

Plot the Cancellation Rates

temp = pd.DataFrame({'cluster': cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x='cluster', y='cancellation', data=temp)

Line-by-line:

  • pd.DataFrame({...}) — Create a new table with 2 columns: cluster (0–9) and cancellation (the percentage).
  • sns.barplot(x='cluster', y='cancellation', data=temp) — Draw a bar chart. Each bar is one cluster, and its height is the cancellation rate. You can instantly see which clusters cancel more!
  • Insight: If clusters 3 and 7 have 60% cancellation vs the baseline 33%, those groups of guests are high-risk. A hotel could offer them discounts or reminders.

data['cluster'] = kmeans.labels_
data.to_csv('clustering_results.csv')

What this does:

  • Adds the cluster label (0–9) back to the original unscaled data.
  • Saves everything to a CSV file so you can open it in Excel and explore the clusters yourself.

STEP 10

Visualize Cluster Characteristics

Now we look at what makes each cluster different. We compare both numerical and categorical features across clusters.

Numerical Features per Cluster

plt.figure(figsize=(20, 12))

plt.subplot(3, 3, 1)
temp = pd.DataFrame(data.groupby('cluster')['no_of_adults'].mean()).reset_index()
sns.barplot(x='cluster', y='no_of_adults', data=temp)

plt.subplot(3, 3, 2)
temp = pd.DataFrame(data.groupby('cluster')['no_of_children'].mean()).reset_index()
sns.barplot(x='cluster', y='no_of_children', data=temp)

# ... same pattern for: no_of_weekend_nights, no_of_week_nights,
# required_car_parking_space, lead_time, avg_price_per_room,
# no_of_special_requests

plt.show()

Line-by-line (same pattern repeated 8 times):

  • plt.figure(figsize=(20, 12)) — Create a big canvas (20 inches wide, 12 tall) to hold multiple charts.
  • plt.subplot(3, 3, 1) — Divide the canvas into a 3×3 grid of charts. 1 means "put this chart in position 1 (top-left)."
  • data.groupby('cluster')['no_of_adults'].mean() — Group all rows by cluster, then calculate the average number of adults in each cluster.
  • .reset_index() — Turn the grouped result back into a regular table (with cluster as a column, not an index).
  • sns.barplot(...) — Draw a bar chart comparing the average across clusters.
  • plt.show() — Display all 8 charts at once.
  • What you learn: "Cluster 5 has families (high children count), Cluster 2 books far in advance (high lead_time), Cluster 8 pays the most (high avg_price_per_room)."
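
Instead of copy-pasting that block eight times, you could loop over the column names (a compact rewrite of the same charts, using the feature names from Step 4):

plot_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
             'no_of_week_nights', 'required_car_parking_space', 'lead_time',
             'avg_price_per_room', 'no_of_special_requests']

plt.figure(figsize=(20, 12))
for i, col in enumerate(plot_cols, start=1):
    plt.subplot(3, 3, i)   # position i in the 3x3 grid
    temp = data.groupby('cluster')[col].mean().reset_index()
    sns.barplot(x='cluster', y=col, data=temp)
plt.show()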

Categorical Features per Cluster

plt.figure(figsize=(20, 12))

plt.subplot(2, 2, 1)
sns.countplot(x='cluster', hue='type_of_meal_plan', data=data)

plt.subplot(2, 2, 2)
sns.countplot(x='cluster', hue='room_type_reserved', data=data)

plt.subplot(2, 2, 3)
sns.countplot(x='cluster', hue='market_segment_type', data=data)

plt.subplot(2, 2, 4)
sns.countplot(x='cluster', hue='repeated_guest', data=data)

plt.show()

Line-by-line:

  • sns.countplot(x='cluster', hue='type_of_meal_plan', data=data) — A count plot with clusters on the x-axis and colored bars for each meal plan type. Unlike barplot (which shows averages), countplot shows raw counts — how many bookings in each cluster chose each meal plan.
  • hue=... — "Color the bars by this category." Each meal plan gets a different color within each cluster.
  • What you learn: "Cluster 0 is all Meal Plan 1 people, Cluster 4 is dominated by Online bookings, Cluster 9 books Room Type 4..."

STEP 11

The Elbow Method (Choose the Right K)

We used K=10 earlier, but how do we know 10 is the right number? The Elbow Method helps us decide.

Important: In practice, you should run the Elbow Method BEFORE finalizing your clusters, not after. The notebook does it at the end for educational purposes.

from sklearn.cluster import KMeans

wcss = []

for i in range(2, 30):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init='auto', random_state=42)
    kmeans.fit(x_train_copy)
    wcss.append(kmeans.inertia_)

Line-by-line:

  • wcss = [] — Create an empty list to store the "Within-Cluster Sum of Squares" for each value of K. WCSS measures how spread out the points are within their clusters — lower = tighter clusters.
  • for i in range(2, 30): — Try K = 2, 3, 4, 5, ... all the way to 29. That's 28 different experiments!
  • KMeans(n_clusters=i, init='k-means++', ...) — Create a K-Means model with i clusters. 'k-means++' is a smarter way to pick initial center points (avoids putting two centers right next to each other).
  • kmeans.fit(x_train_copy) — Run K-Means on the clean scaled data. We use x_train_copy because x_train now has extra columns (cluster_labels, booking_status).
  • kmeans.inertia_ — After fitting, this is the WCSS value. It's the sum of (distance from each point to its cluster center)² for all points.
  • wcss.append(...) — Save the WCSS for this K value.

Plot the Elbow Curve

import matplotlib.pyplot as plt

K = range(2, 30)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Within cluster Sum of Squared distances')
plt.title('The Elbow Method')
plt.show()

Line-by-line:

  • K = range(2, 30) — The x-axis values: 2, 3, 4, ..., 29.
  • plt.plot(K, wcss, 'bx-') — Plot K (x-axis) vs WCSS (y-axis). 'bx-' means: b = blue, x = X markers, - = connect with lines.
  • plt.xlabel(...), plt.ylabel(...), plt.title(...) — Label the axes and title.
  • plt.show() — Display the chart.

How to read the Elbow Chart

The chart looks like a bent arm. WCSS always goes down as K increases (more clusters = smaller groups = less spread). But at some point, adding more clusters barely helps — that's the "elbow" where the curve bends sharply.

Look for the elbow: If the curve bends around K=5 or K=8, that's your sweet spot. Before the elbow: too few clusters (big, messy groups). After the elbow: too many clusters (splitting hairs for no reason).
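
Eyeballing the bend is perfectly normal, but if you want a rough numeric hint, the second difference of the WCSS curve peaks near the elbow. A crude heuristic sketch, assuming the wcss list from above:

import numpy as np

bends = np.diff(wcss, n=2)                                 # how sharply the curve bends at each K
elbow_k = range(2, 30)[int(np.argmax(bends)) + 1]          # +1 re-aligns the diff with K values
print("suggested elbow around K =", elbow_k)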

Analogy

Imagine organizing a classroom into study groups. With 2 groups, kids are very different from each other within each group (high WCSS). With 30 groups of 1 person each, WCSS = 0 (everyone is perfectly grouped — with themselves!). The sweet spot is somewhere in between — say 5 groups where kids are similar enough within each group but you haven't over-fragmented.

SUMMARY

The Full Pipeline in Plain English

What We Did (7 Steps)

  1. Loaded hotel booking data (36,275 rows, 19 columns).
  2. Explored it with head(), info(), describe(), and value_counts().
  3. Cleaned it by removing outlier rows (children > 3).
  4. Selected features (9 numerical + 3 categorical). Kept booking_status aside for later analysis.
  5. One-hot encoded the 3 text columns so K-Means can understand them.
  6. Scaled numerical columns to 0–1 so no single column dominates.
  7. Ran K-Means with K=10, checked the silhouette score, then analyzed which clusters have high/low cancellation rates.

Key Takeaways

  • Clustering is unsupervised — no labels, no "right answer." The algorithm finds natural groups.
  • Always scale before K-Means — otherwise big-number columns bully small-number columns.
  • Use the Elbow Method to pick K — don't just guess.
  • Silhouette Score tells you how well-separated clusters are (higher = better).
  • The real value is in the analysis AFTER clustering — what makes each cluster unique? Can you act on it?