Every single line of the K-Means notebook explained like you are 5 years old. We cluster hotel bookings to find hidden patterns.
Before running anything, keep the Hotel Reservations.csv file in the same folder as the notebook so that pd.read_csv("Hotel Reservations.csv") works.
Before we write any data science code, we need to load the tools (libraries). Think of it like opening a toolbox before fixing something.
from IPython.core.display import HTML

HTML("""
<style>
h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")
from IPython.core.display import HTML — This loads a special tool that lets you inject HTML/CSS into a Jupyter notebook. It's purely cosmetic.

HTML("""...""") — This injects CSS styles to make <h1> headings blue and <h2> headings green inside the notebook. It has zero effect on the data science. It just makes the notebook look prettier.

The instructor likes pretty colors in the notebook. You can skip this cell entirely — it doesn't touch your data or your clustering. It's like painting your toolbox before opening it.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd — pandas is the #1 library for working with tables (rows & columns) in Python. as pd means "I'll call it pd for short instead of typing pandas every time."

import seaborn as sns — seaborn makes beautiful charts with one line of code. It sits on top of matplotlib. sns is its nickname (named after the character Samuel Norman Seaborn).

import matplotlib.pyplot as plt — matplotlib is the original charting library. pyplot is the part we use most. plt is its nickname.

pandas = your Excel spreadsheet tool. seaborn = your fancy graph designer. matplotlib = the engine that actually draws the graphs on screen.
We read the hotel reservations CSV file and look at what's inside.
data = pd.read_csv("Hotel Reservations.csv")
pd.read_csv("Hotel Reservations.csv") — Opens the CSV file (a spreadsheet saved as text, where commas separate each column) and reads every row and column into Python.

data = ... — We store the entire table in a variable called data. Now data is a DataFrame — pandas' word for "a table with rows and columns."

Common error: If you get FileNotFoundError, it means the CSV file is not in the same folder as your notebook. Move it there, or use the full path like pd.read_csv("/Users/you/Downloads/Hotel Reservations.csv").
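If you want a friendlier failure than the raw traceback, here's a minimal sketch that checks for the file first (the hint message is just illustrative):

from pathlib import Path
import pandas as pd

csv_path = Path("Hotel Reservations.csv")
if not csv_path.exists():
    # Fail early with a hint instead of a bare FileNotFoundError
    raise FileNotFoundError(f"Put the CSV next to this notebook, or pass a full path: {csv_path.resolve()}")
data = pd.read_csv(csv_path)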
data.head(100)
.head(100) — Shows the first 100 rows of the table. If you write .head() with no number, it shows the first 5 rows. This is your first glance at the data — like flipping open a book to the first page.

You'll see columns like Booking_ID, no_of_adults, no_of_children, lead_time, avg_price_per_room, booking_status, etc.

data.info()
.info() — Lists every column with its count of non-null values and its data type (int64 = whole number, float64 = decimal, object = text). It's the quickest way to spot missing values and wrongly typed columns.

data.describe(include='all')
.describe() — Prints summary statistics (count, mean, min, max, etc.) for each column.

include='all' — Without this, it only shows number columns. With 'all', it includes text columns too.

Scan the output for oddities: a maximum of no_of_children = 10? That might be an outlier (mistake).

print(data['type_of_meal_plan'].value_counts())
print(data['room_type_reserved'].value_counts())
print(data['market_segment_type'].value_counts())
print(data['booking_status'].value_counts())
print(data['required_car_parking_space'].value_counts())
print(data['repeated_guest'].value_counts())
print(data['no_of_previous_bookings_not_canceled'].value_counts().head(10))
print(data['no_of_adults'].value_counts())
print(data['no_of_children'].value_counts())
What .value_counts() does: it counts how many times each unique value appears in a column, sorted from most common to least. For example, data['booking_status'].value_counts() might show: Not_Canceled: 24,390 and Canceled: 11,885.

Key findings: no_of_children has values 9 and 10 — that looks like a data entry error (outlier). Only ~3% are repeat customers, so previous booking history won't help much.

Imagine you have a giant bag of Skittles. value_counts() is like sorting them by color and counting: "42 red, 38 green, 35 yellow..." — You instantly see which color is most common and if there's a weird one you've never seen before.
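If you'd rather see shares than raw counts, value_counts(normalize=True) does the division for you — a quick sketch on the same column:

# Fractions instead of counts; multiply by 100 for percentages
print(data['booking_status'].value_counts(normalize=True) * 100)
# Expect roughly 67% Not_Canceled / 33% Canceled, per the counts above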
We spotted that no_of_children has values 9 and 10 — likely data entry errors. No hotel booking realistically has 9–10 children. Let's remove them.
data = data[data['no_of_children'] <= 3]
data.reset_index(inplace=True)
data[data['no_of_children'] <= 3] — This is a filter. It says "keep only rows where the number of children is 3 or less." All rows with 9 or 10 children get thrown away.

Under the hood, data['no_of_children'] <= 3 creates a list of True/False for every row. Then data[...] keeps only the True rows.

data = ... — We overwrite the old data with the cleaned version.

.reset_index(inplace=True) — After removing rows, the row numbers (index) have gaps (e.g., 0, 1, 2, 5, 8...). This re-numbers them cleanly as 0, 1, 2, 3, 4... inplace=True means "change the existing DataFrame, don't create a new copy." (Small tip: reset_index(drop=True, inplace=True) would also discard the old row numbers instead of keeping them as an extra column.)

Why remove outliers? If one person says they booked a hotel for 10 children, that number is SO different from everyone else that K-Means will create a cluster JUST for that one weirdo. That's not useful. Removing outliers keeps the algorithm focused on real patterns.
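To see the True/False mask idea in isolation, here's a tiny self-contained demo on a made-up four-row table (the demo data is invented purely for illustration):

import pandas as pd

demo = pd.DataFrame({'no_of_children': [0, 2, 10, 1]})
mask = demo['no_of_children'] <= 3          # [True, True, False, True]
print(mask.tolist())
print(demo[mask])                           # the 10-children row is gone
demo = demo[mask].reset_index(drop=True)    # drop=True discards the old row numbers
print(demo)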
Not every column is useful. We pick the ones that describe the behavior of a booking.
numerical_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
                      'no_of_week_nights', 'required_car_parking_space', 'lead_time',
                      'repeated_guest', 'avg_price_per_room', 'no_of_special_requests']
This creates a list called numerical_features containing the names of 9 columns that are numbers. They describe who is staying, how long, how far in advance the booking was made (lead_time), the price, and how demanding they are (special_requests).

Why not Booking_ID? It's just a label ("INN00001") — it has no meaning. Including it would confuse the algorithm.

Why not arrival_year, arrival_month, arrival_date? The instructor chose to exclude calendar info and focus on booking behavior. You could include them — it's a design choice.

categorical_features = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
type_of_meal_plan — e.g., "Meal Plan 1", "Not Selected"
room_type_reserved — e.g., "Room_Type 1", "Room_Type 6"
market_segment_type — e.g., "Online", "Offline"

data_features = data[numerical_features + categorical_features + ['booking_status']]
data_features.head(10)
numerical_features + categorical_features + ['booking_status'] — In Python, adding lists glues them together. So this creates one big list of all 13 column names.

data[...] — Selects only those 13 columns from the full 19-column table. We store this smaller table in data_features.

Why keep booking_status? We won't feed it to K-Means (clustering is unsupervised — no labels!). But we'll keep it nearby so we can check later: "Do certain clusters have higher cancellation rates?"

x_train = data_features[numerical_features + categorical_features]

# There is nothing to predict in clustering.
# We are just storing the booking status flag in another variable to
# check later if the clusters have some pattern w.r.t. booking cancellation.
y_label = data_features[['booking_status']]
x_train = ... — This is the data we will actually feed to K-Means. It has 12 columns (9 numerical + 3 categorical). No booking_status.

y_label = data_features[['booking_status']] — We save booking_status separately. Double brackets [[...]] means "give me a DataFrame (table), not a Series (single column)." This is purely for analysis later.

Remember: clustering has no target to predict — y_label here is only a convenience for post-hoc analysis.

Imagine sorting laundry without reading the labels. You group clothes by color, size, and fabric — that's clustering. The "brand" tag (booking_status) is hidden in your pocket; you'll peek at it later to see if your groups accidentally separated Nike from Adidas.
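A couple of quick sanity checks can confirm the split did what we think (this assumes the x_train and y_label variables defined above):

# 12 feature columns in, 1 label column set aside
print(x_train.shape)   # (n_rows, 12)
print(y_label.shape)   # (n_rows, 1)

# The label must never leak into the clustering input
assert 'booking_status' not in x_train.columns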
K-Means only understands numbers, but type_of_meal_plan has values like "Meal Plan 1", "Not Selected"... We need to convert these to numbers.
x_train = pd.get_dummies(x_train, columns=categorical_features, drop_first=False)
print(x_train.columns)
pd.get_dummies() — This is one-hot encoding. For each category value, it creates a new column that is either 0 or 1.

columns=categorical_features — Only convert these 3 columns. Leave the numerical ones alone.

drop_first=False — Keep all dummy columns (don't drop the first category). For clustering, we usually keep all; for regression, we'd drop one to avoid multicollinearity.

Before: type_of_meal_plan = "Meal Plan 1"
After: type_of_meal_plan_Meal Plan 1 = 1, type_of_meal_plan_Meal Plan 2 = 0, type_of_meal_plan_Not Selected = 0
Each category gets its own column. If that row IS that category, it's 1. Otherwise 0. Now everything is a number!
Instead of writing "favorite color = blue", you write: "is_red = 0, is_blue = 1, is_green = 0". Computers love this format.
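Here's that favorite-color example as runnable code — a toy DataFrame invented just to show what pd.get_dummies() produces (recent pandas versions emit True/False columns, which behave like 1/0):

import pandas as pd

toy = pd.DataFrame({'favorite_color': ['blue', 'red', 'blue', 'green']})
print(pd.get_dummies(toy, columns=['favorite_color']))
#    favorite_color_blue  favorite_color_green  favorite_color_red
# 0                 True                 False               False
# 1                False                 False                True
# 2                 True                 False               False
# 3                False                  True               False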
Next comes feature scaling. This is critical for K-Means. Here's why:
Problem: lead_time ranges from 0 to 443 days. no_of_children ranges from 0 to 3. If we don't scale, K-Means will think lead_time is 100x more important just because its numbers are bigger. That's unfair!
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_train.describe()

x_train_copy = x_train.copy()
from sklearn.preprocessing import MinMaxScaler — Import the MinMaxScaler tool from scikit-learn (the go-to machine learning library).

scaler = MinMaxScaler() — Create a scaler object. It's like loading a calculator that knows how to rescale numbers.

scaler.fit_transform(x_train[numerical_features]) — Two things happen here: fit learns each column's min and max, then transform rescales every value using the formula
new_value = (old_value - min) / (max - min). After this, every value is between 0 and 1.

x_train[numerical_features] = ... — Overwrite the original numbers with the scaled versions.

x_train.describe() — Print stats to confirm: min should be 0, max should be 1 for every numerical column.

x_train_copy = x_train.copy() — Save a copy of the scaled data before we add cluster labels. We'll need the "clean" version later for the elbow method and silhouette score.

Imagine comparing heights (in cm) and weights (in kg). A person who is 180 cm and 80 kg looks "farther" from 160 cm / 70 kg mostly because of the height difference (20 vs 10). MinMaxScaler puts both on a 0-to-1 scale, so they're equally important. Now height 0.67 vs 0.33 and weight 0.8 vs 0.6 — fair comparison!
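You can verify the formula matches the scaler on a few made-up lead_time values (the numbers are invented for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[0.0], [100.0], [443.0]])
by_hand = (values - values.min()) / (values.max() - values.min())
by_scaler = MinMaxScaler().fit_transform(values)

print(by_hand.ravel())    # [0.     0.2257 1.    ] approximately
print(by_scaler.ravel())  # same numbers — the scaler applies exactly this formula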
x_train_copy.to_csv("x_train.csv", index=False)
index=False means don't include the row numbers as a column. This is optional — the instructor saved the scaled table for reference or reuse.

This is the main event. We tell the computer: "Please split these 36,000+ bookings into 10 groups."
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto").fit(x_train)
x_train['cluster_labels'] = kmeans.labels_
x_train['booking_status'] = y_label['booking_status']
from sklearn.cluster import KMeans — Import the K-Means algorithm from scikit-learn.

KMeans(n_clusters=10, ...) — Create a K-Means model that will make 10 clusters (groups numbered 0–9).

random_state=0 — K-Means starts with random center points. Setting random_state=0 makes it use the same random starting point every time, so you get reproducible results. Run it today or tomorrow — same answer.

n_init="auto" — K-Means will automatically decide how many times to run with different starting points and pick the best result. (Older scikit-learn versions defaulted to 10 runs.)

.fit(x_train) — This is where the magic happens. The algorithm places 10 centers, assigns every booking to its nearest center, moves each center to the average of its assigned bookings, and repeats until the centers stop moving.
kmeans.labels_ — After fitting, this attribute contains a number (0–9) for each row, telling which cluster that booking belongs to.

x_train['cluster_labels'] = kmeans.labels_ — Adds a new column to our table showing each booking's cluster.

x_train['booking_status'] = y_label['booking_status'] — Adds back the booking status column so we can analyze cancellation patterns per cluster.

Imagine you throw 36,000 balls on a football field. You place 10 flags randomly. Each ball rolls to its nearest flag. Then you move each flag to the center of its ball pile. Balls re-roll to the nearest flag. Repeat until flags stop moving. Now you have 10 neat piles. Each pile is a "cluster."
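To see the piles and flags in code, you can inspect the fitted model directly (this assumes the kmeans object and x_train columns from above):

# Each booking now carries a pile number 0-9:
print(x_train[['cluster_labels']].head())

# The "flags": one center per cluster, one coordinate per feature column
print(kmeans.cluster_centers_.shape)   # (10, n_feature_columns)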
How do we know if our 10 clusters are any good?
from sklearn.metrics import silhouette_score

silhouette_score(x_train_copy, kmeans.labels_)
from sklearn.metrics import silhouette_score — Import the scoring function.

silhouette_score(x_train_copy, kmeans.labels_) — Computes a score from -1 to +1: close to +1 means points sit snugly inside their own cluster, around 0 means clusters overlap, and negative means many points are probably in the wrong cluster.
Note that we pass x_train_copy (the clean data without cluster labels) and kmeans.labels_ (the cluster assignments).

Silhouette score asks each student in class: "Are you sitting closer to your friends (same cluster) or closer to the other group?" If everyone says "I'm way closer to my friends," score is high. If people are confused about which group they belong to, score is low.
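If you want the per-student answers rather than the class average, silhouette_samples gives one value per booking — a small sketch (reusing x_train_copy and kmeans from above) that averages them per cluster to show which clusters are crisp and which are mushy:

import pandas as pd
from sklearn.metrics import silhouette_samples

sil = silhouette_samples(x_train_copy, kmeans.labels_)   # one score per booking
print(pd.Series(sil).groupby(kmeans.labels_).mean())     # average score per cluster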
x_train['cluster_labels'].value_counts()
Now the fun part: what did the clusters find? Let's check if some clusters have higher cancellation rates.
x_train['booking_status'].value_counts()
print("cancellation rate in data:", 100 * 11884 / (11884 + 24388))
cluster_number = []
cancellation_rate = []

for z in range(len(list(x_train['cluster_labels'].unique()))):
    cluster_number.append(z)
    temp = x_train[x_train['cluster_labels'] == z]
    temp_cancelled = temp[temp['booking_status'] == 'Canceled']
    temp_not_cancelled = temp[temp['booking_status'] == 'Not_Canceled']
    cancel = (len(temp_cancelled) / len(temp)) * 100
    cancellation_rate.append(cancel)
cluster_number = [] — Create an empty list. We'll fill it with cluster numbers (0, 1, 2, ... 9).

cancellation_rate = [] — Another empty list for each cluster's cancellation percentage.

for z in range(len(list(x_train['cluster_labels'].unique()))): — This is a loop that says "for each unique cluster label, do the following." Let's break it:
.unique() → gets unique values: [0, 1, 2, ..., 9]
list(...) → converts to a Python list
len(...) → counts them: 10
range(10) → loop from 0 to 9

cluster_number.append(z) — Add the current cluster number (0, 1, 2...) to the list.

temp = x_train[x_train['cluster_labels'] == z] — Filter the data to only rows belonging to cluster z.

temp_cancelled = temp[temp['booking_status'] == 'Canceled'] — From that cluster, keep only canceled bookings.

cancel = (len(temp_cancelled) / len(temp)) * 100 — Calculate: (number of cancellations / total bookings in cluster) × 100 = cancellation percentage.

cancellation_rate.append(cancel) — Save the percentage.

temp = pd.DataFrame({'cluster': cluster_number, 'cancellation': cancellation_rate})
sns.barplot(x='cluster', y='cancellation', data=temp)
pd.DataFrame({...}) — Create a new table with 2 columns: cluster (0–9) and cancellation (the percentage).

sns.barplot(x='cluster', y='cancellation', data=temp) — Draw a bar chart. Each bar is one cluster, and its height is the cancellation rate. You can instantly see which clusters cancel more!

Finally, we attach the cluster labels to the original (unscaled) table and save everything for the next section's analysis:

data['cluster'] = kmeans.labels_
data.to_csv('clustering_results.csv')
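As an aside, the whole loop above can be collapsed into one groupby expression — a sketch that should produce the same per-cluster percentages (it relies on the cluster_labels and booking_status columns added to x_train earlier):

# Cancellation % per cluster, computed in one pass
rates = (x_train.groupby('cluster_labels')['booking_status']
                .apply(lambda s: (s == 'Canceled').mean() * 100))
print(rates)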
Now we look at what makes each cluster different. We compare both numerical and categorical features across clusters.
plt.figure(figsize=(20, 12))

plt.subplot(3, 3, 1)
temp = pd.DataFrame(data.groupby('cluster')['no_of_adults'].mean()).reset_index()
sns.barplot(x='cluster', y='no_of_adults', data=temp)

plt.subplot(3, 3, 2)
temp = pd.DataFrame(data.groupby('cluster')['no_of_children'].mean()).reset_index()
sns.barplot(x='cluster', y='no_of_children', data=temp)

# ... same pattern for: no_of_weekend_nights, no_of_week_nights,
# required_car_parking_space, lead_time, avg_price_per_room,
# no_of_special_requests

plt.show()
plt.figure(figsize=(20, 12)) — Create a big canvas (20 inches wide, 12 tall) to hold multiple charts.

plt.subplot(3, 3, 1) — Divide the canvas into a 3×3 grid of charts. 1 means "put this chart in position 1 (top-left)."

data.groupby('cluster')['no_of_adults'].mean() — Group all rows by cluster, then calculate the average number of adults in each cluster.

.reset_index() — Turn the grouped result back into a regular table (with cluster as a column, not an index).

sns.barplot(...) — Draw a bar chart comparing the average across clusters.

plt.show() — Display all 8 charts at once.

plt.figure(figsize=(20, 12))
plt.subplot(2, 2, 1)
sns.countplot(x='cluster', hue='type_of_meal_plan', data=data)
plt.subplot(2, 2, 2)
sns.countplot(x='cluster', hue='room_type_reserved', data=data)
plt.subplot(2, 2, 3)
sns.countplot(x='cluster', hue='market_segment_type', data=data)
plt.subplot(2, 2, 4)
sns.countplot(x='cluster', hue='repeated_guest', data=data)
plt.show()
sns.countplot(x='cluster', hue='type_of_meal_plan', data=data) — A count plot with clusters on the x-axis and colored bars for each meal plan type. Unlike barplot (which shows averages), countplot shows raw counts — how many bookings in each cluster chose each meal plan.

hue=... — "Color the bars by this category." Each meal plan gets a different color within each cluster.

We used K=10 earlier, but how do we know 10 is the right number? The Elbow Method helps us decide.
Important: In practice, you should run the Elbow Method BEFORE finalizing your clusters, not after. The notebook does it at the end for educational purposes.
from sklearn.cluster import KMeans

wcss = []
for i in range(2, 30):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init='auto', random_state=42)
    kmeans.fit(x_train_copy)
    wcss.append(kmeans.inertia_)
wcss = [] — Create an empty list to store the "Within-Cluster Sum of Squares" for each value of K. WCSS measures how spread out the points are within their clusters — lower = tighter clusters.

for i in range(2, 30): — Try K = 2, 3, 4, 5, ... all the way to 29. That's 28 different experiments!

KMeans(n_clusters=i, init='k-means++', ...) — Create a K-Means model with i clusters. 'k-means++' is a smarter way to pick initial center points (avoids putting two centers right next to each other).

kmeans.fit(x_train_copy) — Run K-Means on the clean scaled data. We use x_train_copy because x_train now has extra columns (cluster_labels, booking_status).

kmeans.inertia_ — After fitting, this is the WCSS value. It's the sum of (distance from each point to its cluster center)² for all points.

wcss.append(...) — Save the WCSS for this K value.

import matplotlib.pyplot as plt

K = range(2, 30)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Within cluster Sum of Squared distances')
plt.title('The Elbow Method')
plt.show()
K = range(2, 30) — The x-axis values: 2, 3, 4, ..., 29.

plt.plot(K, wcss, 'bx-') — Plot K (x-axis) vs WCSS (y-axis). 'bx-' means: b = blue, x = X markers, - = connect with lines.

plt.xlabel(...), plt.ylabel(...), plt.title(...) — Label the axes and title.

plt.show() — Display the chart.

The chart looks like a bent arm. WCSS always goes down as K increases (more clusters = smaller groups = less spread). But at some point, adding more clusters barely helps — that's the "elbow" where the curve bends sharply.
Look for the elbow: If the curve bends around K=5 or K=8, that's your sweet spot. Before the elbow: too few clusters (big, messy groups). After the elbow: too many clusters (splitting hairs for no reason).
Imagine organizing a classroom into study groups. With 2 groups, kids are very different from each other within each group (high WCSS). With 30 groups of 1 person each, WCSS = 0 (everyone is perfectly grouped — with themselves!). The sweet spot is somewhere in between — say 5 groups where kids are similar enough within each group but you haven't over-fragmented.
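A complementary check is to sweep K and plot the silhouette score instead of WCSS — look for a peak rather than a bend. Here's a sketch under the same setup (the 2–14 range is an arbitrary choice to keep the loop fast):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

ks, scores = range(2, 15), []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init='auto', random_state=42).fit(x_train_copy)
    scores.append(silhouette_score(x_train_copy, km.labels_))

plt.plot(ks, scores, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette vs K')
plt.show()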
Recap: we explored the data with head(), info(), describe(), and value_counts(); removed the outlier bookings; one-hot encoded the categorical columns; scaled the numbers with MinMaxScaler; ran K-Means with 10 clusters; checked quality with the silhouette score and the Elbow Method; and kept booking_status aside for later analysis.