📊 DATA CLEANING

Missing Values & Outliers

Learn how to find and handle missing data and extreme values so your models don’t get fooled!

Part 1: Missing Values

👶 What Are Missing Values?

Missing values are empty cells in your data: no number, no text. Like a form where someone skipped “Age” or “Income.” Many algorithms can’t use rows with missing values, so we either remove those rows or fill them (e.g. with the average).

Why Do We Care?

If we ignore missing data, we might drop too many rows or bias our results. If we fill them carelessly, we might inject wrong information. So we always inspect first (how many? which columns?) then choose a strategy.

Common Strategies

Strategy | When to Use
Drop rows (dropna()) | Only a few values are missing; dropping them doesn't lose much data.
Fill with mean/median (fillna(mean)) | Numeric column; the mean or median is a reasonable guess.
Fill with mode | Categorical column; use the most frequent category.
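As a quick sketch of the three strategies on a tiny toy table (the column names and values here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy table with one gap in a numeric column and one in a categorical column
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Strategy 1: drop rows with any missing value (keeps rows 0 and 3)
dropped = df.dropna()

# Strategy 2: fill the numeric gap with the mean of the rest
filled_mean = df["age"].fillna(df["age"].mean())  # NaN -> (25+40+35)/3

# Strategy 3: fill the categorical gap with the most frequent value
filled_mode = df["city"].fillna(df["city"].mode()[0])  # None -> "Delhi"
```

mode() can return several values when there is a tie, which is why we take the first one with [0].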

Missing Values in Python

Use pandas to detect and handle missing values. You can use any CSV; we provide one below so you can follow along.

📥 Download a dataset to try: hotel_reservations.csv — save it in the same folder as your script. The code below loads hotel_reservations.csv directly; if you use your own CSV instead, change the filename in read_csv and pick a numeric column from your file (e.g. avg_price_per_room) for the fill examples.

import pandas as pd

# Load your data (e.g. hotel_reservations.csv from the link above)
df = pd.read_csv("hotel_reservations.csv")

# How many missing values per column?
print(df.isna().sum())

# Option 1: Drop rows with any missing value
df_dropped = df.dropna()

# Option 2: Fill a numeric column with its mean (e.g. avg_price_per_room)
col = "avg_price_per_room"  # or any numeric column in your CSV
if col in df.columns:
    df_mean_filled = df.copy()
    df_mean_filled[col] = df_mean_filled[col].fillna(df_mean_filled[col].mean())

# Option 3: Fill with the median (better if the column has outliers)
# Note: Options 2 and 3 are alternatives — pick one, not both in sequence
if col in df.columns:
    df_median_filled = df.copy()
    df_median_filled[col] = df_median_filled[col].fillna(df_median_filled[col].median())

📖 Line-by-line (layman)

pd.read_csv("...") — load the CSV into a table (DataFrame).

df.isna().sum() — for each column, count how many cells are missing (NaN).

df.dropna() — remove any row that has at least one missing value.

df[col].fillna(df[col].mean()) — replace missing values in that column with the column’s average. Use median() instead of mean() when the column has outliers so one extreme value doesn’t distort the fill.
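To see why the median is the safer fill when extremes are present, here is a small sketch with numbers invented for illustration:

```python
import pandas as pd
import numpy as np

# Four ordinary values, one extreme value, one gap
s = pd.Series([100, 110, 120, np.nan, 9000])

print(s.mean())    # 2332.5 — dragged up by the 9000
print(s.median())  # 115.0  — unaffected by the extreme value

# Filling with the median puts a typical value into the gap
filled = s.fillna(s.median())
```

Filling the gap with the mean would insert 2332.5, a value no real customer is anywhere near; the median inserts 115, which looks like the rest of the column.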

Part 2: Outliers

👶 What Are Outliers?

Outliers are values that sit far from the rest. Example: most customers spend ₹250–₹500 per month, but a few spend ₹10,000+. Those few are outliers. They can skew the mean and confuse models, so we detect them (e.g. boxplot, IQR) then remove or cap them.

Case: Customer Monthly Spend

Imagine you’re analyzing customer transaction data. Most spend is in a narrow range; a few very high values can pull the average up and distort your model. We’ll detect those with visuals and the IQR method.

Step 1: Load Data and Inspect

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample: 95 typical spends (250, 253, ..., 532 in steps of 3) + 5 extreme (10000, 12000, 15000, 18000, 20000)
data = pd.DataFrame({
    'CustomerID': range(1, 101),
    'MonthlySpend': [250 + i*3 for i in range(95)] + [10000, 12000, 15000, 18000, 20000]
})

print(data.head())
print(data.describe())

Most customers are in the ₹250–₹532 range; the last five are huge. That’s a red flag.

Step 2: Visual Detection – Boxplot and Histogram

Boxplot: Anything beyond the “whiskers” is typically an outlier. Histogram: A long tail or a few bars far from the rest suggest outliers.

plt.figure(figsize=(8, 4))
sns.boxplot(x=data['MonthlySpend'])
plt.title("Boxplot of Monthly Spend")
plt.show()

plt.figure(figsize=(8, 4))
sns.histplot(data['MonthlySpend'], bins=20, kde=True)
plt.title("Histogram of Monthly Spend")
plt.show()

Step 3: IQR Method (Statistical)

IQR = Q3 − Q1 (interquartile range). We often mark as outlier anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.

Q1 = data['MonthlySpend'].quantile(0.25)
Q3 = data['MonthlySpend'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data['MonthlySpend'] < lower_bound) | (data['MonthlySpend'] > upper_bound)]
print(outliers)

📖 Line-by-line (layman)

quantile(0.25) / quantile(0.75) — the value below which 25% (or 75%) of the data lies (Q1 and Q3).

IQR = Q3 - Q1 — the spread of the middle 50% of the data.

lower_bound = Q1 - 1.5*IQR and upper_bound = Q3 + 1.5*IQR — the “fence”; values outside this range are usually treated as outliers.

data[(... < lower_bound) | (... > upper_bound)] — keep only rows where MonthlySpend is below the lower fence or above the upper fence; those rows are the outliers.

This returns the rows with extreme values (e.g. the five with ₹10,000+).
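As a sanity check, on the sample data from Step 1 the fences work out to roughly ₹101.5 and ₹695.5, so exactly those five rows are flagged. A self-contained version of the check (re-creating the same sample):

```python
import pandas as pd

# Same synthetic data as in Step 1
data = pd.DataFrame({
    'CustomerID': range(1, 101),
    'MonthlySpend': [250 + i*3 for i in range(95)] + [10000, 12000, 15000, 18000, 20000]
})

Q1 = data['MonthlySpend'].quantile(0.25)
Q3 = data['MonthlySpend'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Boolean mask: True for rows outside the fences
mask = (data['MonthlySpend'] < lower_bound) | (data['MonthlySpend'] > upper_bound)
print(mask.sum())         # 5 rows flagged
print(mask.mean() * 100)  # 5.0 — i.e. 5% of customers
```

Reporting the share of flagged rows (here 5%) is a useful habit: if the IQR rule flags a large fraction of your data, the “outliers” may actually be a legitimate part of the distribution.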

Step 4: Treating Outliers

Option A – Remove: Keep only rows inside the bounds. Option B – Cap: Set values below lower_bound to lower_bound and above upper_bound to upper_bound (winsorize).

# Option A: Remove outliers
cleaned_data = data[(data['MonthlySpend'] >= lower_bound) & (data['MonthlySpend'] <= upper_bound)]

# Option B: Cap (winsorize) – replace extreme values with the bound
import numpy as np
data_capped = data.copy()
data_capped['MonthlySpend'] = np.clip(data_capped['MonthlySpend'], lower_bound, upper_bound)
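To see what each option does to the distribution, here is a self-contained before/after comparison on the same sample data (the exact means follow from the synthetic values):

```python
import pandas as pd
import numpy as np

# Same synthetic data as in Step 1
data = pd.DataFrame({
    'CustomerID': range(1, 101),
    'MonthlySpend': [250 + i*3 for i in range(95)] + [10000, 12000, 15000, 18000, 20000]
})

Q1, Q3 = data['MonthlySpend'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option A: remove rows outside the fences
removed = data[data['MonthlySpend'].between(lower_bound, upper_bound)]

# Option B: cap (winsorize) values at the fences, keeping all 100 rows
capped = data.assign(MonthlySpend=np.clip(data['MonthlySpend'], lower_bound, upper_bound))

print(data['MonthlySpend'].mean())     # inflated by the five extremes
print(removed['MonthlySpend'].mean())  # typical spend, ~391
print(capped['MonthlySpend'].mean())   # in between — rows kept, extremes shrunk
```

Removing drops the five rows entirely; capping keeps them but pulls their values down to the upper fence, so the sample size stays at 100 while the mean moves back toward a typical value.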

💭 Short reflection

In one sentence: when might you prefer to cap (winsorize) outliers instead of removing them? Give a business example.

Summary