Learn how to find and handle missing data and extreme values so your models don’t get fooled!
Missing values are empty cells in your data: no number, no text. Like a form where someone skipped “Age” or “Income.” Many algorithms can’t use rows with missing values, so we either remove those rows or fill them (e.g. with the average).
If we ignore missing data, we might drop too many rows or bias our results. If we fill them carelessly, we might inject wrong information. So we always inspect first (how many? which columns?) then choose a strategy.
| Strategy | When to Use |
|---|---|
| Drop rows (dropna()) | Only a few values are missing; dropping them loses little data. |
| Fill with mean/median (fillna(mean)) | Numeric column; the mean or median is a reasonable guess. |
| Fill with mode | Categorical column; use the most frequent category. |
Use pandas to detect and handle missing values. You can use any CSV; we provide one below so you can follow along.
📥 Download a dataset to try: hotel_reservations.csv, and save it in the same folder as your script. The code below loads hotel_reservations.csv directly; if you use your own CSV, change the filename and pick one of its numeric columns (e.g. avg_price_per_room) where needed.
```python
import pandas as pd

# Load your data (e.g. hotel_reservations.csv from the link above)
df = pd.read_csv("hotel_reservations.csv")

# How many missing values per column?
print(df.isna().sum())

# Option 1: Drop rows with any missing value
df_dropped = df.dropna()

# Option 2: Fill a numeric column with its mean (e.g. avg_price_per_room)
col = "avg_price_per_room"  # or any numeric column in your CSV
if col in df.columns:
    df[col] = df[col].fillna(df[col].mean())

# Option 3: Fill with the median instead (better if the column has outliers).
# Pick Option 2 OR Option 3 for a given column, not both: after the mean
# fill above, there are no missing values left for this line to fill.
if col in df.columns:
    df[col] = df[col].fillna(df[col].median())
```
pd.read_csv("...") — load the CSV into a table (DataFrame).
df.isna().sum() — for each column, count how many cells are missing (NaN).
df.dropna() — remove any row that has at least one missing value.
df[col].fillna(df[col].mean()) — replace missing values in that column with the column’s average. Use median() instead of mean() when the column has outliers so one extreme value doesn’t distort the fill.
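If you don't have the CSV handy, the same steps work on a tiny inline table. This sketch also shows the mode fill for a categorical column from the strategy table above (the column names here are made up for illustration):

```python
import pandas as pd

# Tiny illustrative table with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "age":       [25, None, 35, 40, None],
    "room_type": ["Standard", "Deluxe", "Standard", None, "Standard"],
})

# Inspect: how many missing values per column?
print(df.isna().sum())  # age: 2, room_type: 1

# Numeric column: fill with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with the mode (most frequent value).
# mode() returns a Series because ties are possible, so take the first entry.
df["room_type"] = df["room_type"].fillna(df["room_type"].mode()[0])

print(df.isna().sum().sum())  # → 0
```

Note that mode() skips missing values by default, so the fill value here is "Standard", the most frequent non-missing category.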
Outliers are values that sit far from the rest. Example: most customers spend ₹250–₹500 per month, but a few spend ₹10,000+. Those few are outliers. They can skew the mean and confuse models, so we detect them (e.g. boxplot, IQR) then remove or cap them.
Imagine you’re analyzing customer transaction data. Most spend is in a narrow range; a few very high values can pull the average up and distort your model. We’ll detect those with visuals and the IQR method.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample: 95 normal spends (250, 253, ..., 532) + 5 extreme values
data = pd.DataFrame({
    'CustomerID': range(1, 101),
    'MonthlySpend': [250 + i * 3 for i in range(95)] + [10000, 12000, 15000, 18000, 20000]
})

print(data.head())
print(data.describe())
```
Most customers are in the ₹250–₹532 range; the last five are huge. That’s a red flag.
Boxplot: Anything beyond the “whiskers” is typically an outlier. Histogram: A long tail or a few bars far from the rest suggest outliers.
```python
plt.figure(figsize=(8, 4))
sns.boxplot(x=data['MonthlySpend'])
plt.title("Boxplot of Monthly Spend")
plt.show()

plt.figure(figsize=(8, 4))
sns.histplot(data['MonthlySpend'], bins=20, kde=True)
plt.title("Histogram of Monthly Spend")
plt.show()
```
IQR = Q3 − Q1 (interquartile range). We often mark as outlier anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
```python
Q1 = data['MonthlySpend'].quantile(0.25)
Q3 = data['MonthlySpend'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data['MonthlySpend'] < lower_bound) | (data['MonthlySpend'] > upper_bound)]
print(outliers)
```
quantile(0.25) / quantile(0.75) — the value below which 25% (or 75%) of the data lies (Q1 and Q3).
IQR = Q3 - Q1 — the spread of the middle 50% of the data.
lower_bound = Q1 - 1.5*IQR and upper_bound = Q3 + 1.5*IQR — the “fence”; values outside this range are usually treated as outliers.
data[(... < lower_bound) | (... > upper_bound)] — keep only rows where MonthlySpend is below the lower fence or above the upper fence; those rows are the outliers.
This returns the rows with extreme values (e.g. the five with ₹10,000+).
Option A – Remove: Keep only rows inside the bounds. Option B – Cap: Set values below lower_bound to lower_bound and above upper_bound to upper_bound (winsorize).
```python
import numpy as np

# Option A: Remove outliers
cleaned_data = data[(data['MonthlySpend'] >= lower_bound) & (data['MonthlySpend'] <= upper_bound)]

# Option B: Cap (winsorize) – replace extreme values with the nearest bound
data_capped = data.copy()
data_capped['MonthlySpend'] = np.clip(data_capped['MonthlySpend'], lower_bound, upper_bound)
```
In one sentence: when might you prefer to cap (winsorize) outliers instead of removing them? Give a business example.
Recap: use isna().sum() to inspect missing values, then handle them with dropna(), fillna(mean/median/mode), or more advanced imputation (e.g. KNN).
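The KNN imputation mentioned above fills each missing value from the most similar rows instead of a single global statistic. A minimal sketch using scikit-learn's KNNImputer, assuming scikit-learn is installed (the toy data here is made up):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data with one missing age
df = pd.DataFrame({
    "age":    [25, 30, 35, 40, None],
    "income": [30000, 35000, 40000, 45000, 48000],
})

# Each missing value is filled with the average from the 2 nearest rows,
# where "nearest" is measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

Here the row with income 48000 borrows the ages of its two closest neighbors by income (40 and 35), so the missing age becomes 37.5 rather than the column mean of 32.5.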