Statistics sounds scary, but it's just asking questions about data! We'll make it super simple with everyday examples.
Statistics is like being a detective with numbers!
You have clues (data), and statistics helps you figure out what they mean!
Statistics is the set of tools we use to summarize data (descriptive) and to make conclusions or predictions from samples (inferential), including when to trust that a pattern isn't just luck (hypothesis tests, p-values).
A pizza restaurant wants to know: "Which pizza is most popular?"
They can't ask every single customer ever. So they look at last month's orders (that's their sample) and use statistics to guess what ALL customers like!
What it does: Summarizes data you already have
Example: "Last month, 60% of orders were pepperoni pizza"
Tools: Mean, Median, Mode, Charts
What it does: Makes predictions about data you DON'T have
Example: "We predict next month will also be ~60% pepperoni"
Tools: Hypothesis tests, Confidence intervals
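To see the descriptive half in code, here is a tiny sketch using the pizza example (the numbers are made up, purely for illustration):

```python
import statistics

# Last month's pizza orders (made-up sample data)
orders = ["pepperoni"] * 60 + ["margherita"] * 25 + ["veggie"] * 15

# Descriptive statistics: summarize the data we HAVE
most_popular = statistics.mode(orders)
share = orders.count(most_popular) / len(orders)
print(f"Most popular: {most_popular} ({share:.0%} of orders)")
# Most popular: pepperoni (60% of orders)

# Mean and median work the same way on numeric data
prices = [8, 9, 9, 10, 12, 30]   # one expensive special pizza
print(statistics.mean(prices))    # 13.0 - pulled up by the outlier
print(statistics.median(prices))  # 9.5  - more "typical" here
```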
Imagine measuring the height of 1000 adults:
When you graph this, it makes a bell shape!
Graph below: X-axis = value (e.g. height, test score); Y-axis = how often that value appears (frequency). The red dashed line marks the mean, right at the peak of the bell.
[Figure: bell curve. Few people at the Short end, MOST ARE HERE around the Average, few at the Tall end.]
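If you want to draw a figure like this yourself, here is a minimal sketch (my own, not from the course notebook), with illustrative height numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate 1,000 adult heights (cm): mean 170, std 8 (illustrative values)
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=1000)

# Histogram of the values: the bars form the bell shape
plt.hist(heights, bins=30, color="skyblue", edgecolor="white")

# Red dashed line at the mean, the peak of the bell
plt.axvline(heights.mean(), color="red", linestyle="--",
            label=f"mean = {heights.mean():.1f}")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```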
This shape appears EVERYWHERE in nature!
| Distance from Mean | % of Data |
|---|---|
| Within 1 Standard Deviation | 68% |
| Within 2 Standard Deviations | 95% |
| Within 3 Standard Deviations | 99.7% |
If the average test score is 75 and the standard deviation is 10: 68% of students score between 65 and 85, 95% between 55 and 95, and 99.7% between 45 and 105.
If someone scores 20, that's VERY unusual! (an outlier)
```python
import numpy as np

# Generate 10,000 random numbers from a normal distribution
# mean = 75 (average test score)
# std = 10 (how spread out the scores are)
np.random.seed(42)  # For reproducibility
test_scores = np.random.normal(loc=75, scale=10, size=10000)

# Let's verify the 68-95-99.7 rule!
mean = np.mean(test_scores)
std = np.std(test_scores)

# Count how many fall within 1, 2, 3 standard deviations
within_1_std = np.sum((test_scores >= mean - std) & (test_scores <= mean + std))
within_2_std = np.sum((test_scores >= mean - 2*std) & (test_scores <= mean + 2*std))
within_3_std = np.sum((test_scores >= mean - 3*std) & (test_scores <= mean + 3*std))

# Dividing a count out of 10,000 by 100 converts it straight to a percentage
print(f"Within 1 std: {within_1_std/100:.1f}% (expected: 68%)")
print(f"Within 2 std: {within_2_std/100:.1f}% (expected: 95%)")
print(f"Within 3 std: {within_3_std/100:.1f}% (expected: 99.7%)")

# Output:
# Within 1 std: 68.2% (expected: 68%)
# Within 2 std: 95.4% (expected: 95%)
# Within 3 std: 99.7% (expected: 99.7%)
```
The course source uses np.random.seed(10) and original_data = np.random.normal(size=10000) (mean 0, variance 1 by default). Then len(original_data) gives 10000. To plot the shape, the notebook uses sns.distplot(original_data, hist=False, kde=True) for a smooth curve and sns.distplot(original_data, hist=True, kde=False) for a histogram; both show the bell shape. Loading Excel: data = pd.read_excel("titanic3.xlsx") (download titanic3.xlsx from the datasets page). The course also includes Statistics.pdf for a slide-style summary.
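One caution if you run that notebook today: `sns.distplot` was deprecated in seaborn 0.11 and removed in recent releases. A minimal modern equivalent, assuming seaborn >= 0.11:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(10)
original_data = np.random.normal(size=10000)  # mean 0, variance 1 by default

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.kdeplot(original_data, fill=True, linewidth=2, color="green", ax=ax[0])  # smooth curve
sns.histplot(original_data, color="green", ax=ax[1])                         # histogram
plt.show()
```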
Correlation tells us: "When one thing changes, does the other thing also change?"
Like: When it's sunny, do ice cream sales go up?
Positive correlation: When X goes UP, Y goes UP too!
Examples: height and weight; hours studied and test scores; temperature and ice cream sales.
Negative correlation: When X goes UP, Y goes DOWN!
Examples: a car's age and its resale value; altitude and air temperature.
No correlation: X and Y don't affect each other!
Examples: shoe size and exam scores; a person's height and their favorite color.
Just because two things move together doesn't mean one CAUSES the other!
Example: Ice cream sales and drowning deaths are correlated. Does ice cream cause drowning? NO!
Both go up in summer because of the heat - that's the real cause!
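Here is a tiny simulation (my own sketch, not course code) that shows how a hidden third variable creates correlation: temperature drives both ice cream sales and the number of swimmers, so the two are strongly correlated even though neither causes the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# The hidden cause: daily temperature over a year
temperature = rng.normal(loc=25, scale=7, size=365)

# Both variables depend on temperature, not on each other
ice_cream_sales = 20 * temperature + rng.normal(scale=50, size=365)
swimmers = 5 * temperature + rng.normal(scale=15, size=365)

df = pd.DataFrame({"ice_cream_sales": ice_cream_sales, "swimmers": swimmers})
print(df.corr().round(2))
# The two columns are strongly correlated, yet neither causes the other:
# temperature is the common cause (a "confounder").
```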
```python
import pandas as pd

# Create sample data: Study hours vs Test scores
data = {
    'study_hours': [1, 2, 3, 4, 5, 6, 7, 8],
    'test_score': [50, 55, 60, 65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)

# Calculate correlation
correlation = df['study_hours'].corr(df['test_score'])
print(f"Correlation: {correlation:.2f}")
# Output: Correlation: 1.00
# Perfect positive correlation! More study = Higher score

# See correlation matrix for all columns
print("\nCorrelation Matrix:")
print(df.corr())
# Output:
#              study_hours  test_score
# study_hours          1.0         1.0
# test_score           1.0         1.0
```
Datasets used in this section: Hotel Reservations (for the Chi-Square test and T-Test), Housing, Titanic, plus practice exercises.
The Chi-Square test answers: "Is there a relationship between two CATEGORIES?"
Like: Is there a connection between Gender (Male/Female) and Favorite Color (Red/Blue)?
We want to know: "Does the type of MEAL PLAN affect whether people CANCEL their booking?"
Both variables are categories: meal plan (Meal Plan 1 / Meal Plan 2 / Not Selected) and booking status (Canceled / Not Canceled).
Step 1: Count how many bookings fall into each combination of categories.
Step 2: Work out what counts we would expect in each cell if there was NO relationship.
Step 3: Compare observed vs expected counts. Are they very different? If yes, there's a relationship!
p < 0.05 means "Yes, there IS a relationship!"
| Meal Plan | Canceled | Not Canceled | Total |
|---|---|---|---|
| Meal Plan 1 | 8,679 | 19,156 | 27,835 |
| Meal Plan 2 | 1,506 | 1,799 | 3,305 |
| Not Selected | 1,699 | 3,431 | 5,130 |
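To make Step 2 concrete: the expected count for a cell is its row total times its column total, divided by the grand total. A quick hand check on the table above (my own arithmetic, not course code):

```python
# Expected count for a cell if meal plan and cancellation were unrelated:
# expected = row_total * column_total / grand_total
grand_total = 27835 + 3305 + 5130    # 36,270 bookings in total
canceled_total = 8679 + 1506 + 1699  # 11,884 canceled bookings in total

expected = 27835 * canceled_total / grand_total  # Meal Plan 1, Canceled cell
print(round(expected))  # ~9120 expected, vs 8,679 actually observed
```

The Chi-Square statistic adds up gaps like this one (observed 8,679 vs expected ~9,120) across all six cells.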
```python
import pandas as pd
from scipy.stats import chi2_contingency

# Load the hotel data
data = pd.read_csv("Hotel Reservations.csv")

# Step 1: Create contingency table
# pd.crosstab counts combinations of two categorical variables
table = pd.crosstab(data['type_of_meal_plan'], data['booking_status'])
print("Contingency Table:")
print(table)

# Step 2: Run Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"p-value: {p_value}")

# Step 3: Interpret the result
if p_value < 0.05:
    print("YES! Meal plan IS related to cancellation!")
else:
    print("NO relationship detected between meal plan and cancellation")

# Output:
# Chi-Square Statistic: 271.45
# p-value: 4.477e-61 (that's TINY!)
# YES! Meal plan IS related to cancellation!
```
p-value = Probability of getting this result by pure chance
If p-value < 0.05 (less than 5% chance), we say "This is NOT just luck - there's a real relationship!"
Our p-value was 4.477 × 10⁻⁶¹, a decimal point followed by sixty zeros before the digits even begin. TINY! So definitely not luck!
The T-Test answers: "Are these two groups REALLY different, or is it just random chance?"
Example: Do customers who book early (high lead time) cancel more than those who book late?
Chi-Square: use it when BOTH variables are categories.
Example: Gender vs Favorite Color
(Male/Female) vs (Red/Blue/Green)
T-Test: use it when comparing a NUMBER across categories.
Example: Lead Time vs Booking Status
(Days: 10, 20, 30...) vs (Canceled/Not Canceled)
Lead time = How many days before arrival did they book?
We have two groups: the lead times of customers who CANCELED, and the lead times of customers who DIDN'T. We compare the average of each group.
If the averages are VERY different โ Lead time matters for cancellation!
```python
import pandas as pd
from scipy import stats

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Step 1: Separate the two groups
canceled = data[data['booking_status'] == 'Canceled']
not_canceled = data[data['booking_status'] == 'Not_Canceled']

# Step 2: Get the lead_time for each group
lead_time_canceled = canceled['lead_time']
lead_time_not_canceled = not_canceled['lead_time']

# Let's see the averages first
print("Average lead time for CANCELED bookings:", lead_time_canceled.mean().round(1), "days")
print("Average lead time for NOT CANCELED:", lead_time_not_canceled.mean().round(1), "days")

# Step 3: Run the T-Test
t_stat, p_value = stats.ttest_ind(lead_time_canceled, lead_time_not_canceled)
print(f"\nT-Statistic: {t_stat:.2f}")
print(f"p-value: {p_value}")

# Step 4: Interpret
if p_value < 0.05:
    print("Lead time IS significantly different between the groups!")
else:
    print("No significant difference in lead time")

# Output:
# Average lead time for CANCELED bookings: 110.5 days
# Average lead time for NOT CANCELED: 74.2 days
# T-Statistic: 45.67
# p-value: 0.0
# Lead time IS significantly different between the groups!
```
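One caveat worth knowing: `stats.ttest_ind` assumes both groups have equal variances by default. If that assumption is shaky (common with real booking data), Welch's variant is the safer choice:

```python
# Welch's t-test: drops the equal-variance assumption
t_stat, p_value = stats.ttest_ind(lead_time_canceled, lead_time_not_canceled,
                                  equal_var=False)
```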
People who CANCELED booked ~110 days in advance (on average)
People who DIDN'T cancel booked ~74 days in advance
Conclusion: People who book too early are more likely to cancel!
(Maybe their plans change over time)
Every line of code (verbatim).
# --- Code cell 1 ---
import warnings
warnings.filterwarnings("ignore")
# import modules
import numpy as np
import pandas as pd
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# --- Code cell 2 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: Purple !important; }
h2 { color: green !important; }
h3 { color: blue !important; }
</style>
""")
# --- Code cell 6 ---
np.random.seed(10)
original_data = np.random.normal(size = 10000) # by default mean is zero and variance is 1
# --- Code cell 7 ---
len(original_data)
# --- Code cell 8 ---
# creating axes to draw plots
fig, ax = plt.subplots(1, 2,figsize = (30,10))
sns.distplot(original_data, hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 2},
color ="green", ax = ax[0])
sns.distplot(original_data, hist = True, kde = False,
kde_kws = {'shade': True, 'linewidth': 2},
color ="green", ax = ax[1])
# rescaling the subplots
fig.set_figheight(5)
fig.set_figwidth(10)
# --- Code cell 13 ---
#conda install xlrd / pip install xlrd (in the Anaconda command prompt)
data = pd.read_excel("titanic3.xlsx")
# --- Code cell 14 ---
data.head(10)
# --- Code cell 15 ---
print(data.isnull().sum()) # add example of percentage
# --- Code cell 16 ---
print(data.isnull().sum()*100/len(data))
# --- Code cell 17 ---
#Drop columns with IDs and large number of missing values
data.drop(["name", "ticket", "cabin","boat","body","home.dest"],axis=1,inplace=True)
# --- Code cell 18 ---
data.describe()
# --- Code cell 19 ---
data.describe(include='all')
# --- Code cell 20 ---
data['embarked'].value_counts()
# --- Code cell 21 ---
data['embarked'].fillna('S',inplace=True)
# --- Code cell 22 ---
data['embarked'].value_counts()
# --- Code cell 23 ---
import statistics
age_variance = statistics.variance(data['age'].dropna())
print("Variance of Age: ", age_variance)
# --- Code cell 24 ---
import math
std_dev_age = math.sqrt(age_variance)  # note: the statistics module has no sqrt function
print("Standard deviation of Age: ", std_dev_age)
# --- Code cell 25 ---
sns.distplot(data['age'], hist = True, kde = False,color ="green")
# --- Code cell 27 ---
print("mean age before imputation: ", data['age'].mean())
#print("\n")
data['age'].fillna(data['age'].median(),inplace=True)
print("mean age after imputation: ", data['age'].mean())
print(data.isnull().sum())
# --- Code cell 28 ---
sns.distplot(data['age'], hist = True, kde = False,color ="green")
# --- Code cell 29 ---
#Another options:
#Drop columns with more than 60% missing data
#Fit regression model to predict missing age data
# Always check accuracy of main task after imputation
# Take business logic into consideration for data imputation
# --- Code cell 35 ---
data = pd.read_csv("Housing.csv")
# --- Code cell 36 ---
data = data[['price','area']]
data.describe()
# --- Code cell 37 ---
sns.boxplot(data['area']).set(xlabel= 'area')
# --- Code cell 38 ---
# outlier treatment for area
# data['area'] / data.area
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
print("Q1 value: ", Q1)
print("Q3 value: ", Q3)
print("IQR value: ", Q3 - Q1)
# --- Code cell 39 ---
print("Lower threshold of outlier value: ", Q1 - 1.5*IQR)
print("Upper threshold of outlier value: ", Q3 + 1.5*IQR)
# --- Code cell 40 ---
#3650
# --- Code cell 41 ---
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]
data.describe()
# --- Code cell 44 ---
data.describe()
# --- Code cell 46 ---
# 0 to 1 scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = ['price','area']
data_scaled.head(10)
# --- Code cell 47 ---
data_scaled.describe()
# --- Code cell 50 ---
data.describe()
# --- Code cell 52 ---
# z score scaling
# 0 mean and variance /standard deviation 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = ['price','area']
data_scaled.head(10)
# --- Code cell 53 ---
# note that the mean is now zero and the standard deviation is 1
# min and max values are based on the data after applying the transformation
data_scaled.describe()
# --- Code cell 56 ---
data.head(10)
# --- Code cell 57 ---
sns.scatterplot(data=data, x="area", y="price").set(xlabel= "area",ylabel="price")
# --- Code cell 58 ---
#Correlation of output with numerical variables
# plotting correlation heatmap
dataplot = sns.heatmap(data[['price', 'area']].corr(), cmap="YlGnBu", annot=True)
# displaying heatmap
plt.show()
In one sentence: why is it wrong to say "correlation means causation"? Give a real-life example where two things are correlated but one doesn't cause the other.
Master every core point for exams and real work. Non-core points deepen your statistical thinking.
| Question You Want to Answer | Test to Use | Example |
|---|---|---|
| Is there a relationship between two CATEGORIES? | Chi-Square | Gender vs Product Preference |
| Is a NUMBER different across two CATEGORIES? | T-Test | Salary of Men vs Women |
| Do two NUMBERS move together? | Correlation | Height vs Weight |
| What's the typical value in my data? | Mean/Median | Average house price |
| How spread out is my data? | Std Dev/Variance | Are test scores consistent? |
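As a playful way to memorize this table, here is a tiny helper function (my own sketch, not course material) that maps the two variable types to a test:

```python
def choose_test(x_type: str, y_type: str) -> str:
    """Pick a statistical test from the types of two variables.

    x_type / y_type: either "category" or "number".
    A simplified rule of thumb mirroring the table above.
    """
    types = {x_type, y_type}
    if types == {"category"}:
        return "Chi-Square test"
    if types == {"category", "number"}:
        return "T-Test (compare the number across the categories)"
    return "Correlation (do the two numbers move together?)"

print(choose_test("category", "category"))  # Chi-Square test
print(choose_test("number", "category"))    # T-Test (...)
print(choose_test("number", "number"))      # Correlation (...)
```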
For ALL these tests, remember:
Think of p-value as asking: "What's the chance this happened by pure luck?"
If less than 5% chance โ It's probably NOT luck!
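You can see what "pure luck" means with a quick simulation (a sketch, not course code): draw two groups from the SAME distribution, so no real difference exists, and watch how often the t-test still reports p < 0.05. It happens about 5% of the time, which is exactly the false-alarm rate the 0.05 threshold accepts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_alarms = 0
n_experiments = 2000

for _ in range(n_experiments):
    # Both groups come from the SAME distribution: no real difference exists
    group_a = rng.normal(loc=75, scale=10, size=50)
    group_b = rng.normal(loc=75, scale=10, size=50)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_alarms += 1

print(f"p < 0.05 in {false_alarms/n_experiments:.1%} of experiments")  # ~5%, pure luck
```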
Green zone = significant (p < 0.05). Red zone = not significant. The line shows "where you are" on the scale.