👶 ABSOLUTE BEGINNER FRIENDLY

📊 Statistics for Data Science

Statistics sounds scary, but it's just asking questions about data! We'll make it super simple with everyday examples.

Chapter 1: What is Statistics? (And Why Should You Care?)

👶 Explain Like I'm 5

Statistics is like being a detective with numbers! 🔍

You have clues (data), and statistics helps you figure out what they mean!

📌 In One Sentence

Statistics is the set of tools we use to summarize data (descriptive) and to make conclusions or predictions from samples (inferential), including when to trust that a pattern isn't just luck (hypothesis tests, p-values).

๐Ÿ• Real Life Example

A pizza restaurant wants to know: "Which pizza is most popular?"

They can't ask every single customer ever. So they look at last month's orders (that's their sample) and use statistics to guess what ALL customers like!

Two Types of Statistics

📋 Descriptive Statistics

What it does: Summarizes data you already have

Example: "Last month, 60% of orders were pepperoni pizza"

Tools: Mean, Median, Mode, Charts

🔮 Inferential Statistics

What it does: Makes predictions about data you DON'T have

Example: "We predict next month will also be ~60% pepperoni"

Tools: Hypothesis tests, Confidence intervals
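
To make the two types concrete, here's a tiny sketch with made-up pizza order counts (not course data): descriptive statistics summarize the sample we have, and inferential statistics estimate the share for ALL customers.

import numpy as np

# Descriptive: summarize the orders we HAVE (made-up numbers)
orders = ["pepperoni"] * 60 + ["margherita"] * 25 + ["veggie"] * 15
share = orders.count("pepperoni") / len(orders)
print(f"Sample share of pepperoni: {share:.0%}")  # 60%

# Inferential: a rough 95% confidence interval for the TRUE share
n = len(orders)
se = np.sqrt(share * (1 - share) / n)  # standard error of a proportion
print(f"95% CI: {share - 1.96*se:.0%} to {share + 1.96*se:.0%}")  # about 50% to 70%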

Chapter 2: The Normal Distribution (The Bell Curve)

👶 Explain Like I'm 5

Imagine measuring the height of 1000 adults:

  • Very few people are SUPER short (like 4 feet)
  • Very few people are SUPER tall (like 7 feet)
  • MOST people are somewhere in the middle (around 5'6" to 5'10")

When you graph this, it makes a bell shape! 🔔

🔔 The Bell Curve Shape – With Clear X and Y Axes

[Figure: bell curve. X-axis = value (e.g. height, test score); Y-axis = frequency (how often that value appears). A red dashed line marks the mean at the peak: few values at the Short and Tall extremes, MOST in the middle (Average).]

This shape appears EVERYWHERE in nature!

๐ŸŒ Examples of Normal Distribution in Real Life

  • Test scores: Most students get average grades, few fail, few get perfect
  • Birth weight: Most babies are 6-8 lbs, very few are 3 lbs or 12 lbs
  • Shoe sizes: Most people wear size 8-10, very few wear size 4 or 16
  • Daily temperature: Usually near average, rarely extreme

The 68-95-99.7 Rule (Memorize This!)

๐Ÿ“ How Data Spreads in a Normal Distribution

Distance from Mean % of Data
Within 1 Standard Deviation 68%
Within 2 Standard Deviations 95%
Within 3 Standard Deviations 99.7%

🎯 What Does This Mean?

If the average test score is 75 and the standard deviation is 10:

  • 68% of students scored between 65 and 85 (75 ± 10)
  • 95% of students scored between 55 and 95 (75 ± 20)
  • 99.7% of students scored between 45 and 105 (75 ± 30)

If someone scores 20, that's VERY unusual! (an outlier)

import numpy as np
import matplotlib.pyplot as plt

# Generate 10,000 random numbers from a normal distribution
# mean = 75 (average test score)
# std = 10 (how spread out the scores are)

np.random.seed(42)  # For reproducibility
test_scores = np.random.normal(loc=75, scale=10, size=10000)

# Let's verify the 68-95-99.7 rule!
mean = np.mean(test_scores)
std = np.std(test_scores)

# Count how many fall within 1, 2, 3 standard deviations
within_1_std = np.sum((test_scores >= mean - std) & (test_scores <= mean + std))
within_2_std = np.sum((test_scores >= mean - 2*std) & (test_scores <= mean + 2*std))
within_3_std = np.sum((test_scores >= mean - 3*std) & (test_scores <= mean + 3*std))

print(f"Within 1 std: {within_1_std/100:.1f}% (expected: 68%)")
print(f"Within 2 std: {within_2_std/100:.1f}% (expected: 95%)")
print(f"Within 3 std: {within_3_std/100:.1f}% (expected: 99.7%)")

# Output:
# Within 1 std: 68.2% (expected: 68%)
# Within 2 std: 95.4% (expected: 95%)
# Within 3 std: 99.7% (expected: 99.7%)

📘 Same idea in the course notebook

The course source uses np.random.seed(10) and original_data = np.random.normal(size=10000) (mean 0, variance 1 by default). Then len(original_data) gives 10000. To plot the shape, the notebook uses sns.distplot(original_data, hist=False, kde=True) for a smooth curve and sns.distplot(original_data, hist=True, kde=False) for a histogram; both show the bell shape. Loading Excel: data = pd.read_excel("titanic3.xlsx") (download titanic3.xlsx from the datasets page). The course also includes Statistics.pdf for a slide-style summary.
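
Heads-up: sns.distplot is deprecated in recent seaborn versions. If it errors for you, this sketch (my substitution, not from the course) uses the current equivalents, sns.kdeplot and sns.histplot:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(10)
original_data = np.random.normal(size=10000)

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.kdeplot(original_data, fill=True, linewidth=2, color="green", ax=ax[0])  # smooth curve
sns.histplot(original_data, color="green", ax=ax[1])                         # histogram
plt.show()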

Chapter 3: Correlation (Do Things Move Together?)

👶 Explain Like I'm 5

Correlation tells us: "When one thing changes, does the other thing also change?"

Like: When it's sunny ☀️, do ice cream sales go up? 🍦

Three Types of Correlation

📈 Positive Correlation (toward +1)

When X goes UP, Y goes UP too!

Examples:

  • More study hours → Higher grades
  • Taller person → Usually weighs more
  • More bedrooms → Higher house price

📉 Negative Correlation (toward -1)

When X goes UP, Y goes DOWN!

Examples:

  • More exercise → Less body fat
  • Higher altitude → Lower temperature
  • More ads blocked → Less revenue

โ†”๏ธ No Correlation (0)

X and Y don't affect each other!

Examples:

  • Shoe size → Intelligence
  • Your birthday → Test scores
  • Hair color → Salary
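
You can see all three patterns by faking some data and checking the coefficients; a minimal sketch (invented numbers):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=500)

y_pos = 2 * x + rng.normal(scale=0.5, size=500)   # moves WITH x
y_neg = -2 * x + rng.normal(scale=0.5, size=500)  # moves AGAINST x
y_none = rng.normal(size=500)                     # unrelated noise

print(np.corrcoef(x, y_pos)[0, 1])   # close to +1
print(np.corrcoef(x, y_neg)[0, 1])   # close to -1
print(np.corrcoef(x, y_none)[0, 1])  # close to 0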

📊 Correlation Coefficient Scale

-1 (Perfect Negative)  ...  0 (No Correlation)  ...  +1 (Perfect Positive)

โš ๏ธ SUPER IMPORTANT: Correlation โ‰  Causation!

Just because two things move together doesn't mean one CAUSES the other!

Example: Ice cream sales and drowning deaths are correlated. Does ice cream cause drowning? NO!

Both go up in summer because of the heat - that's the real cause!

import pandas as pd
import numpy as np

# Create sample data: Study hours vs Test scores
data = {
    'study_hours': [1, 2, 3, 4, 5, 6, 7, 8],
    'test_score':  [50, 55, 60, 65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)

# Calculate correlation
correlation = df['study_hours'].corr(df['test_score'])
print(f"Correlation: {correlation:.2f}")

# Output: Correlation: 1.00
# Perfect positive correlation! More study = Higher score

# See correlation matrix for all columns
print("\nCorrelation Matrix:")
print(df.corr())

# Output:
#              study_hours  test_score
# study_hours          1.0         1.0
# test_score           1.0         1.0
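
And here's correlation-without-causation in code: a small simulation (invented numbers) where summer heat drives BOTH ice cream sales and drownings, so they correlate even though neither causes the other.

import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, 365)                       # the hidden common cause
ice_cream = 50 + 3 * temperature + rng.normal(0, 5, 365)   # sales rise with heat
drownings = 1 + 0.2 * temperature + rng.normal(0, 1, 365)  # more swimming in heat -> more drownings

print(np.corrcoef(ice_cream, drownings)[0, 1])  # strongly positive, yet no causation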

Chapter 4: Chi-Square Test (Are Categories Related?)

📥 Download datasets for this lesson

Hotel Reservations (Chi-Square & T-Test), Housing, Titanic, and Practice exercises:

  • Hotel Reservations.csv
  • Housing.csv
  • titanic3.xlsx
  • Practice.xlsx

👶 Explain Like I'm 5

The Chi-Square test answers: "Is there a relationship between two CATEGORIES?"

Like: Is there a connection between Gender (Male/Female) and Favorite Color (Red/Blue)?

๐Ÿจ Real Example: Hotel Bookings

We want to know: "Does the type of MEAL PLAN affect whether people CANCEL their booking?"

Both variables are categories:

  • Meal Plan: Plan 1, Plan 2, Not Selected
  • Booking Status: Canceled, Not Canceled

How Chi-Square Works (Super Simple)

๐Ÿ“ The Process

1
Create a "Contingency Table"

Count how many fall into each combination of categories

2
Calculate "Expected" counts

If there was NO relationship, what would we expect?

3
Compare Observed vs Expected

Are they very different? If yes โ†’ There's a relationship!

4
Check the p-value

p < 0.05 means "Yes, there IS a relationship!"

📊 Example Contingency Table

                Canceled   Not Canceled    Total
Meal Plan 1        8,679         19,156   27,835
Meal Plan 2        1,506          1,799    3,305
Not Selected       1,699          3,431    5,130
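
Before touching the full CSV, here's a minimal sketch of steps 2 and 3 using just the counts from the table above (expected count = row total × column total ÷ grand total):

import numpy as np

# Observed counts from the contingency table above
observed = np.array([[8679, 19156],
                     [1506,  1799],
                     [1699,  3431]])

# Step 2: expected counts if meal plan and cancellation were unrelated
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
print(expected.round(0))

# Step 3: large gaps between observed and expected hint at a relationship
print((observed - expected).round(0))

The Chi-Square test below does this comparison on the raw dataset and turns it into a p-value:
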
import pandas as pd
from scipy.stats import chi2_contingency

# Load the hotel data
data = pd.read_csv("Hotel Reservations.csv")

# Step 1: Create contingency table
# pd.crosstab counts combinations of two categorical variables
table = pd.crosstab(data['type_of_meal_plan'], data['booking_status'])
print("Contingency Table:")
print(table)

# Step 2: Run Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"p-value: {p_value}")

# Step 3: Interpret the result
if p_value < 0.05:
    print("✅ YES! Meal plan IS related to cancellation!")
else:
    print("❌ NO relationship between meal plan and cancellation")

# Output:
# Chi-Square Statistic: 271.45
# p-value: 4.477e-61  (that's TINY!)
# ✅ YES! Meal plan IS related to cancellation!

🎯 What Does p-value Mean?

p-value = Probability of getting this result by pure chance

If p-value < 0.05 (less than 5% chance), we say "This is NOT just luck - there's a real relationship!"

Our p-value was 0.0000000000000...0004477 - TINY! So definitely not luck!
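
To build intuition, here's a tiny simulation of my own (not from the course): flip a fair coin 100 times, repeat 100,000 times, and ask how often pure chance gives a result as extreme as 60 heads.

import numpy as np

rng = np.random.default_rng(1)
heads = rng.binomial(n=100, p=0.5, size=100_000)  # 100,000 experiments of 100 fair flips

# Fraction of experiments at least as extreme as 60 heads = a simulated p-value
p_sim = np.mean(heads >= 60)
print(f"Simulated p-value for 60/100 heads: {p_sim:.3f}")  # around 0.03, just under 0.05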

Chapter 5: T-Test (Are Two Groups Different?)

👶 Explain Like I'm 5

The T-Test answers: "Are these two groups REALLY different, or is it just random chance?"

Example: Do customers who book early (high lead time) cancel more than those who book late?

When to Use T-Test vs Chi-Square

Chi-Square Test

Use when BOTH variables are categories

Example: Gender vs Favorite Color

(Male/Female) vs (Red/Blue/Green)

T-Test

Use when comparing a NUMBER across categories

Example: Lead Time vs Booking Status

(Days: 10, 20, 30...) vs (Canceled/Not Canceled)

๐Ÿจ Real Example: Does Lead Time Affect Cancellation?

Lead time = How many days before arrival did they book?

We have two groups:

  • Group 1: People who CANCELED - what was their average lead time?
  • Group 2: People who DIDN'T cancel - what was their average lead time?

If the averages are VERY different → Lead time matters for cancellation!

import pandas as pd
from scipy import stats

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Step 1: Separate the two groups
canceled = data[data['booking_status'] == 'Canceled']
not_canceled = data[data['booking_status'] == 'Not_Canceled']

# Step 2: Get the lead_time for each group
lead_time_canceled = canceled['lead_time']
lead_time_not_canceled = not_canceled['lead_time']

# Let's see the averages first
print("Average lead time for CANCELED bookings:", lead_time_canceled.mean().round(1))
print("Average lead time for NOT CANCELED:", lead_time_not_canceled.mean().round(1))

# Step 3: Run the T-Test
t_stat, p_value = stats.ttest_ind(lead_time_canceled, lead_time_not_canceled)

print(f"\nT-Statistic: {t_stat:.2f}")
print(f"p-value: {p_value}")

# Step 4: Interpret
if p_value < 0.05:
    print("✅ Lead time IS significantly different between groups!")
else:
    print("❌ No significant difference in lead time")

# Output:
# Average lead time for CANCELED bookings: 110.5 days
# Average lead time for NOT CANCELED: 74.2 days
# T-Statistic: 45.67
# p-value: 0.0
# ✅ Lead time IS significantly different between groups!

🎯 What Did We Learn?

People who CANCELED booked ~110 days in advance (on average)

People who DIDN'T cancel booked ~74 days in advance

Conclusion: People who book too early are more likely to cancel!

(Maybe their plans change over time)

Complete code from course notebook: statistics.ipynb

Every code cell from the notebook, lightly cleaned up.

# --- Code cell 1 ---
import warnings
warnings.filterwarnings("ignore")

# import modules
import numpy as np
import pandas as pd

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# --- Code cell 2 ---
from IPython.core.display import HTML

HTML("""
<style>
h1 { color: Purple !important; }
h2 { color: green !important; }
h3 { color: blue !important; }
</style>
""")

# --- Code cell 6 ---
np.random.seed(10)
original_data = np.random.normal(size = 10000)  # by default mean is zero and variance is 1

# --- Code cell 7 ---
len(original_data)

# --- Code cell 8 ---
# creating axes to draw plots
fig, ax = plt.subplots(1, 2,figsize = (30,10))
 
sns.distplot(original_data, hist = False, kde = True,
            kde_kws = {'shade': True, 'linewidth': 2},
            color ="green", ax = ax[0])

sns.distplot(original_data, hist = True, kde = False,
            kde_kws = {'shade': True, 'linewidth': 2},
             color ="green", ax = ax[1])
 
# rescaling the subplots
fig.set_figheight(5)
fig.set_figwidth(10)

# --- Code cell 13 ---
# conda install xlrd  /  pip install xlrd  (in the Anaconda command prompt)
data = pd.read_excel("titanic3.xlsx")

# --- Code cell 14 ---
data.head(10)

# --- Code cell 15 ---
print(data.isnull().sum())  # missing values per column (percentage version in the next cell)

# --- Code cell 16 ---
print(data.isnull().sum()*100/len(data))

# --- Code cell 17 ---
#Drop columns with IDs and large number of missing values
data.drop(["name", "ticket", "cabin","boat","body","home.dest"],axis=1,inplace=True)

# --- Code cell 18 ---
data.describe()

# --- Code cell 19 ---
data.describe(include='all')

# --- Code cell 20 ---
data['embarked'].value_counts()

# --- Code cell 21 ---
data['embarked'].fillna('S',inplace=True)

# --- Code cell 22 ---
data['embarked'].value_counts()

# --- Code cell 23 ---
import statistics
age_variance = statistics.variance(data['age'].dropna())
print("Variance of Age: ", age_variance)

# --- Code cell 24 ---
import math  # note: the statistics module has no sqrt(); math.sqrt fills in
std_dev_age = math.sqrt(age_variance)
print("Standard deviation of Age: ", std_dev_age)

# --- Code cell 25 ---

sns.distplot(data['age'], hist = True, kde = False,color ="green")

# --- Code cell 27 ---
print("mean age before imputation: ", data['age'].mean())
#print("\n")
data['age'].fillna(data['age'].median(),inplace=True)
print("mean age after imputation: ", data['age'].mean())
print(data.isnull().sum())

# --- Code cell 28 ---
sns.distplot(data['age'], hist = True, kde = False,color ="green")

# --- Code cell 29 ---
# Other options for handling missing data:
#   - Drop columns with a large share (e.g. > 60%) of missing values
#   - Fit a regression model to predict missing age values
#   - Always check the accuracy of the main task after imputation
#   - Take business logic into consideration for data imputation

# --- Code cell 35 ---
data = pd.read_csv("Housing.csv")

# --- Code cell 36 ---
data = data[['price','area']]
data.describe()

# --- Code cell 37 ---
sns.boxplot(data['area']).set(xlabel= 'area')

# --- Code cell 38 ---
# outlier treatment for area
# data['area']  / data.area
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
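# Tukey's rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR count as outliers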

print("Q1 value: ", Q1)
print("Q3 value: ", Q3)
print("IQR value: ", Q3 - Q1)

# --- Code cell 39 ---
print("Lower threshold of outlier value: ", Q1 - 1.5*IQR)
print("Upper threshold of outlier value: ", Q3 + 1.5*IQR)

# --- Code cell 40 ---
#3650

# --- Code cell 41 ---
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]
data.describe()

# --- Code cell 44 ---
data.describe()

# --- Code cell 46 ---
# 0 to 1 scaling
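# MinMaxScaler formula: x_scaled = (x - x_min) / (x_max - x_min), squashing each column into [0, 1]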
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = ['price','area']
data_scaled.head(10)

# --- Code cell 47 ---
data_scaled.describe()

# --- Code cell 50 ---
data.describe()

# --- Code cell 52 ---
# z score scaling
#  0 mean and variance /standard deviation 1 
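# StandardScaler formula: z = (x - mean) / std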

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = ['price','area']
data_scaled.head(10)

# --- Code cell 53 ---
# Note: the mean is now zero and the standard deviation is 1.
# Min and max values depend on the data after applying the transformation.
data_scaled.describe()

# --- Code cell 56 ---
data.head(10)

# --- Code cell 57 ---
sns.scatterplot(data=data, x="area", y="price").set(xlabel= "area",ylabel="price")

# --- Code cell 58 ---
#Correlation of output with numerical variables

# plotting correlation heatmap
dataplot = sns.heatmap(data[['price', 'area']].corr(), cmap="YlGnBu", annot=True)
  
# displaying heatmap
plt.show()

💭 Short reflection

In one sentence: why is it wrong to say "correlation means causation"? Give a real-life example where two things are correlated but one doesn't cause the other.

🚫 Common Mistakes in Statistics

  • Confusing correlation with causation: two things moving together doesn't mean one causes the other (e.g. ice cream sales and drownings both go up in summer).
  • Thinking p < 0.05 means "proven": it only means "unlikely to be pure chance"; we never prove the alternative, we only reject the null.
  • Using the wrong test: categories vs categories → Chi-Square; numeric mean between two groups → T-Test. Don't use a T-Test for categorical data.

Core & Non-Core Points – Mastery Checklist

Master every core point for exams and real work. Non-core points deepen your statistical thinking.

✅ CORE (Must know)

  • Descriptive stats: mean, median, mode; variance and standard deviation (spread).
  • Normal distribution: bell curve; 68–95–99.7 rule (within 1, 2, 3 std of mean).
  • Correlation: -1 to +1; positive (X↑ Y↑), negative (X↑ Y↓), zero (no linear relation). Correlation ≠ causation.
  • Chi-Square test: are two categorical variables related? Compare observed vs expected counts; p < 0.05 → significant.
  • T-Test: is the mean of a numeric variable different between two groups? p < 0.05 → groups differ significantly.
  • p-value: probability of seeing the result by chance; p < 0.05 → reject "no effect" (significant).

📚 NON-CORE (Good to know)

  • Degrees of freedom in Chi-Square and T-Test.
  • Type I error (false positive) vs Type II error (false negative).
  • When to use paired vs unpaired T-Test.
  • Other tests: ANOVA (3+ groups), Mann-Whitney (non-normal).
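
If you want to peek ahead, both of those tests are one-liners in scipy; a tiny sketch with made-up groups:

from scipy import stats

group_a = [5, 6, 7, 8]
group_b = [6, 7, 8, 9]
group_c = [9, 10, 11, 12]

# ANOVA: compares the means of 3+ groups at once
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Mann-Whitney U: non-parametric alternative to the T-Test (no normality assumption)
u_stat, p_mw = stats.mannwhitneyu(group_a, group_c)

print(p_anova, p_mw)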

Chapter 6: Summary - When to Use What?

Question You Want to Answer                       Test to Use        Example
Is there a relationship between two CATEGORIES?   Chi-Square         Gender vs Product Preference
Is a NUMBER different across two CATEGORIES?      T-Test             Salary of Men vs Women
Do two NUMBERS move together?                     Correlation        Height vs Weight
What's the typical value in my data?              Mean/Median        Average house price
How spread out is my data?                        Std Dev/Variance   Are test scores consistent?

🎯 The Magic p-value Rule

For ALL these tests, remember:

  • p < 0.05 → "Yes, there's a real effect!" ✅
  • p ≥ 0.05 → "No significant effect found" ❌

Think of p-value as asking: "What's the chance this happened by pure luck?"

If less than 5% chance → It's probably NOT luck!
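
As a memory aid, the whole rule fits in a tiny helper (my own illustration; 0.05 is a convention, not a law of nature):

def interpret_p(p_value, alpha=0.05):
    """Plain-words reading of a p-value at the conventional 0.05 threshold."""
    if p_value < alpha:
        return "significant: probably NOT just luck ✅"
    return "not significant: could easily be luck ❌"

print(interpret_p(4.477e-61))  # the Chi-Square p-value from Chapter 4
print(interpret_p(0.30))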

🎬 Animated: p-value scale (0 → 1)

Green zone = significant (p < 0.05). Red zone = not significant. The line shows "where you are" on the scale.

[Interactive scale: 0 to 1 with the cutoff marked at 0.05; Significant to the left of it, Not significant to the right.]