πŸ”¬ Hypothesis Testing Deep Dive

Learn to prove (or disprove!) claims with data. Make decisions with confidence using statistical tests.

πŸ“₯ Dataset for this lesson

Examples use the Hotel Reservations dataset. Download and save it in the same folder as your script so pd.read_csv("Hotel Reservations.csv") works.

Download Hotel Reservations.csv (3.1 MB)

Part 1: The Detective Framework

Hypothesis testing is like being a detective. You have a claim to investigate, evidence to analyze, and a verdict to deliver!

πŸ‘Ά In One Sentence (Like You're 5)

Hypothesis testing answers: "Could this pattern in the data just be luck?" We run a test and get a number called the p-value. If the p-value is very small (usually below 0.05), we say the pattern is probably real; if not, we say we don't have enough evidence. So we never "prove" anythingβ€”we only decide whether the evidence is strong enough.

πŸ•΅οΈ The Detective Analogy

Crime Scene: A business question ("Does the new website increase sales?")

Evidence: Data from experiments and observations

Investigation: Statistical tests

Verdict: "Statistically significant" or "Not enough evidence"

The Hypothesis Testing Process

Step 1: State the Hypotheses

Hβ‚€ (Null): "Nothing is happening" - The default assumption
H₁ (Alternative): "Something IS happening" - What you want to prove

Step 2: Set the Significance Level (Ξ±)

Usually Ξ± = 0.05 (5%). This is your "threshold for surprise" - how unlikely must the evidence be to convince you?

Step 3: Collect Data & Calculate the Test Statistic

Run your experiment, gather data, and calculate the appropriate test statistic (t, z, χ², etc.)

Step 4: Calculate the P-Value

The probability of seeing data this extreme IF Hβ‚€ were true. Small p-value = strong evidence against Hβ‚€.

Step 5: Make Your Decision

If p-value < Ξ±: Reject Hβ‚€ β†’ "Statistically significant!"
If p-value β‰₯ Ξ±: Fail to reject Hβ‚€ β†’ "Not enough evidence"
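The five steps above can be walked through end to end on a toy example. The sketch below uses a hypothetical coin-flip experiment (62 heads in 100 flips, numbers invented for illustration) and `scipy.stats.binomtest`, available in SciPy 1.7+:

```python
from scipy.stats import binomtest

# Step 1: state the hypotheses
# H0: the coin is fair (P(heads) = 0.5)
# H1: the coin is not fair (P(heads) != 0.5)

# Step 2: set the significance level
alpha = 0.05

# Step 3: collect data -- suppose 100 flips produce 62 heads
n_flips, n_heads = 100, 62

# Step 4: compute the p-value with an exact binomial test
result = binomtest(n_heads, n=n_flips, p=0.5, alternative='two-sided')
print(f"P-value: {result.pvalue:.4f}")

# Step 5: make the decision
if result.pvalue < alpha:
    print("Reject H0 -> the coin is probably biased")
else:
    print("Fail to reject H0 -> not enough evidence of bias")
```

With 62 heads the p-value comes out below 0.05, so this hypothetical coin would be flagged as biased; with, say, 55 heads it would not.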

πŸ“Œ Critical Understanding: P-Value

P-value answers: "If nothing special is happening (Hβ‚€ is true), how likely is it to see results this extreme by pure chance?"

Small p-value (< 0.05): Very unlikely by chance β†’ Something IS happening!

Large p-value (β‰₯ 0.05): Could easily happen by chance β†’ Can't conclude anything special
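You can see this definition directly by simulation: generate many datasets where Hβ‚€ really is true and count how often pure chance produces something "extreme". The sketch below uses a fair coin and asks how often random flipping yields a result as lopsided as 62 heads (or 38 or fewer) in 100 flips; all numbers are chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 10,000 experiments where H0 is TRUE: a fair coin, 100 flips each
n_sims, n_flips = 10_000, 100
heads = rng.binomial(n_flips, 0.5, size=n_sims)

# How often is the outcome at least as extreme as 62 heads (>= 62 or <= 38)?
extreme = np.mean((heads >= 62) | (heads <= 38))
print(f"Fraction of pure-chance results this extreme: {extreme:.3f}")
```

The simulated fraction lands close to the exact binomial p-value: that is all a p-value is, the long-run frequency of data this extreme in a world where Hβ‚€ holds.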

Part 2: Chi-Square Test (Categorical vs Categorical)

Use Chi-Square when comparing categories - like gender vs product preference, or meal plan vs booking cancellation.

🏨 Hotel Booking Example

Question: Is there a relationship between meal plan type and booking cancellation?

Hβ‚€: Meal plan and cancellation are independent (no relationship)

H₁: Meal plan and cancellation ARE related

Step 1: Create a Contingency Table

A contingency table shows the frequency count for each combination of categories:

| Meal Plan    | Canceled | Not Canceled | Total  |
|--------------|----------|--------------|--------|
| Meal Plan 1  | 8,679    | 19,156       | 27,835 |
| Meal Plan 2  | 1,506    | 1,799        | 3,305  |
| Not Selected | 1,699    | 3,431        | 5,130  |

import pandas as pd
from scipy.stats import chi2_contingency

# Load hotel reservations data
data = pd.read_csv("Hotel Reservations.csv")

# Create contingency table
contingency_table = pd.crosstab(
    data['type_of_meal_plan'], 
    data['booking_status']
)
print("Contingency Table:")
print(contingency_table)

# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"P-value: {p_value:.2e}")  # Scientific notation
print(f"Degrees of Freedom: {dof}")

# Interpret
if p_value < 0.05:
    print("βœ… REJECT Hβ‚€: Meal plan and cancellation ARE related!")
else:
    print("❌ FAIL TO REJECT Hβ‚€: No significant relationship found.")

# Output:
# Chi-Square Statistic: 276.35
# P-value: 4.48e-61  ← Extremely small!
# βœ… REJECT Hβ‚€: Meal plan and cancellation ARE related!
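The `expected` array returned by `chi2_contingency` is worth inspecting: it holds the counts you would see if the two variables were independent, so comparing observed against expected shows which cells drive the result. A sketch using the counts from the contingency table above, hard-coded so it runs without the CSV:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above
observed = np.array([
    [8679, 19156],   # Meal Plan 1: canceled, not canceled
    [1506,  1799],   # Meal Plan 2
    [1699,  3431],   # Not Selected
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Degrees of freedom: {dof}")   # (rows - 1) * (cols - 1) = 2

# Counts we WOULD see if meal plan and cancellation were independent
print(np.round(expected, 1))

# Positive entries = more bookings in that cell than independence predicts;
# Meal Plan 2 shows far more cancellations than expected
print(np.round(observed - expected, 1))
```

Here the Meal Plan 2 row carries most of the chi-square statistic, which is the actionable detail the single p-value hides.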

🎯 What Does This Mean for Business?

Customers with different meal plans have different cancellation rates! You can now:

  • Offer incentives to high-cancellation meal plan groups
  • Adjust pricing based on cancellation risk
  • Target marketing to low-cancellation groups

Multiple Features at Once

# Test multiple categorical features against booking status
categorical_features = ['type_of_meal_plan', 'room_type_reserved', 
                        'no_of_weekend_nights', 'no_of_children']

for feature in categorical_features:
    # Create contingency table
    table = pd.crosstab(data[feature], data['booking_status'])
    
    # Chi-Square test
    chi2, p_value, dof, expected = chi2_contingency(table)
    
    print(f"\n{feature}:")
    print(f"  P-value: {p_value:.2e}")
    
    if p_value < 0.05:
        print(f"  βœ… SIGNIFICANT - Use this feature for prediction!")
    else:
        print(f"  ❌ Not significant - May not be useful alone")

# Output:
# type_of_meal_plan:
#   P-value: 4.48e-61
#   βœ… SIGNIFICANT - Use this feature for prediction!
# 
# room_type_reserved:
#   P-value: 4.43e-11
#   βœ… SIGNIFICANT - Use this feature for prediction!
# 
# no_of_weekend_nights:
#   P-value: 1.12e-40
#   βœ… SIGNIFICANT - Use this feature for prediction!

Part 3: T-Test (Comparing Numerical Means)

Use T-Test when comparing average values between two groups - like average spending by gender, or conversion rates between website versions.

Types of T-Tests

| Type               | When to Use                               | Example                         |
|--------------------|-------------------------------------------|---------------------------------|
| Independent T-Test | Comparing two separate groups             | Male vs Female average spending |
| Paired T-Test      | Same group measured twice                 | Weight before vs after diet     |
| Welch's T-Test     | Two groups with different sizes/variances | Most real-world scenarios       |
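In scipy, `ttest_ind` runs the classic Student's (equal-variance) test by default; pass `equal_var=False` to get Welch's version, and use `ttest_rel` for the paired case. A minimal sketch on synthetic data (group sizes, means, and spreads are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two independent groups with different sizes AND variances -> Welch's t-test
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=108, scale=25, size=80)
t_w, p_w = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's
print(f"Welch's t-test p-value: {p_w:.4f}")

# Same subjects measured twice (before/after a diet) -> paired t-test
before = rng.normal(loc=80, scale=5, size=30)
after = before - rng.normal(loc=2, scale=1.5, size=30)  # ~2 kg lost on average
t_p, p_p = stats.ttest_rel(before, after)
print(f"Paired t-test p-value: {p_p:.4f}")
```

The paired test is much more powerful here because it compares each person to themselves, removing person-to-person variation from the noise.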

🏨 Hotel Example: Lead Time & Cancellation

Question: Do customers who cancel book further in advance?

Hβ‚€: Lead time is the same for canceled and non-canceled bookings

H₁: Lead time is DIFFERENT between groups

from scipy import stats
import pandas as pd

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Separate the two groups
canceled = data[data['booking_status'] == 'Canceled']['lead_time']
not_canceled = data[data['booking_status'] == 'Not_Canceled']['lead_time']

# Compare the means first
print("Descriptive Statistics:")
print(f"Canceled bookings - Mean lead time: {canceled.mean():.1f} days")
print(f"Not canceled bookings - Mean lead time: {not_canceled.mean():.1f} days")

# Perform independent t-test (scipy defaults to Student's equal-variance
# test; pass equal_var=False if you want Welch's version)
t_stat, p_value = stats.ttest_ind(canceled, not_canceled)

print(f"\nT-statistic: {t_stat:.2f}")
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("βœ… SIGNIFICANT: Lead time differs between groups!")
else:
    print("❌ Not significant.")

# Output:
# Descriptive Statistics:
# Canceled bookings - Mean lead time: 135.2 days
# Not canceled bookings - Mean lead time: 72.8 days
# 
# T-statistic: 54.23
# P-value: 0.0  ← So small it's essentially zero!
# βœ… SIGNIFICANT: Lead time differs between groups!

🎯 Business Insight

Customers who cancel book ~62 days earlier on average! You can:

  • Send reminder emails for long-lead-time bookings
  • Require deposits for far-advance bookings
  • Offer incentives for keeping reservations

Testing Multiple Numerical Features

# Test multiple numerical features
numerical_features = ['lead_time', 'avg_price_per_room']

for feature in numerical_features:
    canceled = data[data['booking_status'] == 'Canceled'][feature]
    not_canceled = data[data['booking_status'] == 'Not_Canceled'][feature]
    
    # T-Test
    t_stat, p_value = stats.ttest_ind(canceled, not_canceled)
    
    print(f"\n{feature}:")
    print(f"  Canceled mean: {canceled.mean():.2f}")
    print(f"  Not canceled mean: {not_canceled.mean():.2f}")
    print(f"  P-value: {p_value:.2e}")
    
    if p_value < 0.05:
        print(f"  βœ… SIGNIFICANT - Great predictor of cancellation!")
    else:
        print(f"  ❌ Not significant")

# Output:
# lead_time:
#   Canceled mean: 135.19
#   Not canceled mean: 72.84
#   P-value: 0.00e+00
#   βœ… SIGNIFICANT - Great predictor of cancellation!
# 
# avg_price_per_room:
#   Canceled mean: 108.71
#   Not canceled mean: 100.56
#   P-value: 5.23e-164
#   βœ… SIGNIFICANT - Great predictor of cancellation!

Part 4: Choosing the Right Test

🧭 Decision Flowchart

What type of data are you comparing?

πŸ“Š Categorical vs Categorical?

β†’ Use Chi-Square Test

Example: Gender vs Product Preference

πŸ“ˆ Numerical, comparing 2 groups?

β†’ Use T-Test

Example: Average spending - Male vs Female

πŸ“ˆ Numerical, comparing 3+ groups?

β†’ Use ANOVA

Example: Sales across North, South, East, West regions

πŸ“ˆ Numerical vs Numerical relationship?

β†’ Use Correlation Test

Example: Hours studied vs Test score

| Test                | Data Types             | Question Answered             | Python Function      |
|---------------------|------------------------|-------------------------------|----------------------|
| Chi-Square          | Cat vs Cat             | Are these categories related? | chi2_contingency()   |
| T-Test              | Num vs Cat (2 groups)  | Are the means different?      | ttest_ind()          |
| ANOVA               | Num vs Cat (3+ groups) | Is any group mean different?  | f_oneway()           |
| Pearson Correlation | Num vs Num             | Do they move together?        | pearsonr()           |
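ANOVA and correlation don't appear in the hotel examples, so here is a small sketch of both on synthetic data (the region sales figures and the hours/score relationship are invented for illustration):

```python
import numpy as np
from scipy.stats import f_oneway, pearsonr

rng = np.random.default_rng(1)

# ANOVA: average sales across four regions (3+ groups)
north = rng.normal(50, 8, size=40)
south = rng.normal(55, 8, size=40)
east = rng.normal(50, 8, size=40)
west = rng.normal(62, 8, size=40)   # clearly higher than the others
f_stat, p_anova = f_oneway(north, south, east, west)
print(f"ANOVA p-value: {p_anova:.2e}")  # small -> at least one region differs

# Pearson correlation: hours studied vs test score (two numerical variables)
hours = rng.uniform(0, 10, size=60)
scores = 50 + 4 * hours + rng.normal(0, 5, size=60)
r, p_corr = pearsonr(hours, scores)
print(f"Pearson r = {r:.2f}, p-value = {p_corr:.2e}")
```

Note that ANOVA only says *some* group mean differs; identifying *which* one requires a post-hoc test such as Tukey's.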

Part 5: Common Mistakes to Avoid

❌ Mistake 1: p-value = Probability Hβ‚€ is True

"P-value of 0.03 means there's only 3% chance the null is true"

Wrong! P-value is the probability of seeing this data IF Hβ‚€ is true, not the probability that Hβ‚€ is true.

βœ… Correct Interpretation

"IF nothing special is happening, there's only a 3% chance of seeing data this extreme by random chance."

This is strong evidence AGAINST Hβ‚€, but not proof.

❌ Mistake 2: "Fail to Reject" = "Accept Hβ‚€"

"The p-value is 0.15, so we accept the null hypothesis"

Wrong! We never "accept" the null - we just don't have enough evidence to reject it.

βœ… Correct Language

"We fail to reject Hβ‚€" or "There is insufficient evidence to conclude..."

Absence of evidence is not evidence of absence!

❌ Mistake 3: p < 0.05 Always Means Important

"The p-value is 0.001, so this is a huge effect!"

Wrong! Statistical significance β‰  Practical significance. With large samples, tiny differences can be "significant".

βœ… Correct Approach

Always report EFFECT SIZE alongside p-value.

"Website B increases conversion by 0.1% (p=0.001)" - Is 0.1% worth the effort?
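One common effect-size measure is Cohen's d, the difference of means in pooled-standard-deviation units (this helper is not from the course source; the data are synthetic). The sketch below shows how an enormous sample makes a trivial difference "significant" while d stays tiny:

```python
import numpy as np
from scipy import stats

def cohens_d(group1, group2):
    """Effect size: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * np.var(group1, ddof=1) +
                  (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(group1) - np.mean(group2)) / np.sqrt(pooled_var)

rng = np.random.default_rng(7)

# Two huge samples whose true means differ by only 0.5 (sd = 15)
a = rng.normal(100.0, 15, size=100_000)
b = rng.normal(100.5, 15, size=100_000)

t, p = stats.ttest_ind(a, b)
d = cohens_d(a, b)
print(f"p-value: {p:.2e}")      # "significant" thanks to the huge n
print(f"Cohen's d: {d:.3f}")    # but |d| is tiny (roughly 0.03)
```

A rough convention: |d| β‰ˆ 0.2 is small, 0.5 medium, 0.8 large; here the effect is far below even "small" despite the significant p-value.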

⚠️ p-Hacking Warning

What is it? Running many tests until you find one with p < 0.05 by pure chance.

Why it's bad: If you test 20 hypotheses at Ξ±=0.05, you'd expect 1 false positive by chance!

Solution: Pre-register your hypotheses. If testing multiple comparisons, use Bonferroni correction (Ξ±/n).
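A minimal sketch of the Bonferroni correction, using made-up p-values for five hypothetical features:

```python
# Bonferroni correction: with n tests, compare each p-value to alpha / n
alpha = 0.05
p_values = {                      # hypothetical p-values from 5 separate tests
    'feature_a': 0.001,
    'feature_b': 0.012,
    'feature_c': 0.030,
    'feature_d': 0.049,
    'feature_e': 0.600,
}

adjusted_alpha = alpha / len(p_values)   # 0.05 / 5 = 0.01
print(f"Bonferroni-adjusted alpha: {adjusted_alpha}")

for name, p in p_values.items():
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{name}: p={p} -> {verdict}")

# Only feature_a survives; 0.012, 0.030, and 0.049 would have passed the
# naive 0.05 cutoff but fail the corrected one
```

Bonferroni is deliberately conservative; if it discards too much, milder corrections such as Holm's step-down procedure exist.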

Part 6: Using Hypothesis Tests for Feature Selection

Before building a machine learning model, use hypothesis tests to identify which features are actually related to your target variable!

🎯 Feature Selection Workflow

Goal: Predict hotel booking cancellation

Available features: meal_plan, room_type, lead_time, price, weekend_nights, children

Question: Which features should we include in our model?

import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Define feature types
categorical_features = ['type_of_meal_plan', 'room_type_reserved']
numerical_features = ['lead_time', 'avg_price_per_room']

print("="*50)
print("FEATURE SELECTION USING HYPOTHESIS TESTS")
print("="*50)

selected_features = []

# Test categorical features with Chi-Square
print("\nπŸ“Š CATEGORICAL FEATURES (Chi-Square Test)")
for feature in categorical_features:
    table = pd.crosstab(data[feature], data['booking_status'])
    chi2, p_value, dof, expected = chi2_contingency(table)
    
    if p_value < 0.05:
        selected_features.append(feature)
        print(f"βœ… {feature}: p={p_value:.2e} β†’ SELECTED")
    else:
        print(f"❌ {feature}: p={p_value:.2f} β†’ REJECTED")

# Test numerical features with T-Test
print("\nπŸ“ˆ NUMERICAL FEATURES (T-Test)")
for feature in numerical_features:
    canceled = data[data['booking_status'] == 'Canceled'][feature]
    not_canceled = data[data['booking_status'] == 'Not_Canceled'][feature]
    
    t_stat, p_value = ttest_ind(canceled, not_canceled)
    
    if p_value < 0.05:
        selected_features.append(feature)
        print(f"βœ… {feature}: p={p_value:.2e} β†’ SELECTED")
    else:
        print(f"❌ {feature}: p={p_value:.2f} β†’ REJECTED")

print(f"\n🎯 FINAL SELECTED FEATURES: {selected_features}")

# Output:
# ==================================================
# FEATURE SELECTION USING HYPOTHESIS TESTS
# ==================================================
# 
# πŸ“Š CATEGORICAL FEATURES (Chi-Square Test)
# βœ… type_of_meal_plan: p=4.48e-61 β†’ SELECTED
# βœ… room_type_reserved: p=4.43e-11 β†’ SELECTED
# 
# πŸ“ˆ NUMERICAL FEATURES (T-Test)
# βœ… lead_time: p=0.00e+00 β†’ SELECTED
# βœ… avg_price_per_room: p=5.23e-164 β†’ SELECTED
# 
# 🎯 FINAL SELECTED FEATURES: ['type_of_meal_plan', 'room_type_reserved', 
#                              'lead_time', 'avg_price_per_room']

πŸ’‘ Important Note

A feature with p > 0.05 might still be useful when combined with other features! Use hypothesis tests as a starting point, not the final word. Cross-validation with actual ML models gives the true answer.

πŸ“˜ From the course notebook (Hypothesis Testing)

The course source covers t-tests, chi-square, ANOVA with examples. Key ideas: scipy.stats.ttest_ind (two groups), chi2_contingency (categorical vs categorical), f_oneway (3+ groups). Use real data (e.g. from the datasets page) to run tests and interpret p-values. See Hypothesis testing.pdf in the course source for slides.

Complete code from course notebook: hypothesis_testing.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>
h1 { color: Purple !important; }
h2 { color: green !important; }
h3 { color: blue !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd

# --- Code cell 3 ---
data = pd.read_csv("Hotel Reservations.csv")

# --- Code cell 4 ---
data.head(10)

# --- Code cell 5 ---
data.columns

# --- Code cell 6 ---
data.info()

# --- Code cell 7 ---
data = data[['no_of_weekend_nights' ,'no_of_children','type_of_meal_plan','room_type_reserved','lead_time','avg_price_per_room','booking_status']]

# --- Code cell 8 ---
data['booking_status'].value_counts()

# --- Code cell 9 ---
data['no_of_children'].value_counts()

# --- Code cell 10 ---
data['type_of_meal_plan'].value_counts()

# --- Code cell 11 ---
data['room_type_reserved'].value_counts()

# --- Code cell 12 ---
data.columns

# --- Code cell 13 ---
categorical_columns = ['no_of_weekend_nights' ,'no_of_children', 'type_of_meal_plan', 'room_type_reserved']

# --- Code cell 17 ---
from scipy.stats import chi2_contingency

# --- Code cell 18 ---
# run a chi-square test for each categorical feature
for column in categorical_columns:
    # keep only categories with more than 20 examples
    temp = pd.DataFrame(data[column].value_counts()).reset_index()
    temp.columns = [column, 'frequency']
    categories = list(temp[temp['frequency'] > 20][column])
    data_new = data[data[column].isin(categories)]
    print(data_new[column].value_counts())

    table = pd.crosstab(data_new[column], data_new['booking_status'])
    print(table)
    stat, p, dof, expected = chi2_contingency(table)
    print("Chi-square test for feature: ", column)
    print("p-value : ", p)
    print("")
    print("")
# --- Code cell 19 ---
column = 'no_of_weekend_nights'
temp = pd.DataFrame(data[column].value_counts()).reset_index()
print(temp)

# --- Code cell 20 ---
temp.columns = [column,'frequency']
temp.head(10)

# --- Code cell 21 ---
#Step 1 - remove rows with 20 or less examples
categories = list(temp[temp['frequency']>20][column])
print(categories)

# --- Code cell 22 ---
data_new = data[data[column].isin(categories)]
print(data_new[column].value_counts())

# --- Code cell 23 ---
#Step 2 - create a contingency table using crosstab
table = pd.crosstab(data_new[column], data_new['booking_status'])
print(table)

#Step 3 - Perform chi square test using chi2_contingency function
stat, p, dof, expected = chi2_contingency(table)
print("Chi-square test for feature: ", column)
print("p-value : ", p)
print("")
print("")

# --- Code cell 24 ---
# if p-value < 0.05 -> reject H0 in favor of the alternative hypothesis
# (at least 95% confidence)

# --- Code cell 27 ---
numerical_columns = ['lead_time','avg_price_per_room']

# --- Code cell 28 ---
data_cancelled = data[data['booking_status'] == 'Canceled']
data_cancelled[numerical_columns].describe()

# --- Code cell 29 ---
data_not_cancelled = data[data['booking_status'] == 'Not_Canceled']
data_not_cancelled[numerical_columns].describe()

# --- Code cell 30 ---
from scipy import stats

# run a t-test for each numerical feature
for column in numerical_columns:
    samples_set1 = data_cancelled[column]
    samples_set2 = data_not_cancelled[column]
    stat, p = stats.ttest_ind(samples_set1, samples_set2)

    print("ttest for feature: ", column)
    print("p-value : ", p)
    print("")
    print("")

# --- Code cell 31 ---
#lead_time - 0.4 -o/p

# --- Code cell 32 ---
#lead_time , price , type plan ,

# --- Code cell 34 ---
from scipy import stats

# --- Code cell 35 ---
column = 'lead_time'

# --- Code cell 36 ---
# Step 1 - create two populations for the numerical feature
samples_set1 = data_cancelled[column] # population1 -for cancelled bookings
samples_set2 = data_not_cancelled[column] # population2 -for not cancelled bookings

# --- Code cell 37 ---
samples_set1

# --- Code cell 38 ---
# Step 2 - pass the populations to ttest function
stat, p = stats.ttest_ind(samples_set1, samples_set2)

print("ttest for feature: ", column)
print("p-value : ", p)
print("")
print("")

# --- Code cell 39 ---
# p value <0.05 - lead time is useful feature for prediction

# --- Code cell 40 ---
# chi-square test
# checks the relation between two categorical columns,
# e.g. gender and hotel booking status

# welch t-test
# checks the relation between a numerical column and a categorical column,
# e.g. lead time and hotel booking status

# --- Code cell 41 ---
# feature A - p >0.05 -> this feature might not be useful just on its own
# but it can still be useful when combined with other features

πŸ’­ Short reflection

In one sentence: what does it mean when we say β€œwe reject the null hypothesis at Ξ± = 0.05”? Why not say β€œwe proved the alternative”?

βœ… CORE (Must know)

  • Null (Hβ‚€) vs alternative (H₁); p-value = P(observed or more extreme | Hβ‚€ true).
  • p < 0.05 β†’ reject Hβ‚€ (significant); p β‰₯ 0.05 β†’ fail to reject (not β€œaccept H₀”).
  • Chi-Square: categorical vs categorical; contingency table; chi2_contingency.
  • T-Test: compare means of two groups; ttest_ind.
  • ANOVA: compare means across 3+ groups; f_oneway.
  • Feature selection: combine tests with domain knowledge and ML validation.

πŸ“š NON-CORE (Good to know)

  • Type I (false positive) vs Type II (false negative) error.
  • Paired t-test for before/after or matched groups.
  • Post-hoc tests after ANOVA (e.g. Tukey).

Summary: Your Hypothesis Testing Toolkit

| Scenario                             | Test           | Python Code                |
|--------------------------------------|----------------|----------------------------|
| Is gender related to product choice? | Chi-Square     | chi2_contingency(crosstab) |
| Do men spend more than women?        | T-Test         | ttest_ind(group1, group2)  |
| Do sales differ across 4 regions?    | ANOVA          | f_oneway(g1, g2, g3, g4)   |
| Which features predict churn?        | Multiple tests | Chi-Square + T-Test loop   |

πŸ“ The Golden Rule

If p-value < 0.05 β†’ Reject Hβ‚€ (statistically significant)
If p-value β‰₯ 0.05 β†’ Cannot reject Hβ‚€ (not enough evidence)