πŸ”¬ Hypothesis Testing Deep Dive

Learn to prove (or disprove!) claims with data. Make decisions with confidence using statistical tests.

πŸ“₯ Dataset for this lesson

Examples use the Hotel Reservations dataset. Download and save it in the same folder as your script so pd.read_csv("Hotel Reservations.csv") works.

Download Hotel Reservations.csv (3.1 MB)

Part 1: The Detective Framework

Hypothesis testing is like being a detective. You have a claim to investigate, evidence to analyze, and a verdict to deliver!

πŸ‘Ά In One Sentence (Like You're 5)

Hypothesis testing answers: "Could this pattern in the data just be luck?" We run a test and get a number called the p-value. If the p-value is very small (usually below 0.05), we say the pattern is probably real; if not, we say we don't have enough evidence. So we never "prove" anythingβ€”we only decide whether the evidence is strong enough.

πŸ•΅οΈ The Detective Analogy

Crime Scene: A business question ("Does the new website increase sales?")

Evidence: Data from experiments and observations

Investigation: Statistical tests

Verdict: "Statistically significant" or "Not enough evidence"

The Hypothesis Testing Process

Step 1: State the Hypotheses

Hβ‚€ (Null): "Nothing is happening" - The default assumption
H₁ (Alternative): "Something IS happening" - What you want to prove

Step 2: Set the Significance Level (Ξ±)

Usually Ξ± = 0.05 (5%). This is your "threshold for surprise" - how unlikely must the evidence be to convince you?

Step 3: Collect Data & Calculate the Test Statistic

Run your experiment, gather data, and calculate the appropriate test statistic (t, z, χ², etc.)

Step 4: Calculate the P-Value

The probability of seeing data this extreme IF Hβ‚€ were true. Small p-value = strong evidence against Hβ‚€.

Step 5: Make Your Decision

If p-value < Ξ±: Reject Hβ‚€ β†’ "Statistically significant!"
If p-value β‰₯ Ξ±: Fail to reject Hβ‚€ β†’ "Not enough evidence"
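The five steps above can be walked through end to end on a toy example. The sketch below uses a hypothetical coin-flip experiment (62 heads in 100 flips, numbers invented for illustration) and `scipy.stats.binomtest`, available in SciPy 1.7+:

```python
from scipy.stats import binomtest

# Step 1: state the hypotheses
# H0: the coin is fair (P(heads) = 0.5)
# H1: the coin is not fair (P(heads) != 0.5)

# Step 2: set the significance level
alpha = 0.05

# Step 3: collect data -- suppose 100 flips produce 62 heads
n_flips, n_heads = 100, 62

# Step 4: compute the p-value with an exact binomial test
result = binomtest(n_heads, n=n_flips, p=0.5, alternative='two-sided')
print(f"P-value: {result.pvalue:.4f}")

# Step 5: make the decision
if result.pvalue < alpha:
    print("Reject H0 -> the coin is probably biased")
else:
    print("Fail to reject H0 -> not enough evidence of bias")
```

With 62 heads the p-value comes out below 0.05, so this hypothetical coin would be flagged as biased; with, say, 55 heads it would not.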

πŸ“Œ Critical Understanding: P-Value

P-value answers: "If nothing special is happening (Hβ‚€ is true), how likely is it to see results this extreme by pure chance?"

Small p-value (< 0.05): Very unlikely by chance β†’ Something IS happening!

Large p-value (β‰₯ 0.05): Could easily happen by chance β†’ Can't conclude anything special
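You can see this definition directly by simulation: generate many datasets where Hβ‚€ really is true and count how often pure chance produces something "extreme". The sketch below uses a fair coin and asks how often random flipping yields a result as lopsided as 62 heads (or 38 or fewer) in 100 flips; all numbers are chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 10,000 experiments where H0 is TRUE: a fair coin, 100 flips each
n_sims, n_flips = 10_000, 100
heads = rng.binomial(n_flips, 0.5, size=n_sims)

# How often is the outcome at least as extreme as 62 heads (>= 62 or <= 38)?
extreme = np.mean((heads >= 62) | (heads <= 38))
print(f"Fraction of pure-chance results this extreme: {extreme:.3f}")
```

The simulated fraction lands close to the exact binomial p-value: that is all a p-value is, the long-run frequency of data this extreme in a world where Hβ‚€ holds.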

Part 2: Chi-Square Test (Categorical vs Categorical)

Use Chi-Square when comparing categories - like gender vs product preference, or meal plan vs booking cancellation.

🏨 Hotel Booking Example

Question: Is there a relationship between meal plan type and booking cancellation?

Hβ‚€: Meal plan and cancellation are independent (no relationship)

H₁: Meal plan and cancellation ARE related

Step 1: Create a Contingency Table

A contingency table shows the frequency count for each combination of categories:

| Meal Plan    | Canceled | Not Canceled | Total  |
|--------------|----------|--------------|--------|
| Meal Plan 1  | 8,679    | 19,156       | 27,835 |
| Meal Plan 2  | 1,506    | 1,799        | 3,305  |
| Not Selected | 1,699    | 3,431        | 5,130  |

import pandas as pd
from scipy.stats import chi2_contingency

# Load hotel reservations data
data = pd.read_csv("Hotel Reservations.csv")

# Create contingency table
contingency_table = pd.crosstab(
    data['type_of_meal_plan'], 
    data['booking_status']
)
print("Contingency Table:")
print(contingency_table)

# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"P-value: {p_value:.2e}")  # Scientific notation
print(f"Degrees of Freedom: {dof}")

# Interpret
if p_value < 0.05:
    print("βœ… REJECT Hβ‚€: Meal plan and cancellation ARE related!")
else:
    print("❌ FAIL TO REJECT Hβ‚€: No significant relationship found.")

# Output:
# Chi-Square Statistic: 276.35
# P-value: 4.48e-61  ← Extremely small!
# βœ… REJECT Hβ‚€: Meal plan and cancellation ARE related!
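The `expected` array returned by `chi2_contingency` is worth inspecting: it holds the counts you would see if the two variables were independent, so comparing observed against expected shows which cells drive the result. A sketch using the counts from the contingency table above, hard-coded so it runs without the CSV:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above
observed = np.array([
    [8679, 19156],   # Meal Plan 1: canceled, not canceled
    [1506,  1799],   # Meal Plan 2
    [1699,  3431],   # Not Selected
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Degrees of freedom: {dof}")   # (rows - 1) * (cols - 1) = 2

# Counts we WOULD see if meal plan and cancellation were independent
print(np.round(expected, 1))

# Positive entries = more bookings in that cell than independence predicts;
# Meal Plan 2 shows far more cancellations than expected
print(np.round(observed - expected, 1))
```

Here the Meal Plan 2 row carries most of the chi-square statistic, which is the actionable detail the single p-value hides.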

🎯 What Does This Mean for Business?

Customers with different meal plans have different cancellation rates! You can now:

  • Offer incentives to high-cancellation meal plan groups
  • Adjust pricing based on cancellation risk
  • Target marketing to low-cancellation groups

Multiple Features at Once

# Test multiple categorical features against booking status
categorical_features = ['type_of_meal_plan', 'room_type_reserved', 
                        'no_of_weekend_nights', 'no_of_children']

for feature in categorical_features:
    # Create contingency table
    table = pd.crosstab(data[feature], data['booking_status'])
    
    # Chi-Square test
    chi2, p_value, dof, expected = chi2_contingency(table)
    
    print(f"\n{feature}:")
    print(f"  P-value: {p_value:.2e}")
    
    if p_value < 0.05:
        print(f"  βœ… SIGNIFICANT - Use this feature for prediction!")
    else:
        print(f"  ❌ Not significant - May not be useful alone")

# Output:
# type_of_meal_plan:
#   P-value: 4.48e-61
#   βœ… SIGNIFICANT - Use this feature for prediction!
# 
# room_type_reserved:
#   P-value: 4.43e-11
#   βœ… SIGNIFICANT - Use this feature for prediction!
# 
# no_of_weekend_nights:
#   P-value: 1.12e-40
#   βœ… SIGNIFICANT - Use this feature for prediction!

Part 3: T-Test (Comparing Numerical Means)

Use T-Test when comparing average values between two groups - like average spending by gender, or conversion rates between website versions.

Types of T-Tests

| Type               | When to Use                               | Example                         |
|--------------------|-------------------------------------------|---------------------------------|
| Independent T-Test | Comparing two separate groups             | Male vs Female average spending |
| Paired T-Test      | Same group measured twice                 | Weight before vs after diet     |
| Welch's T-Test     | Two groups with different sizes/variances | Most real-world scenarios       |
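In scipy, `ttest_ind` runs the classic Student's (equal-variance) test by default; pass `equal_var=False` to get Welch's version, and use `ttest_rel` for the paired case. A minimal sketch on synthetic data (group sizes, means, and spreads are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two independent groups with different sizes AND variances -> Welch's t-test
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=108, scale=25, size=80)
t_w, p_w = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's
print(f"Welch's t-test p-value: {p_w:.4f}")

# Same subjects measured twice (before/after a diet) -> paired t-test
before = rng.normal(loc=80, scale=5, size=30)
after = before - rng.normal(loc=2, scale=1.5, size=30)  # ~2 kg lost on average
t_p, p_p = stats.ttest_rel(before, after)
print(f"Paired t-test p-value: {p_p:.4f}")
```

The paired test is much more powerful here because it compares each person to themselves, removing person-to-person variation from the noise.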

🏨 Hotel Example: Lead Time & Cancellation

Question: Do customers who cancel book further in advance?

Hβ‚€: Lead time is the same for canceled and non-canceled bookings

H₁: Lead time is DIFFERENT between groups

from scipy import stats
import pandas as pd

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Separate the two groups
canceled = data[data['booking_status'] == 'Canceled']['lead_time']
not_canceled = data[data['booking_status'] == 'Not_Canceled']['lead_time']

# Compare the means first
print("Descriptive Statistics:")
print(f"Canceled bookings - Mean lead time: {canceled.mean():.1f} days")
print(f"Not canceled bookings - Mean lead time: {not_canceled.mean():.1f} days")

# Perform independent t-test (scipy defaults to Student's equal-variance
# test; pass equal_var=False if you want Welch's version)
t_stat, p_value = stats.ttest_ind(canceled, not_canceled)

print(f"\nT-statistic: {t_stat:.2f}")
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("βœ… SIGNIFICANT: Lead time differs between groups!")
else:
    print("❌ Not significant.")

# Output:
# Descriptive Statistics:
# Canceled bookings - Mean lead time: 135.2 days
# Not canceled bookings - Mean lead time: 72.8 days
# 
# T-statistic: 54.23
# P-value: 0.0  ← So small it's essentially zero!
# βœ… SIGNIFICANT: Lead time differs between groups!

🎯 Business Insight

Customers who cancel book ~62 days earlier on average! You can:

  • Send reminder emails for long-lead-time bookings
  • Require deposits for far-advance bookings
  • Offer incentives for keeping reservations

Testing Multiple Numerical Features

# Test multiple numerical features
numerical_features = ['lead_time', 'avg_price_per_room']

for feature in numerical_features:
    canceled = data[data['booking_status'] == 'Canceled'][feature]
    not_canceled = data[data['booking_status'] == 'Not_Canceled'][feature]
    
    # T-Test
    t_stat, p_value = stats.ttest_ind(canceled, not_canceled)
    
    print(f"\n{feature}:")
    print(f"  Canceled mean: {canceled.mean():.2f}")
    print(f"  Not canceled mean: {not_canceled.mean():.2f}")
    print(f"  P-value: {p_value:.2e}")
    
    if p_value < 0.05:
        print(f"  βœ… SIGNIFICANT - Great predictor of cancellation!")
    else:
        print(f"  ❌ Not significant")

# Output:
# lead_time:
#   Canceled mean: 135.19
#   Not canceled mean: 72.84
#   P-value: 0.00e+00
#   βœ… SIGNIFICANT - Great predictor of cancellation!
# 
# avg_price_per_room:
#   Canceled mean: 108.71
#   Not canceled mean: 100.56
#   P-value: 5.23e-164
#   βœ… SIGNIFICANT - Great predictor of cancellation!

Part 4: Choosing the Right Test

🧭 Decision Flowchart

What type of data are you comparing?

πŸ“Š Categorical vs Categorical?

β†’ Use Chi-Square Test

Example: Gender vs Product Preference

πŸ“ˆ Numerical, comparing 2 groups?

β†’ Use T-Test

Example: Average spending - Male vs Female

πŸ“ˆ Numerical, comparing 3+ groups?

β†’ Use ANOVA

Example: Sales across North, South, East, West regions

πŸ“ˆ Numerical vs Numerical relationship?

β†’ Use Correlation Test

Example: Hours studied vs Test score

| Test                | Data Types             | Question Answered             | Python Function      |
|---------------------|------------------------|-------------------------------|----------------------|
| Chi-Square          | Cat vs Cat             | Are these categories related? | chi2_contingency()   |
| T-Test              | Num vs Cat (2 groups)  | Are the means different?      | ttest_ind()          |
| ANOVA               | Num vs Cat (3+ groups) | Is any group mean different?  | f_oneway()           |
| Pearson Correlation | Num vs Num             | Do they move together?        | pearsonr()           |
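ANOVA and correlation don't appear in the hotel examples, so here is a small sketch of both on synthetic data (the region sales figures and the hours/score relationship are invented for illustration):

```python
import numpy as np
from scipy.stats import f_oneway, pearsonr

rng = np.random.default_rng(1)

# ANOVA: average sales across four regions (3+ groups)
north = rng.normal(50, 8, size=40)
south = rng.normal(55, 8, size=40)
east = rng.normal(50, 8, size=40)
west = rng.normal(62, 8, size=40)   # clearly higher than the others
f_stat, p_anova = f_oneway(north, south, east, west)
print(f"ANOVA p-value: {p_anova:.2e}")  # small -> at least one region differs

# Pearson correlation: hours studied vs test score (two numerical variables)
hours = rng.uniform(0, 10, size=60)
scores = 50 + 4 * hours + rng.normal(0, 5, size=60)
r, p_corr = pearsonr(hours, scores)
print(f"Pearson r = {r:.2f}, p-value = {p_corr:.2e}")
```

Note that ANOVA only says *some* group mean differs; identifying *which* one requires a post-hoc test such as Tukey's.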

Part 5: Common Mistakes to Avoid

❌ Mistake 1: p-value = Probability Hβ‚€ is True

"P-value of 0.03 means there's only 3% chance the null is true"

Wrong! P-value is the probability of seeing this data IF Hβ‚€ is true, not the probability that Hβ‚€ is true.

βœ… Correct Interpretation

"IF nothing special is happening, there's only a 3% chance of seeing data this extreme by random chance."

This is strong evidence AGAINST Hβ‚€, but not proof.

❌ Mistake 2: "Fail to Reject" = "Accept Hβ‚€"

"The p-value is 0.15, so we accept the null hypothesis"

Wrong! We never "accept" the null - we just don't have enough evidence to reject it.

βœ… Correct Language

"We fail to reject Hβ‚€" or "There is insufficient evidence to conclude..."

Absence of evidence is not evidence of absence!

❌ Mistake 3: p < 0.05 Always Means Important

"The p-value is 0.001, so this is a huge effect!"

Wrong! Statistical significance β‰  Practical significance. With large samples, tiny differences can be "significant".

βœ… Correct Approach

Always report EFFECT SIZE alongside p-value.

"Website B increases conversion by 0.1% (p=0.001)" - Is 0.1% worth the effort?
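One common effect-size measure is Cohen's d, the difference of means in pooled-standard-deviation units (this helper is not from the course source; the data are synthetic). The sketch below shows how an enormous sample makes a trivial difference "significant" while d stays tiny:

```python
import numpy as np
from scipy import stats

def cohens_d(group1, group2):
    """Effect size: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * np.var(group1, ddof=1) +
                  (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(group1) - np.mean(group2)) / np.sqrt(pooled_var)

rng = np.random.default_rng(7)

# Two huge samples whose true means differ by only 0.5 (sd = 15)
a = rng.normal(100.0, 15, size=100_000)
b = rng.normal(100.5, 15, size=100_000)

t, p = stats.ttest_ind(a, b)
d = cohens_d(a, b)
print(f"p-value: {p:.2e}")      # "significant" thanks to the huge n
print(f"Cohen's d: {d:.3f}")    # but |d| is tiny (roughly 0.03)
```

A rough convention: |d| β‰ˆ 0.2 is small, 0.5 medium, 0.8 large; here the effect is far below even "small" despite the significant p-value.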

⚠️ p-Hacking Warning

What is it? Running many tests until you find one with p < 0.05 by pure chance.

Why it's bad: If you test 20 hypotheses at Ξ±=0.05, you'd expect 1 false positive by chance!

Solution: Pre-register your hypotheses. If testing multiple comparisons, use Bonferroni correction (Ξ±/n).
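A minimal sketch of the Bonferroni correction, using made-up p-values for five hypothetical features:

```python
# Bonferroni correction: with n tests, compare each p-value to alpha / n
alpha = 0.05
p_values = {                      # hypothetical p-values from 5 separate tests
    'feature_a': 0.001,
    'feature_b': 0.012,
    'feature_c': 0.030,
    'feature_d': 0.049,
    'feature_e': 0.600,
}

adjusted_alpha = alpha / len(p_values)   # 0.05 / 5 = 0.01
print(f"Bonferroni-adjusted alpha: {adjusted_alpha}")

for name, p in p_values.items():
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{name}: p={p} -> {verdict}")

# Only feature_a survives; 0.012, 0.030, and 0.049 would have passed the
# naive 0.05 cutoff but fail the corrected one
```

Bonferroni is deliberately conservative; if it discards too much, milder corrections such as Holm's step-down procedure exist.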

Part 6: Using Hypothesis Tests for Feature Selection

Before building a machine learning model, use hypothesis tests to identify which features are actually related to your target variable!

🎯 Feature Selection Workflow

Goal: Predict hotel booking cancellation

Available features: meal_plan, room_type, lead_time, price, weekend_nights, children

Question: Which features should we include in our model?

import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Define feature types
categorical_features = ['type_of_meal_plan', 'room_type_reserved']
numerical_features = ['lead_time', 'avg_price_per_room']

print("="*50)
print("FEATURE SELECTION USING HYPOTHESIS TESTS")
print("="*50)

selected_features = []

# Test categorical features with Chi-Square
print("\nπŸ“Š CATEGORICAL FEATURES (Chi-Square Test)")
for feature in categorical_features:
    table = pd.crosstab(data[feature], data['booking_status'])
    chi2, p_value, dof, expected = chi2_contingency(table)
    
    if p_value < 0.05:
        selected_features.append(feature)
        print(f"βœ… {feature}: p={p_value:.2e} β†’ SELECTED")
    else:
        print(f"❌ {feature}: p={p_value:.2f} β†’ REJECTED")

# Test numerical features with T-Test
print("\nπŸ“ˆ NUMERICAL FEATURES (T-Test)")
for feature in numerical_features:
    canceled = data[data['booking_status'] == 'Canceled'][feature]
    not_canceled = data[data['booking_status'] == 'Not_Canceled'][feature]
    
    t_stat, p_value = ttest_ind(canceled, not_canceled)
    
    if p_value < 0.05:
        selected_features.append(feature)
        print(f"βœ… {feature}: p={p_value:.2e} β†’ SELECTED")
    else:
        print(f"❌ {feature}: p={p_value:.2f} β†’ REJECTED")

print(f"\n🎯 FINAL SELECTED FEATURES: {selected_features}")

# Output:
# ==================================================
# FEATURE SELECTION USING HYPOTHESIS TESTS
# ==================================================
# 
# πŸ“Š CATEGORICAL FEATURES (Chi-Square Test)
# βœ… type_of_meal_plan: p=4.48e-61 β†’ SELECTED
# βœ… room_type_reserved: p=4.43e-11 β†’ SELECTED
# 
# πŸ“ˆ NUMERICAL FEATURES (T-Test)
# βœ… lead_time: p=0.00e+00 β†’ SELECTED
# βœ… avg_price_per_room: p=5.23e-164 β†’ SELECTED
# 
# 🎯 FINAL SELECTED FEATURES: ['type_of_meal_plan', 'room_type_reserved', 
#                              'lead_time', 'avg_price_per_room']

πŸ’‘ Important Note

A feature with p > 0.05 might still be useful when combined with other features! Use hypothesis tests as a starting point, not the final word. Cross-validation with actual ML models gives the true answer.

πŸ“˜ From the course notebook (Hypothesis Testing)

The course source covers t-tests, chi-square, ANOVA with examples. Key ideas: scipy.stats.ttest_ind (two groups), chi2_contingency (categorical vs categorical), f_oneway (3+ groups). Use real data (e.g. from the datasets page) to run tests and interpret p-values. See Hypothesis testing.pdf in the course source for slides.

Complete code from course notebook: hypothesis_testing.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>
h1 { color: Purple !important; }
h2 { color: green !important; }
h3 { color: blue !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd

# --- Code cell 3 ---
data = pd.read_csv("Hotel Reservations.csv")

# --- Code cell 4 ---
data.head(10)

# --- Code cell 5 ---
data.columns

# --- Code cell 6 ---
data.info()

# --- Code cell 7 ---
data = data[['no_of_weekend_nights' ,'no_of_children','type_of_meal_plan','room_type_reserved','lead_time','avg_price_per_room','booking_status']]

# --- Code cell 8 ---
data['booking_status'].value_counts()

# --- Code cell 9 ---
data['no_of_children'].value_counts()

# --- Code cell 10 ---
data['type_of_meal_plan'].value_counts()

# --- Code cell 11 ---
data['room_type_reserved'].value_counts()

# --- Code cell 12 ---
data.columns

# --- Code cell 13 ---
categorical_columns = ['no_of_weekend_nights' ,'no_of_children', 'type_of_meal_plan', 'room_type_reserved']

# --- Code cell 17 ---
from scipy.stats import chi2_contingency

# --- Code cell 18 ---
# run a chi-square test for each categorical feature
for column in categorical_columns:
    # keep only categories with more than 20 examples
    temp = pd.DataFrame(data[column].value_counts()).reset_index()
    temp.columns = [column, 'frequency']
    categories = list(temp[temp['frequency'] > 20][column])
    data_new = data[data[column].isin(categories)]
    print(data_new[column].value_counts())

    table = pd.crosstab(data_new[column], data_new['booking_status'])
    print(table)
    stat, p, dof, expected = chi2_contingency(table)
    print("Chi-square test for feature: ", column)
    print("p-value : ", p)
    print("")
    print("")
# --- Code cell 19 ---
column = 'no_of_weekend_nights'
temp = pd.DataFrame(data[column].value_counts()).reset_index()
print(temp)

# --- Code cell 20 ---
temp.columns = [column,'frequency']
temp.head(10)

# --- Code cell 21 ---
#Step 1 - remove rows with 20 or less examples
categories = list(temp[temp['frequency']>20][column])
print(categories)

# --- Code cell 22 ---
data_new = data[data[column].isin(categories)]
print(data_new[column].value_counts())

# --- Code cell 23 ---
#Step 2 - create a contingency table using crosstab
table = pd.crosstab(data_new[column], data_new['booking_status'])
print(table)

#Step 3 - Perform chi square test using chi2_contingency function
stat, p, dof, expected = chi2_contingency(table)
print("Chi-square test for feature: ", column)
print("p-value : ", p)
print("")
print("")

# --- Code cell 24 ---
# if p-value < 0.05 -> reject H0 in favor of the alternative hypothesis
# (at least 95% confidence)

# --- Code cell 27 ---
numerical_columns = ['lead_time','avg_price_per_room']

# --- Code cell 28 ---
data_cancelled = data[data['booking_status'] == 'Canceled']
data_cancelled[numerical_columns].describe()

# --- Code cell 29 ---
data_not_cancelled = data[data['booking_status'] == 'Not_Canceled']
data_not_cancelled[numerical_columns].describe()

# --- Code cell 30 ---
from scipy import stats

# run a t-test for each numerical feature
for column in numerical_columns:
    samples_set1 = data_cancelled[column]
    samples_set2 = data_not_cancelled[column]
    stat, p = stats.ttest_ind(samples_set1, samples_set2)

    print("ttest for feature: ", column)
    print("p-value : ", p)
    print("")
    print("")

# --- Code cell 31 ---
#lead_time - 0.4 -o/p

# --- Code cell 32 ---
#lead_time , price , type plan ,

# --- Code cell 34 ---
from scipy import stats

# --- Code cell 35 ---
column = 'lead_time'

# --- Code cell 36 ---
# Step 1 - create two populations for the numerical feature
samples_set1 = data_cancelled[column] # population1 -for cancelled bookings
samples_set2 = data_not_cancelled[column] # population2 -for not cancelled bookings

# --- Code cell 37 ---
samples_set1

# --- Code cell 38 ---
# Step 2 - pass the populations to ttest function
stat, p = stats.ttest_ind(samples_set1, samples_set2)

print("ttest for feature: ", column)
print("p-value : ", p)
print("")
print("")

# --- Code cell 39 ---
# p value <0.05 - lead time is useful feature for prediction

# --- Code cell 40 ---
# chi-square test
# checks the relation between two categorical columns,
# e.g. gender and hotel booking status

# welch t-test
# checks the relation between a numerical column and a categorical column,
# e.g. lead time and hotel booking status

# --- Code cell 41 ---
# feature A - p >0.05 -> this feature might not be useful just on its own
# but it can still be useful when combined with other features

πŸ’­ Short reflection

In one sentence: what does it mean when we say β€œwe reject the null hypothesis at Ξ± = 0.05”? Why not say β€œwe proved the alternative”?

βœ… CORE (Must know)

  • Null (Hβ‚€) vs alternative (H₁); p-value = P(observed or more extreme | Hβ‚€ true).
  • p < 0.05 β†’ reject Hβ‚€ (significant); p β‰₯ 0.05 β†’ fail to reject (not β€œaccept H₀”).
  • Chi-Square: categorical vs categorical; contingency table; chi2_contingency.
  • T-Test: compare means of two groups; ttest_ind.
  • ANOVA: compare means across 3+ groups; f_oneway.
  • Feature selection: combine tests with domain knowledge and ML validation.

πŸ“š NON-CORE (Good to know)

  • Type I (false positive) vs Type II (false negative) error.
  • Paired t-test for before/after or matched groups.
  • Post-hoc tests after ANOVA (e.g. Tukey).

Summary: Your Hypothesis Testing Toolkit

| Scenario                             | Test           | Python Code                |
|--------------------------------------|----------------|----------------------------|
| Is gender related to product choice? | Chi-Square     | chi2_contingency(crosstab) |
| Do men spend more than women?        | T-Test         | ttest_ind(group1, group2)  |
| Do sales differ across 4 regions?    | ANOVA          | f_oneway(g1, g2, g3, g4)   |
| Which features predict churn?        | Multiple tests | Chi-Square + T-Test loop   |

πŸ“ The Golden Rule

If p-value < 0.05 β†’ Reject Hβ‚€ (statistically significant)
If p-value β‰₯ 0.05 β†’ Cannot reject Hβ‚€ (not enough evidence)