Learn to prove (or disprove!) claims with data. Make decisions with confidence using statistical tests.
Examples use the Hotel Reservations dataset. Download and save it in the same folder as your script so pd.read_csv("Hotel Reservations.csv") works.
Hypothesis testing is like being a detective. You have a claim to investigate, evidence to analyze, and a verdict to deliver!
Hypothesis testing answers: "Could this pattern in the data just be luck?" We run a test and get a number called the p-value. If the p-value is very small (usually below 0.05), we say the pattern is probably real; if not, we say we don't have enough evidence. So we never "prove" anything; we only decide whether the evidence is strong enough.
Crime Scene: A business question ("Does the new website increase sales?")
Evidence: Data from experiments and observations
Investigation: Statistical tests
Verdict: "Statistically significant" or "Not enough evidence"
H₀ (Null): "Nothing is happening" - The default assumption
H₁ (Alternative): "Something IS happening" - What you want to prove
Usually α = 0.05 (5%). This is your "threshold for surprise" - how unlikely must the evidence be to convince you?
Run your experiment, gather data, and calculate the appropriate test statistic (t, z, χ², etc.)
The probability of seeing data this extreme IF H₀ were true. Small p-value = strong evidence against H₀.
If p-value < α: Reject H₀ → "Statistically significant!"
If p-value ≥ α: Fail to reject H₀ → "Not enough evidence"
P-value answers: "If nothing special is happening (H₀ is true), how likely is it to see results this extreme by pure chance?"
Small p-value (< 0.05): Very unlikely by chance → Something IS happening!
Large p-value (≥ 0.05): Could easily happen by chance → Can't conclude anything special
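To make that definition concrete, here is a quick simulation (not from the course material): it estimates a two-sided p-value for 60 heads in 100 coin flips by counting how often a fair coin produces a result at least that extreme.

```python
import numpy as np

rng = np.random.default_rng(42)

# Claim to check: a coin that landed heads 60 times out of 100 is biased.
# H0: the coin is fair. How often would a FAIR coin give a result
# at least this far from 50/50?
observed_heads = 60
n_flips = 100

# Simulate 100,000 runs of the experiment under H0
simulated = rng.binomial(n=n_flips, p=0.5, size=100_000)

# Two-sided: count results at least as extreme as the one observed
extreme = np.abs(simulated - n_flips / 2) >= abs(observed_heads - n_flips / 2)
p_value = extreme.mean()

print(f"Simulated p-value: {p_value:.3f}")  # close to the exact value, ~0.057
```

A p-value near 0.057 sits just above α = 0.05, so with this evidence we would (narrowly) fail to reject H₀ - a good reminder that 0.05 is a convention, not a law of nature.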
Use Chi-Square when comparing categories - like gender vs product preference, or meal plan vs booking cancellation.
Question: Is there a relationship between meal plan type and booking cancellation?
H₀: Meal plan and cancellation are independent (no relationship)
H₁: Meal plan and cancellation ARE related
A contingency table shows the frequency count for each combination of categories:
| Meal Plan | Canceled | Not Canceled | Total |
|---|---|---|---|
| Meal Plan 1 | 8,679 | 19,156 | 27,835 |
| Meal Plan 2 | 1,506 | 1,799 | 3,305 |
| Not Selected | 1,699 | 3,431 | 5,130 |
```python
import pandas as pd
from scipy.stats import chi2_contingency

# Load hotel reservations data
data = pd.read_csv("Hotel Reservations.csv")

# Create contingency table
contingency_table = pd.crosstab(
    data['type_of_meal_plan'],
    data['booking_status']
)
print("Contingency Table:")
print(contingency_table)

# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"P-value: {p_value:.2e}")  # Scientific notation
print(f"Degrees of Freedom: {dof}")

# Interpret
if p_value < 0.05:
    print("✅ REJECT H₀: Meal plan and cancellation ARE related!")
else:
    print("❌ FAIL TO REJECT H₀: No significant relationship found.")

# Output:
# Chi-Square Statistic: 276.35
# P-value: 4.48e-61  ← Extremely small!
# ✅ REJECT H₀: Meal plan and cancellation ARE related!
```
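To see where the chi-square statistic comes from, it helps to inspect the `expected` array the test returns: the counts we would see if H₀ (independence) were exactly true. A small sketch using the counts from the table above (the statistic may differ slightly from the full-data run, since the table shows only three meal-plan categories):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above
observed = np.array([
    [8679, 19156],   # Meal Plan 1
    [1506,  1799],   # Meal Plan 2
    [1699,  3431],   # Not Selected
])

chi2, p, dof, expected = chi2_contingency(observed)

# Under H0 (independence), each expected cell is
#   row_total * column_total / grand_total
# e.g. Meal Plan 1 / Canceled: 27835 * 11884 / 36270 ≈ 9120
print(np.round(expected, 1))
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
```

The chi-square statistic is just the summed squared gaps between observed and expected counts, scaled by the expected counts; the bigger the gaps, the smaller the p-value.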
Customers with different meal plans have different cancellation rates! You can now:
```python
# Test multiple categorical features against booking status
categorical_features = ['type_of_meal_plan', 'room_type_reserved',
                        'no_of_weekend_nights', 'no_of_children']

for feature in categorical_features:
    # Create contingency table
    table = pd.crosstab(data[feature], data['booking_status'])

    # Chi-Square test
    chi2, p_value, dof, expected = chi2_contingency(table)

    print(f"\n{feature}:")
    print(f"  P-value: {p_value:.2e}")
    if p_value < 0.05:
        print(f"  ✅ SIGNIFICANT - Use this feature for prediction!")
    else:
        print(f"  ❌ Not significant - May not be useful alone")

# Output:
# type_of_meal_plan:
#   P-value: 4.48e-61
#   ✅ SIGNIFICANT - Use this feature for prediction!
#
# room_type_reserved:
#   P-value: 4.43e-11
#   ✅ SIGNIFICANT - Use this feature for prediction!
#
# no_of_weekend_nights:
#   P-value: 1.12e-40
#   ✅ SIGNIFICANT - Use this feature for prediction!
```
Use T-Test when comparing average values between two groups - like average spending by gender, or conversion rates between website versions.
| Type | When to Use | Example |
|---|---|---|
| Independent T-Test | Comparing two separate groups | Male vs Female average spending |
| Paired T-Test | Same group measured twice | Weight before vs after diet |
| Welch's T-Test | Two groups with different sizes/variances | Most real-world scenarios |
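The difference between Student's and Welch's versions is just the `equal_var` flag in `scipy.stats.ttest_ind` (note the default, `equal_var=True`, is Student's test). A small sketch on synthetic data with unequal sizes and variances, the situation where Welch's version is the safer choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic groups with different sizes AND different variances
group_a = rng.normal(loc=100, scale=5, size=30)    # small, low variance
group_b = rng.normal(loc=100, scale=25, size=300)  # large, high variance

# Student's t-test assumes equal variances (scipy's default)
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test drops that assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student's p-value: {p_student:.3f}")
print(f"Welch's   p-value: {p_welch:.3f}")
```

The two p-values differ because Student's test pools the variances; when group sizes and spreads are unbalanced, Welch's version gives the more trustworthy answer.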
Question: Do customers who cancel book further in advance?
H₀: Lead time is the same for canceled and non-canceled bookings
H₁: Lead time is DIFFERENT between groups
```python
from scipy import stats
import pandas as pd

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Separate the two groups
canceled = data[data['booking_status'] == 'Canceled']['lead_time']
not_canceled = data[data['booking_status'] == 'Not_Canceled']['lead_time']

# Compare the means first
print("Descriptive Statistics:")
print(f"Canceled bookings - Mean lead time: {canceled.mean():.1f} days")
print(f"Not canceled bookings - Mean lead time: {not_canceled.mean():.1f} days")

# Independent t-test. Note: scipy's default (equal_var=True) is Student's
# t-test; pass equal_var=False if you want Welch's t-test.
t_stat, p_value = stats.ttest_ind(canceled, not_canceled)

print(f"\nT-statistic: {t_stat:.2f}")
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("✅ SIGNIFICANT: Lead time differs between groups!")
else:
    print("❌ Not significant.")

# Output:
# Descriptive Statistics:
# Canceled bookings - Mean lead time: 135.2 days
# Not canceled bookings - Mean lead time: 72.8 days
#
# T-statistic: 54.23
# P-value: 0.0  ← So small it's essentially zero!
# ✅ SIGNIFICANT: Lead time differs between groups!
```
Customers who cancel book ~62 days earlier on average! You can:
```python
# Test multiple numerical features
numerical_features = ['lead_time', 'avg_price_per_room']

for feature in numerical_features:
    canceled = data[data['booking_status'] == 'Canceled'][feature]
    not_canceled = data[data['booking_status'] == 'Not_Canceled'][feature]

    # T-Test
    t_stat, p_value = stats.ttest_ind(canceled, not_canceled)

    print(f"\n{feature}:")
    print(f"  Canceled mean: {canceled.mean():.2f}")
    print(f"  Not canceled mean: {not_canceled.mean():.2f}")
    print(f"  P-value: {p_value:.2e}")

    if p_value < 0.05:
        print(f"  ✅ SIGNIFICANT - Great predictor of cancellation!")
    else:
        print(f"  ❌ Not significant")

# Output:
# lead_time:
#   Canceled mean: 135.19
#   Not canceled mean: 72.84
#   P-value: 0.00e+00
#   ✅ SIGNIFICANT - Great predictor of cancellation!
#
# avg_price_per_room:
#   Canceled mean: 108.71
#   Not canceled mean: 100.56
#   P-value: 5.23e-164
#   ✅ SIGNIFICANT - Great predictor of cancellation!
```
What type of data are you comparing?
Categorical vs Categorical?
→ Use Chi-Square Test
Example: Gender vs Product Preference
Numerical, comparing 2 groups?
→ Use T-Test
Example: Average spending - Male vs Female
Numerical, comparing 3+ groups?
→ Use ANOVA
Example: Sales across North, South, East, West regions
Numerical vs Numerical relationship?
→ Use Correlation Test
Example: Hours studied vs Test score
| Test | Data Types | Question Answered | Python Function |
|---|---|---|---|
| Chi-Square | Cat vs Cat | Are these categories related? | chi2_contingency() |
| T-Test | Num vs Cat (2 groups) | Are the means different? | ttest_ind() |
| ANOVA | Num vs Cat (3+ groups) | Is any group mean different? | f_oneway() |
| Pearson Correlation | Num vs Num | Do they move together? | pearsonr() |
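The cheat sheet names ANOVA and Pearson correlation, which aren't demonstrated elsewhere in this section. Here is a minimal sketch of both on synthetic data (the region names and the hours-studied example are illustrative, not from the hotel dataset):

```python
import numpy as np
from scipy.stats import f_oneway, pearsonr

rng = np.random.default_rng(1)

# ANOVA: do average sales differ across 4 regions? (synthetic numbers)
north = rng.normal(200, 30, size=50)
south = rng.normal(205, 30, size=50)
east = rng.normal(240, 30, size=50)   # deliberately higher mean
west = rng.normal(198, 30, size=50)

f_stat, p_anova = f_oneway(north, south, east, west)
print(f"ANOVA p-value: {p_anova:.2e}")  # small -> at least one region differs

# Pearson correlation: hours studied vs test score (synthetic numbers)
hours = rng.uniform(0, 10, size=100)
score = 50 + 4 * hours + rng.normal(0, 5, size=100)

r, p_corr = pearsonr(hours, score)
print(f"Pearson r = {r:.2f} (p = {p_corr:.2e})")
```

Note that a significant ANOVA only says *some* group mean differs; it doesn't say which one. A follow-up pairwise comparison (with a multiple-testing correction) is needed to pin that down.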
"P-value of 0.03 means there's only a 3% chance the null is true"
Wrong! P-value is the probability of seeing this data IF H₀ is true, not the probability that H₀ is true.
"IF nothing special is happening, there's only a 3% chance of seeing data this extreme by random chance."
This is strong evidence AGAINST H₀, but not proof.
"The p-value is 0.15, so we accept the null hypothesis"
Wrong! We never "accept" the null - we just don't have enough evidence to reject it.
"We fail to reject H₀" or "There is insufficient evidence to conclude..."
Absence of evidence is not evidence of absence!
"The p-value is 0.001, so this is a huge effect!"
Wrong! Statistical significance ≠ Practical significance. With large samples, tiny differences can be "significant".
Always report EFFECT SIZE alongside p-value.
"Website B increases conversion by 0.1% (p=0.001)" - Is 0.1% worth the effort?
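One common effect-size measure for a two-group comparison is Cohen's d (rule of thumb: ~0.2 small, ~0.5 medium, ~0.8 large). This sketch on synthetic data shows how a huge sample can make a negligible difference "statistically significant":

```python
import numpy as np
from scipy import stats

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * group1.var(ddof=1) +
                  (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
    return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(7)

# Two huge samples whose true means differ by only 0.02 standard deviations
a = rng.normal(loc=0.00, scale=1.0, size=200_000)
b = rng.normal(loc=0.02, scale=1.0, size=200_000)

t, p = stats.ttest_ind(a, b)
d = cohens_d(a, b)

print(f"p-value = {p:.2e}")    # very small -> "statistically significant"
print(f"Cohen's d = {d:.3f}")  # magnitude near 0.02 -> practically negligible
```

Reporting both numbers tells the full story: the difference is almost certainly real, and it is almost certainly too small to matter.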
What is it? Running many tests until you find one with p < 0.05 by pure chance.
Why it's bad: If you test 20 hypotheses at α = 0.05, you'd expect 1 false positive by chance!
Solution: Pre-register your hypotheses. If testing multiple comparisons, use the Bonferroni correction (α/n).
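A minimal sketch of the Bonferroni rule (the feature names and p-values here are hypothetical):

```python
# Hypothetical p-values from four separate tests (illustrative numbers)
alpha = 0.05
p_values = {
    'feature_a': 0.004,
    'feature_b': 0.030,  # "significant" at 0.05, but not after correction
    'feature_c': 0.600,
    'feature_d': 0.011,
}

# Bonferroni: compare each p-value against alpha / (number of tests)
corrected_alpha = alpha / len(p_values)  # 0.05 / 4 = 0.0125

for name, p in p_values.items():
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"{name}: p={p:.3f} -> {verdict} (threshold {corrected_alpha:.4f})")
```

Bonferroni is deliberately conservative; it controls the chance of *any* false positive at the cost of some power, which is usually the right trade when you are fishing through many candidate features.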
Before building a machine learning model, use hypothesis tests to identify which features are actually related to your target variable!
Goal: Predict hotel booking cancellation
Available features: meal_plan, room_type, lead_time, price, weekend_nights, children
Question: Which features should we include in our model?
```python
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

# Load data
data = pd.read_csv("Hotel Reservations.csv")

# Define feature types
categorical_features = ['type_of_meal_plan', 'room_type_reserved']
numerical_features = ['lead_time', 'avg_price_per_room']

print("=" * 50)
print("FEATURE SELECTION USING HYPOTHESIS TESTS")
print("=" * 50)

selected_features = []

# Test categorical features with Chi-Square
print("\nCATEGORICAL FEATURES (Chi-Square Test)")
for feature in categorical_features:
    table = pd.crosstab(data[feature], data['booking_status'])
    chi2, p_value, dof, expected = chi2_contingency(table)
    if p_value < 0.05:
        selected_features.append(feature)
        print(f"✅ {feature}: p={p_value:.2e} → SELECTED")
    else:
        print(f"❌ {feature}: p={p_value:.2f} → REJECTED")

# Test numerical features with T-Test
print("\nNUMERICAL FEATURES (T-Test)")
for feature in numerical_features:
    canceled = data[data['booking_status'] == 'Canceled'][feature]
    not_canceled = data[data['booking_status'] == 'Not_Canceled'][feature]
    t_stat, p_value = ttest_ind(canceled, not_canceled)
    if p_value < 0.05:
        selected_features.append(feature)
        print(f"✅ {feature}: p={p_value:.2e} → SELECTED")
    else:
        print(f"❌ {feature}: p={p_value:.2f} → REJECTED")

print(f"\nFINAL SELECTED FEATURES: {selected_features}")

# Output:
# ==================================================
# FEATURE SELECTION USING HYPOTHESIS TESTS
# ==================================================
#
# CATEGORICAL FEATURES (Chi-Square Test)
# ✅ type_of_meal_plan: p=4.48e-61 → SELECTED
# ✅ room_type_reserved: p=4.43e-11 → SELECTED
#
# NUMERICAL FEATURES (T-Test)
# ✅ lead_time: p=0.00e+00 → SELECTED
# ✅ avg_price_per_room: p=5.23e-164 → SELECTED
#
# FINAL SELECTED FEATURES: ['type_of_meal_plan', 'room_type_reserved',
#                           'lead_time', 'avg_price_per_room']
```
A feature with p > 0.05 might still be useful when combined with other features! Use hypothesis tests as a starting point, not the final word. Cross-validation with actual ML models gives the true answer.
The course source covers t-tests, chi-square, ANOVA with examples. Key ideas: scipy.stats.ttest_ind (two groups), chi2_contingency (categorical vs categorical), f_oneway (3+ groups). Use real data (e.g. from the datasets page) to run tests and interpret p-values. See Hypothesis testing.pdf in the course source for slides.
Every line of code (verbatim).
# --- Code cell 1 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: Purple !important; }
h2 { color: green !important; }
h3 { color: blue !important; }
</style>
""")
# --- Code cell 2 ---
import pandas as pd
# --- Code cell 3 ---
data = pd.read_csv("Hotel Reservations.csv")
# --- Code cell 4 ---
data.head(10)
# --- Code cell 5 ---
data.columns
# --- Code cell 6 ---
data.info()
# --- Code cell 7 ---
data = data[['no_of_weekend_nights' ,'no_of_children','type_of_meal_plan','room_type_reserved','lead_time','avg_price_per_room','booking_status']]
# --- Code cell 8 ---
data['booking_status'].value_counts()
# --- Code cell 9 ---
data['no_of_children'].value_counts()
# --- Code cell 10 ---
data['type_of_meal_plan'].value_counts()
# --- Code cell 11 ---
data['room_type_reserved'].value_counts()
# --- Code cell 12 ---
data.columns
# --- Code cell 13 ---
categorical_columns = ['no_of_weekend_nights' ,'no_of_children', 'type_of_meal_plan', 'room_type_reserved']
# --- Code cell 17 ---
from scipy.stats import chi2_contingency
# --- Code cell 18 ---
# defining the table
for column in categorical_columns:
    temp = pd.DataFrame(data[column].value_counts()).reset_index()
    temp.columns = [column, 'frequency']
    categories = list(temp[temp['frequency'] > 20][column])
    data_new = data[data[column].isin(categories)]
    print(data_new[column].value_counts())
    table = pd.crosstab(data_new[column], data_new['booking_status'])
    print(table)
    stat, p, dof, expected = chi2_contingency(table)
    print("Chi-square test for feature: ", column)
    print("p-value : ", p)
    print("")
    print("")
# --- Code cell 19 ---
column = 'no_of_weekend_nights'
temp = pd.DataFrame(data[column].value_counts()).reset_index()
print(temp)
# --- Code cell 20 ---
temp.columns = [column,'frequency']
temp.head(10)
# --- Code cell 21 ---
#Step 1 - remove rows with 20 or fewer examples
categories = list(temp[temp['frequency']>20][column])
print(categories)
# --- Code cell 22 ---
data_new = data[data[column].isin(categories)]
print(data_new[column].value_counts())
# --- Code cell 23 ---
#Step 2 - create a contingency table using crosstab
table = pd.crosstab(data_new[column], data_new['booking_status'])
print(table)
#Step 3 - Perform chi square test using chi2_contingency function
stat, p, dof, expected = chi2_contingency(table)
print("Chi-square test for feature: ", column)
print("p-value : ", p)
print("")
print("")
# --- Code cell 24 ---
# if p-value < 0.05 -> reject H0 in favor of the alternative hypothesis (at least 95% confidence)
# --- Code cell 27 ---
numerical_columns = ['lead_time','avg_price_per_room']
# --- Code cell 28 ---
data_cancelled = data[data['booking_status'] == 'Canceled']
data_cancelled[numerical_columns].describe()
# --- Code cell 29 ---
data_not_cancelled = data[data['booking_status'] == 'Not_Canceled']
data_not_cancelled[numerical_columns].describe()
# --- Code cell 30 ---
from scipy import stats
# defining the table
for column in numerical_columns:
    samples_set1 = data_cancelled[column]
    samples_set2 = data_not_cancelled[column]
    stat, p = stats.ttest_ind(samples_set1, samples_set2)
    print("ttest for feature: ", column)
    print("p-value : ", p)
    print("")
    print("")
# --- Code cell 31 ---
#lead_time - 0.4 -o/p
# --- Code cell 32 ---
#lead_time , price , type plan ,
# --- Code cell 34 ---
from scipy import stats
# --- Code cell 35 ---
column = 'lead_time'
# --- Code cell 36 ---
# Step 1 - create two populations for the numerical feature
samples_set1 = data_cancelled[column] # population1 -for cancelled bookings
samples_set2 = data_not_cancelled[column] # population2 -for not cancelled bookings
# --- Code cell 37 ---
samples_set1
# --- Code cell 38 ---
# Step 2 - pass the populations to ttest function
stat, p = stats.ttest_ind(samples_set1, samples_set2)
print("ttest for feature: ", column)
print("p-value : ", p)
print("")
print("")
# --- Code cell 39 ---
# p value <0.05 - lead time is useful feature for prediction
# --- Code cell 40 ---
# chi-square test
# checking relation between two categorical columns
# e.g. a categorical column like gender and hotel booking status
# welch t-test
# checking relation between a numerical column and a categorical column
# e.g. checking relation between lead time and hotel booking status
# --- Code cell 41 ---
# feature A - p >0.05 -> this feature might not be useful just on its own
# but it can still be useful when combined with other features
In one sentence: what does it mean when we say "we reject the null hypothesis at α = 0.05"? Why not say "we proved the alternative"?
Key functions: chi2_contingency, ttest_ind, f_oneway.

| Scenario | Test | Python Code |
|---|---|---|
| Is gender related to product choice? | Chi-Square | chi2_contingency(crosstab) |
| Do men spend more than women? | T-Test | ttest_ind(group1, group2) |
| Do sales differ across 4 regions? | ANOVA | f_oneway(g1, g2, g3, g4) |
| Which features predict churn? | Multiple tests | Chi-Square + T-Test loop |