AB Testing & Experimentation
Design an A/B test, simulate experiment data, run statistical tests (z-test, confidence intervals), and write data-driven recommendations.
Scenario
👶 In plain English
Product wants to change the “Sign up” button from blue to green. Will more people click it? Instead of guessing, we run an A/B test: half the users see blue (control), half see green (variant). After collecting data we compare click rates. But “green had 2% more clicks” isn’t enough—maybe it was luck. Run a proper experiment: define success metric and sample size, run or simulate the test, then use statistics (z-test for proportions, confidence intervals) to say whether the difference is real or noise, and write a clear recommendation.
What You’ll Build
- Hypothesis: H0 vs H1, metric, alpha
- Data: Simulated or real A/B experiment data
- Z-test: Two-proportion z-test with statsmodels
- Confidence interval: 95% CI for difference in proportions
- Recommendation: Ship, keep, or run longer—backed by stats
Prerequisites
Basic statistics (proportions, hypothesis testing, p-values), Python (pandas, SciPy, statsmodels), and an intro to A/B testing. Our Hypothesis Testing and AB Testing lessons cover this.
Step-by-Step Plan
1. Write hypothesis. H0, H1, metric (conversion), alpha (0.05).
2. Calculate sample size. Use statsmodels power analysis or the formula by hand.
3. Generate/collect data. Simulate 10k users with a 50/50 split.
4. Compute conversion rates. Control vs variant with pandas.
5. Run z-test. statsmodels' proportions_ztest.
6. Build confidence interval. 95% CI for the difference in proportions.
7. Write recommendation. Decision framework and stakeholder summary.
Write Hypothesis Doc
Define H0, H1, metric, and alpha
👶 In plain English
Before touching data, write down what you’re testing. H0 = no change; H1 = green increases conversion. We measure conversion rate and use 5% significance.
# HYPOTHESIS
# H0: p_control = p_variant (no difference in conversion)
# H1: p_variant > p_control (green button increases conversion)
# Metric: conversion = signups / visitors
# Alpha: 0.05
What happened
You documented the null and alternative hypothesis, the success metric (conversion rate), and the significance level. This is your experiment design foundation.
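Optionally, you can capture the same design in code so later steps read the numbers from one place. The dict below is just an illustrative layout, not a required format.
# Optional: keep the experiment design in one place (illustrative layout only)
design = {
    'metric': 'signup conversion = signups / visitors',
    'h0': 'p_control == p_variant',
    'h1': 'p_variant > p_control',
    'alpha': 0.05,
    'baseline_rate': 0.12,   # assumed current conversion rate
    'target_rate': 0.14,     # rate we hope the green button reaches
}
print(design)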
Sample Size Calculation
Use statsmodels to determine required n
👶 In plain English
How many users do we need to detect a 2% lift (e.g. 12% vs 14%)? statsmodels has built-in power analysis for proportions.
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
import numpy as np
# We want 80% power to detect 12% vs 14% at alpha=0.05
p1, p2 = 0.12, 0.14
effect = abs(proportion_effectsize(p1, p2))  # Cohen's h; abs() keeps the effect positive for the one-sided solve
n = zt_ind_solve_power(effect_size=effect, alpha=0.05, power=0.8,
                       alternative='larger', ratio=1.0)
print(f"Per group sample size: {int(np.ceil(n))}")  # roughly 3,500 per group for this one-sided setup
💡 Tip
The required n is only a floor, and padding it is cheap insurance. Our simulation uses 5,000 per group (10k total), comfortably above the roughly 3,500 the power analysis asks for.
What happened
Power analysis says roughly 3,500 users per group gives an 80% chance of detecting the 2pp lift with a one-sided test. We'll over-sample to 5,000 per group for safety.
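If you want to double-check the analytic answer, a quick Monte Carlo sketch estimates power directly: simulate many experiments at the computed n and count how often the one-sided test rejects. The 1,000 repetitions and seed below are arbitrary choices; the result should land near 0.80.
# Optional sanity check: estimate power by simulation at the computed n
from statsmodels.stats.proportion import proportions_ztest
rng = np.random.default_rng(0)
n_per_group, n_sims, rejections = int(np.ceil(n)), 1000, 0
for _ in range(n_sims):
    conv_a = rng.binomial(n_per_group, 0.12)   # control conversions
    conv_b = rng.binomial(n_per_group, 0.14)   # variant conversions
    _, p = proportions_ztest([conv_b, conv_a], [n_per_group, n_per_group],
                             alternative='larger')
    rejections += p < 0.05
print(f"Simulated power at n={n_per_group}: {rejections / n_sims:.2f}")  # expect ~0.80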
Generate Simulated Data
10k users, 50/50 split, control 12%, variant 14%
👶 In plain English
We create fake experiment data: assign each user to A or B at random, then flip a biased coin (12% or 14%) to decide if they convert. This lets us validate our analysis.
import numpy as np
import pandas as pd
np.random.seed(42)
n_users = 10000
variant = np.random.choice(['control', 'variant'], size=n_users, p=[0.5, 0.5])
p_convert = np.where(variant == 'control', 0.12, 0.14)
converted = np.random.binomial(1, p_convert)
df = pd.DataFrame({'user_id': range(n_users), 'variant': variant, 'converted': converted})
print(df.groupby('variant')['converted'].agg(['sum', 'count', 'mean']))
What happened
We generated 10k rows. Each user is in control or variant with 50/50 probability. Control converts at 12%, variant at 14%. Pandas summary shows counts and conversion rates per group.
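One extra check worth doing before any analysis (it's not part of the brief) is a sample ratio mismatch test: confirm the observed group sizes are consistent with the intended 50/50 split. A tiny p-value here would point to a broken assignment mechanism rather than a real effect.
# Optional sanity check: sample ratio mismatch (SRM) against the planned 50/50 split
from scipy import stats
group_sizes = df['variant'].value_counts()
chi2, p_srm = stats.chisquare(group_sizes, f_exp=[n_users / 2, n_users / 2])
print(f"SRM check p-value: {p_srm:.3f}")   # should not be tiny for a healthy split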
Compute Conversion Rates
Aggregate with pandas
👶 In plain English
For each group (control, variant), count how many converted and how many total. Conversion rate = converts / total.
control = df[df['variant'] == 'control']
variant_df = df[df['variant'] == 'variant']
n_A, conv_A = len(control), control['converted'].sum()
n_B, conv_B = len(variant_df), variant_df['converted'].sum()
p_A, p_B = conv_A / n_A, conv_B / n_B
print(f"Control: {conv_A}/{n_A} = {p_A:.4f}")
print(f"Variant: {conv_B}/{n_B} = {p_B:.4f}")
What happened
We extracted counts (conversions and sample size) for each group. These are the inputs for the z-test.
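If you prefer, the same inputs can come straight from one groupby summary; this optional variant also prints the absolute and relative lift you'll quote in the final report.
# Optional: same numbers from a single groupby summary, plus the lift
summary = df.groupby('variant')['converted'].agg(['size', 'sum', 'mean'])
summary.columns = ['n', 'conversions', 'rate']
print(summary)
abs_lift = summary.loc['variant', 'rate'] - summary.loc['control', 'rate']
print(f"Absolute lift: {abs_lift:.4f}  Relative lift: {abs_lift / summary.loc['control', 'rate']:.1%}")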
Two-Proportion Z-Test
statsmodels.stats.proportion.proportions_ztest
👶 In plain English
The z-test asks: “Could the difference between p_A and p_B be due to random chance?” If p-value < 0.05, we reject H0 and say the difference is statistically significant.
from statsmodels.stats.proportion import proportions_ztest
counts = np.array([conv_B, conv_A])  # variant first, so alternative='larger' tests p_variant > p_control
nobs = np.array([n_B, n_A])
z_stat, p_value = proportions_ztest(counts, nobs, alternative='larger')
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant at alpha=0.05? {p_value < 0.05}")
💡 Tip
With alternative='larger', the test is one-sided on the first sample you pass, so put the variant's counts first to test variant > control. Use 'two-sided' if you want to detect a difference in either direction.
What happened
The z-test compared the two proportions. A small p-value (< 0.05) means we reject “no difference” and conclude the variant has a higher conversion rate. With 10k users and 2pp true lift, you should see p < 0.05.
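To see what the library is doing, you can reproduce the z-statistic by hand using the pooled-proportion standard error that two-sample proportion z-tests typically use; the numbers should match the library result up to rounding.
# Optional: recompute the z-statistic by hand with the pooled proportion
from scipy.stats import norm
p_pool = (conv_A + conv_B) / (n_A + n_B)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z_manual = (p_B - p_A) / se_pool
print(f"Manual z: {z_manual:.4f}, one-sided p: {1 - norm.cdf(z_manual):.6f}")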
Confidence Interval for Difference
Manual formula for 95% CI
👶 In plain English
We want to say: “The true difference in conversion is between X% and Y% with 95% confidence.” That’s the confidence interval.
from scipy import stats as sp_stats
diff = p_B - p_A
se = np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
z_crit = sp_stats.norm.ppf(0.975) # 1.96 for 95% CI
ci_low = diff - z_crit * se
ci_high = diff + z_crit * se
print(f"Difference (B - A): {diff:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"CI excludes 0? {ci_low > 0}")
What happened
We computed the standard error of the difference and built a 95% CI. If the CI excludes 0, we’re confident the variant really is better. This backs up the z-test result.
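As an optional cross-check (not part of the original plan), a simple parametric bootstrap should give roughly the same interval: resample conversion counts from the observed rates many times and take the middle 95% of the simulated differences.
# Optional cross-check: parametric bootstrap CI for the difference
rng = np.random.default_rng(1)
boot_diffs = rng.binomial(n_B, p_B, 5000) / n_B - rng.binomial(n_A, p_A, 5000) / n_A
lo_b, hi_b = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Bootstrap 95% CI: [{lo_b:.4f}, {hi_b:.4f}]")   # should be close to the formula above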
Decision Framework & Recommendation
Write a clear summary
👶 In plain English
Combine p-value and CI into a decision: ship green (significant, positive lift), keep blue (not significant), or run longer (inconclusive). Then write 2–3 sentences for stakeholders.
def recommend(p_value, ci_low, ci_high, alpha=0.05):
    if p_value < alpha and ci_low > 0:
        return "SHIP: Statistically significant. Ship the green button."
    elif p_value >= alpha and ci_high < 0:
        return "KEEP: Control is better. Keep the blue button."
    elif ci_low <= 0 <= ci_high:
        return "INCONCLUSIVE: Run longer or increase sample size."
    else:
        return "INCONCLUSIVE: Run longer."

rec = recommend(p_value, ci_low, ci_high)
print(rec)
# Summary for stakeholders:
print(f"""
A/B Test Summary
- Metric: Sign-up conversion
- Control (blue): {p_A:.2%} ({conv_A}/{n_A})
- Variant (green): {p_B:.2%} ({conv_B}/{n_B})
- Lift: {(p_B-p_A)/p_A*100:.1f}% relative
- P-value: {p_value:.4f}
- 95% CI: [{ci_low:.2%}, {ci_high:.2%}]
- Recommendation: {rec}
""")
What happened
You built a reusable decision function and a stakeholder-friendly summary. This is how product teams communicate experiment results: metric, numbers, and a clear recommendation.
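If you expect to run more experiments, you could fold the whole pipeline into one helper that takes the raw dataframe and returns the key numbers plus the recommendation. The sketch below is one possible way to organize it; the name analyze_ab and its return format are made up for illustration, and it reuses the recommend() function from above.
# Optional: a hypothetical end-to-end helper (sketch only), reusing recommend() from above
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import norm

def analyze_ab(df, control_label='control', variant_label='variant', alpha=0.05):
    g = df.groupby('variant')['converted'].agg(['sum', 'count', 'mean'])
    cA, nA, pA = g.loc[control_label]
    cB, nB, pB = g.loc[variant_label]
    _, p_val = proportions_ztest([cB, cA], [nB, nA], alternative='larger')   # variant first
    se = np.sqrt(pA * (1 - pA) / nA + pB * (1 - pB) / nB)
    z = norm.ppf(1 - alpha / 2)
    lo, hi = (pB - pA) - z * se, (pB - pA) + z * se
    return {'p_value': p_val, 'ci_95': (lo, hi), 'recommendation': recommend(p_val, lo, hi, alpha)}

print(analyze_ab(df))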