AB Testing & Experimentation
Design an A/B test, simulate experiment data, run statistical tests (z-test, confidence intervals), and write data-driven recommendations.
Scenario
👶 In plain English
Product wants to change the “Sign up” button from blue to green. Will more people click it? Instead of guessing, we run an A/B test: half the users see blue (control), half see green (variant). After collecting data we compare click rates. But “green had 2% more clicks” isn’t enough—maybe it was luck. Run a proper experiment: define success metric and sample size, run or simulate the test, then use statistics (z-test for proportions, confidence intervals) to say whether the difference is real or noise, and write a clear recommendation.
What You’ll Build
- Hypothesis: H0 vs H1, metric, alpha
- Data: Simulated or real A/B experiment data
- Z-test: Two-proportion z-test with statsmodels
- Confidence interval: 95% CI for difference in proportions
- Recommendation: Ship, keep, or run longer—backed by stats
Prerequisites
Basic statistics (proportions, hypothesis testing, p-values), Python (pandas, SciPy, statsmodels), and an intro to A/B testing. Our Hypothesis Testing and AB Testing lessons cover this.
Step-by-Step Plan
1. Write hypothesis. H0, H1, metric (conversion), alpha (0.05).
2. Calculate sample size. Use statsmodels power analysis or the formula by hand.
3. Generate/collect data. Simulate 10k users with a 50/50 split.
4. Compute conversion rates. Control vs variant with pandas.
5. Run z-test. statsmodels' proportions_ztest.
6. Build confidence interval. 95% CI for the difference in proportions.
7. Write recommendation. Decision framework and stakeholder summary.
Write Hypothesis Doc
Define H0, H1, metric, and alpha
👶 In plain English
Before touching data, write down what you’re testing. H0 = no change; H1 = green increases conversion. We measure conversion rate and use 5% significance.
# HYPOTHESIS
# H0: p_control = p_variant (no difference in conversion)
# H1: p_variant > p_control (green button increases conversion)
# Metric: conversion = signups / visitors
# Alpha: 0.05
What happened
You documented the null and alternative hypothesis, the success metric (conversion rate), and the significance level. This is your experiment design foundation.
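Optionally, you can capture the same design in code so later steps read the numbers from one place. The dict below is just an illustrative layout, not a required format.
# Optional: keep the experiment design in one place (illustrative layout only)
design = {
    'metric': 'signup conversion = signups / visitors',
    'h0': 'p_control == p_variant',
    'h1': 'p_variant > p_control',
    'alpha': 0.05,
    'baseline_rate': 0.12,   # assumed current conversion rate
    'target_rate': 0.14,     # rate we hope the green button reaches
}
print(design)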
Sample Size Calculation
Use statsmodels to determine required n
👶 In plain English
How many users do we need to detect a 2% lift (e.g. 12% vs 14%)? statsmodels has built-in power analysis for proportions.
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
import numpy as np
# We want 80% power to detect 12% vs 14% at alpha=0.05
p1, p2 = 0.12, 0.14
effect = abs(proportion_effectsize(p1, p2))  # Cohen's h; abs() keeps the effect positive for the one-sided solve
n = zt_ind_solve_power(effect_size=effect, alpha=0.05, power=0.8,
                       alternative='larger', ratio=1.0)
print(f"Per group sample size: {int(np.ceil(n))}")  # roughly 3,500 per group for this one-sided setup
💡 Tip
The required n is only a floor, and padding it is cheap insurance. Our simulation uses 5,000 per group (10k total), comfortably above the roughly 3,500 the power analysis asks for.
What happened
Power analysis says roughly 3,500 users per group gives an 80% chance of detecting the 2pp lift with a one-sided test. We'll over-sample to 5,000 per group for safety.
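If you want to double-check the analytic answer, a quick Monte Carlo sketch estimates power directly: simulate many experiments at the computed n and count how often the one-sided test rejects. The 1,000 repetitions and seed below are arbitrary choices; the result should land near 0.80.
# Optional sanity check: estimate power by simulation at the computed n
from statsmodels.stats.proportion import proportions_ztest
rng = np.random.default_rng(0)
n_per_group, n_sims, rejections = int(np.ceil(n)), 1000, 0
for _ in range(n_sims):
    conv_a = rng.binomial(n_per_group, 0.12)   # control conversions
    conv_b = rng.binomial(n_per_group, 0.14)   # variant conversions
    _, p = proportions_ztest([conv_b, conv_a], [n_per_group, n_per_group],
                             alternative='larger')
    rejections += p < 0.05
print(f"Simulated power at n={n_per_group}: {rejections / n_sims:.2f}")  # expect ~0.80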
Generate Simulated Data
10k users, 50/50 split, control 12%, variant 14%
👶 In plain English
We create fake experiment data: assign each user to A or B at random, then flip a biased coin (12% or 14%) to decide if they convert. This lets us validate our analysis.
import numpy as np
import pandas as pd
np.random.seed(42)
n_users = 10000
variant = np.random.choice(['control', 'variant'], size=n_users, p=[0.5, 0.5])
p_convert = np.where(variant == 'control', 0.12, 0.14)
converted = np.random.binomial(1, p_convert)
df = pd.DataFrame({'user_id': range(n_users), 'variant': variant, 'converted': converted})
print(df.groupby('variant')['converted'].agg(['sum', 'count', 'mean']))
What happened
We generated 10k rows. Each user is in control or variant with 50/50 probability. Control converts at 12%, variant at 14%. Pandas summary shows counts and conversion rates per group.
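One extra check worth doing before any analysis (it's not part of the brief) is a sample ratio mismatch test: confirm the observed group sizes are consistent with the intended 50/50 split. A tiny p-value here would point to a broken assignment mechanism rather than a real effect.
# Optional sanity check: sample ratio mismatch (SRM) against the planned 50/50 split
from scipy import stats
group_sizes = df['variant'].value_counts()
chi2, p_srm = stats.chisquare(group_sizes, f_exp=[n_users / 2, n_users / 2])
print(f"SRM check p-value: {p_srm:.3f}")   # should not be tiny for a healthy split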
Compute Conversion Rates
Aggregate with pandas
👶 In plain English
For each group (control, variant), count how many converted and how many total. Conversion rate = converts / total.
control = df[df['variant'] == 'control']
variant_df = df[df['variant'] == 'variant']
n_A, conv_A = len(control), control['converted'].sum()
n_B, conv_B = len(variant_df), variant_df['converted'].sum()
p_A, p_B = conv_A / n_A, conv_B / n_B
print(f"Control: {conv_A}/{n_A} = {p_A:.4f}")
print(f"Variant: {conv_B}/{n_B} = {p_B:.4f}")
What happened
We extracted counts (conversions and sample size) for each group. These are the inputs for the z-test.
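If you prefer, the same inputs can come straight from one groupby summary; this optional variant also prints the absolute and relative lift you'll quote in the final report.
# Optional: same numbers from a single groupby summary, plus the lift
summary = df.groupby('variant')['converted'].agg(['size', 'sum', 'mean'])
summary.columns = ['n', 'conversions', 'rate']
print(summary)
abs_lift = summary.loc['variant', 'rate'] - summary.loc['control', 'rate']
print(f"Absolute lift: {abs_lift:.4f}  Relative lift: {abs_lift / summary.loc['control', 'rate']:.1%}")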
Two-Proportion Z-Test
statsmodels.stats.proportion.proportions_ztest
👶 In plain English
The z-test asks: “Could the difference between p_A and p_B be due to random chance?” If p-value < 0.05, we reject H0 and say the difference is statistically significant.
from statsmodels.stats.proportion import proportions_ztest
counts = np.array([conv_B, conv_A])  # variant first, so alternative='larger' tests p_variant > p_control
nobs = np.array([n_B, n_A])
z_stat, p_value = proportions_ztest(counts, nobs, alternative='larger')
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant at alpha=0.05? {p_value < 0.05}")
💡 Tip
With alternative='larger', the test is one-sided on the first sample you pass, so put the variant's counts first to test variant > control. Use 'two-sided' if you want to detect a difference in either direction.
What happened
The z-test compared the two proportions. A small p-value (< 0.05) means we reject “no difference” and conclude the variant has a higher conversion rate. With 10k users and 2pp true lift, you should see p < 0.05.
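To see what the library is doing, you can reproduce the z-statistic by hand using the pooled-proportion standard error that two-sample proportion z-tests typically use; the numbers should match the library result up to rounding.
# Optional: recompute the z-statistic by hand with the pooled proportion
from scipy.stats import norm
p_pool = (conv_A + conv_B) / (n_A + n_B)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z_manual = (p_B - p_A) / se_pool
print(f"Manual z: {z_manual:.4f}, one-sided p: {1 - norm.cdf(z_manual):.6f}")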
Confidence Interval for Difference
Manual formula for 95% CI
👶 In plain English
We want to say: “The true difference in conversion is between X% and Y% with 95% confidence.” That’s the confidence interval.
from scipy import stats as sp_stats
diff = p_B - p_A
se = np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
z_crit = sp_stats.norm.ppf(0.975) # 1.96 for 95% CI
ci_low = diff - z_crit * se
ci_high = diff + z_crit * se
print(f"Difference (B - A): {diff:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"CI excludes 0? {ci_low > 0}")
What happened
We computed the standard error of the difference and built a 95% CI. If the CI excludes 0, we’re confident the variant really is better. This backs up the z-test result.
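As an optional cross-check (not part of the original plan), a simple parametric bootstrap should give roughly the same interval: resample conversion counts from the observed rates many times and take the middle 95% of the simulated differences.
# Optional cross-check: parametric bootstrap CI for the difference
rng = np.random.default_rng(1)
boot_diffs = rng.binomial(n_B, p_B, 5000) / n_B - rng.binomial(n_A, p_A, 5000) / n_A
lo_b, hi_b = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Bootstrap 95% CI: [{lo_b:.4f}, {hi_b:.4f}]")   # should be close to the formula above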
Decision Framework & Recommendation
Write a clear summary
👶 In plain English
Combine p-value and CI into a decision: ship green (significant, positive lift), keep blue (not significant), or run longer (inconclusive). Then write 2–3 sentences for stakeholders.
def recommend(p_value, ci_low, ci_high, alpha=0.05):
    if p_value < alpha and ci_low > 0:
        return "SHIP: Statistically significant. Ship the green button."
    elif p_value >= alpha and ci_high < 0:
        return "KEEP: Control is better. Keep the blue button."
    elif ci_low <= 0 <= ci_high:
        return "INCONCLUSIVE: Run longer or increase sample size."
    else:
        return "INCONCLUSIVE: Run longer."

rec = recommend(p_value, ci_low, ci_high)
print(rec)
# Summary for stakeholders:
print(f"""
A/B Test Summary
- Metric: Sign-up conversion
- Control (blue): {p_A:.2%} ({conv_A}/{n_A})
- Variant (green): {p_B:.2%} ({conv_B}/{n_B})
- Lift: {(p_B-p_A)/p_A*100:.1f}% relative
- P-value: {p_value:.4f}
- 95% CI: [{ci_low:.2%}, {ci_high:.2%}]
- Recommendation: {rec}
""")
What happened
You built a reusable decision function and a stakeholder-friendly summary. This is how product teams communicate experiment results: metric, numbers, and a clear recommendation.
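If you expect to run more experiments, you could fold the whole pipeline into one helper that takes the raw dataframe and returns the key numbers plus the recommendation. The sketch below is one possible way to organize it; the name analyze_ab and its return format are made up for illustration, and it reuses the recommend() function from above.
# Optional: a hypothetical end-to-end helper (sketch only), reusing recommend() from above
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import norm

def analyze_ab(df, control_label='control', variant_label='variant', alpha=0.05):
    g = df.groupby('variant')['converted'].agg(['sum', 'count', 'mean'])
    cA, nA, pA = g.loc[control_label]
    cB, nB, pB = g.loc[variant_label]
    _, p_val = proportions_ztest([cB, cA], [nB, nA], alternative='larger')   # variant first
    se = np.sqrt(pA * (1 - pA) / nA + pB * (1 - pB) / nB)
    z = norm.ppf(1 - alpha / 2)
    lo, hi = (pB - pA) - z * se, (pB - pA) + z * se
    return {'p_value': p_val, 'ci_95': (lo, hi), 'recommendation': recommend(p_val, lo, hi, alpha)}

print(analyze_ab(df))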