Intermediate · Pro

A/B Testing & Experimentation

Design an A/B test, simulate experiment data, run the statistical analysis (two-proportion z-test, confidence interval), and write a data-driven recommendation.

Python · Statistics · Data Science

📚 Before you start, take these courses:

Python → Data Science

Scenario

👶 In plain English

Product wants to change the “Sign up” button from blue to green. Will more people click it? Instead of guessing, we run an A/B test: half the users see blue (control), half see green (variant). After collecting data we compare click rates. But “green had 2% more clicks” isn’t enough on its own; maybe it was luck. In this project you’ll run a proper experiment: define a success metric and sample size, run or simulate the test, then use statistics (a z-test for proportions, confidence intervals) to say whether the difference is real or noise, and write a clear recommendation.

What You’ll Build

Hypothesis → Data → Z-Test → CI & Recommendation
  • Hypothesis: H0 vs H1, metric, alpha
  • Data: Simulated or real A/B experiment data
  • Z-test: Two-proportion z-test with statsmodels
  • Confidence interval: 95% CI for difference in proportions
  • Recommendation: Ship, keep, or run longer—backed by stats

Prerequisites

Basic statistics (proportions, hypothesis testing, p-values), Python (pandas, statsmodels, SciPy), and an intro to A/B testing. Our Hypothesis Testing and AB Testing lessons cover this.
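
If those terms feel rusty, here’s a tiny refresher with made-up numbers: a conversion rate is just a proportion, and a one-sided p-value is the normal tail area beyond a z-score.

Python
from scipy import stats

p_hat = 45 / 400           # hypothetical: 45 signups from 400 visitors
z = 1.8                    # hypothetical z-score
print(p_hat)               # 0.1125 -- a conversion rate is a proportion
print(stats.norm.sf(z))    # ~0.0359 -- one-sided p-value, P(Z > z)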

Step-by-Step Plan

  1. Write hypothesis: H0, H1, metric (conversion), alpha (0.05).
  2. Calculate sample size: use statsmodels power analysis.
  3. Generate/collect data: simulate 10k users with a 50/50 split.
  4. Compute conversion rates: control vs variant with pandas.
  5. Run the z-test: statsmodels proportions_ztest.
  6. Build the confidence interval: 95% CI for the difference.
  7. Write the recommendation: decision framework and summary.
Step 1: Write Hypothesis Doc

Define H0, H1, metric, and alpha

👶 In plain English

Before touching data, write down what you’re testing. H0 = no change; H1 = green increases conversion. We measure conversion rate and use 5% significance.

Python
# HYPOTHESIS
# H0: p_control = p_variant  (no difference in conversion)
# H1: p_variant > p_control  (green button increases conversion)
# Metric: conversion = signups / visitors
# Alpha: 0.05

What happened

You documented the null and alternative hypothesis, the success metric (conversion rate), and the significance level. This is your experiment design foundation.

Step 2: Sample Size Calculation

Use statsmodels to determine required n

👶 In plain English

How many users do we need to detect a 2-percentage-point lift (12% vs 14%)? statsmodels has built-in power analysis for proportions.

Python
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
import numpy as np

# We want 80% power to detect 12% vs 14% at alpha=0.05 (one-sided)
p1, p2 = 0.12, 0.14
effect = proportion_effectsize(p2, p1)  # larger rate first -> positive effect (Cohen's h)
n = zt_ind_solve_power(effect_size=effect, alpha=0.05, power=0.8,
                       alternative='larger', ratio=1.0)
print(f"Per group sample size: {int(np.ceil(n))}")  # ~3490 per group

💡 Tip

In practice, pad the estimate. The power analysis asks for ~3500 per group; our simulation will use ~5000 each (10k total) to be safe.

What happened

Power analysis tells us roughly 3,500 per group gives an 80% chance to detect the 2pp lift with a one-sided test. We’ll over-sample for safety.
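
Curious how much power ~5000 per group actually buys? Leave power unset and the same solver returns achieved power instead of sample size (a quick sanity check; effect is the value computed above).

Python
# Solve for power by passing nobs1 and leaving power unset
achieved = zt_ind_solve_power(effect_size=effect, nobs1=5000,
                              alpha=0.05, ratio=1.0, alternative='larger')
print(f"Power at 5000 per group: {achieved:.3f}")  # ~0.91, comfortably above 0.8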

Step 3: Generate Simulated Data

10k users, 50/50 split, control 12%, variant 14%

👶 In plain English

We create fake experiment data: assign each user to A or B at random, then flip a biased coin (12% or 14%) to decide if they convert. This lets us validate our analysis.

Python
import numpy as np
import pandas as pd

np.random.seed(42)
n_users = 10000
variant = np.random.choice(['control', 'variant'], size=n_users, p=[0.5, 0.5])
p_convert = np.where(variant == 'control', 0.12, 0.14)
converted = np.random.binomial(1, p_convert)

df = pd.DataFrame({'user_id': range(n_users), 'variant': variant, 'converted': converted})
print(df.groupby('variant')['converted'].agg(['sum', 'count', 'mean']))

What happened

We generated 10k rows. Each user is in control or variant with 50/50 probability. Control converts at 12%, variant at 14%. Pandas summary shows counts and conversion rates per group.
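
One habit worth borrowing from real experiments: a sample-ratio-mismatch (SRM) check. With a 50/50 split, the observed group sizes should be consistent with a fair coin. A sketch using scipy’s exact binomial test (available in scipy ≥ 1.7):

Python
from scipy import stats

# SRM check: is the control group's size consistent with a 50/50 split?
n_control = int((df['variant'] == 'control').sum())
srm = stats.binomtest(n_control, n=n_users, p=0.5)
print(f"Control size: {n_control}, SRM p-value: {srm.pvalue:.3f}")
# A tiny p-value here would point to a broken randomizer, not a real effect.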

Step 4: Compute Conversion Rates

Aggregate with pandas

👶 In plain English

For each group (control, variant), count how many users converted and how many there were in total. Conversion rate = conversions / total.

Python
control = df[df['variant'] == 'control']
variant_df = df[df['variant'] == 'variant']

n_A, conv_A = len(control), control['converted'].sum()
n_B, conv_B = len(variant_df), variant_df['converted'].sum()
p_A, p_B = conv_A / n_A, conv_B / n_B

print(f"Control:  {conv_A}/{n_A} = {p_A:.4f}")
print(f"Variant:  {conv_B}/{n_B} = {p_B:.4f}")

What happened

We extracted counts (conversions and sample size) for each group. These are the inputs for the z-test.
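
It also helps to phrase the gap both ways before testing: absolute lift in percentage points and relative lift as a share of the control rate (the step 7 summary reports the relative number).

Python
abs_lift = p_B - p_A        # absolute lift, e.g. 0.02 = 2 percentage points
rel_lift = abs_lift / p_A   # relative lift vs control, e.g. ~17%
print(f"Absolute lift: {abs_lift * 100:.2f} pp | Relative lift: {rel_lift:.1%}")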

Step 5: Two-Proportion Z-Test

statsmodels.stats.proportion.proportions_ztest

👶 In plain English

The z-test asks: “Could the difference between p_A and p_B be due to random chance?” If p-value < 0.05, we reject H0 and say the difference is statistically significant.

Python
from statsmodels.stats.proportion import proportions_ztest

# proportions_ztest lives in statsmodels, not scipy.
# List the variant first so alternative='larger' tests H1: p_variant > p_control.
counts = np.array([conv_B, conv_A])
nobs = np.array([n_B, n_A])
z_stat, p_value = proportions_ztest(counts, nobs, alternative='larger')

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant at alpha=0.05? {p_value < 0.05}")

💡 Tip

Use alternative='larger' for a one-sided test of first group > second group; with the variant's counts listed first, that's variant > control. Use 'two-sided' if you want to detect a difference in either direction.

What happened

The z-test compared the two proportions. A small p-value (< 0.05) means we reject “no difference” and conclude the variant has a higher conversion rate. With 10k users and 2pp true lift, you should see p < 0.05.
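
To demystify the library call, here’s the same z-statistic computed by hand with the pooled proportion that proportions_ztest uses by default under H0; it should match the library’s output.

Python
from scipy.stats import norm

# Pooled proportion under H0: both groups share one conversion rate
p_pool = (conv_A + conv_B) / (n_A + n_B)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z_manual = (p_B - p_A) / se_pool
print(f"Manual z: {z_manual:.4f}, one-sided p: {norm.sf(z_manual):.6f}")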

Step 6: Confidence Interval for Difference

Manual formula for 95% CI

👶 In plain English

We want to say: “The true difference in conversion is between X% and Y% with 95% confidence.” That’s the confidence interval.

Python
from scipy import stats as sp_stats

diff = p_B - p_A
se = np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
z_crit = sp_stats.norm.ppf(0.975)  # 1.96 for 95% CI
ci_low = diff - z_crit * se
ci_high = diff + z_crit * se

print(f"Difference (B - A): {diff:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"CI excludes 0? {ci_low > 0}")

What happened

We computed the standard error of the difference and built a 95% CI. If the CI excludes 0, we’re confident the variant really is better. This backs up the z-test result.
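
If you’d rather not hand-roll the interval, recent statsmodels (≥ 0.12) ships a helper for the same quantity. It returns the CI for the first proportion minus the second, so pass the variant first to match diff = p_B - p_A; with the Wald method it should closely match the manual interval.

Python
from statsmodels.stats.proportion import confint_proportions_2indep

low, high = confint_proportions_2indep(conv_B, n_B, conv_A, n_A,
                                       compare='diff', method='wald')
print(f"statsmodels 95% CI: [{low:.4f}, {high:.4f}]")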

Step 7: Decision Framework & Recommendation

Write a clear summary

👶 In plain English

Combine p-value and CI into a decision: ship green (significant, positive lift), keep blue (not significant), or run longer (inconclusive). Then write 2–3 sentences for stakeholders.

Python
def recommend(p_value, ci_low, ci_high, alpha=0.05):
    # Significant and the CI is entirely above 0: the variant wins.
    if p_value < alpha and ci_low > 0:
        return "SHIP: Statistically significant. Ship the green button."
    # CI entirely below 0 (a one-sided test won't be significant here): control wins.
    elif p_value >= alpha and ci_high < 0:
        return "KEEP: Control is better. Keep the blue button."
    # CI straddles 0: we can't tell the difference from noise.
    elif ci_low <= 0 <= ci_high:
        return "INCONCLUSIVE: Run longer or increase sample size."
    else:
        return "INCONCLUSIVE: Run longer."

rec = recommend(p_value, ci_low, ci_high)
print(rec)
# Summary for stakeholders:
print(f"""
A/B Test Summary
- Metric: Sign-up conversion
- Control (blue): {p_A:.2%} ({conv_A}/{n_A})
- Variant (green): {p_B:.2%} ({conv_B}/{n_B})
- Lift: {(p_B-p_A)/p_A*100:.1f}% relative
- P-value: {p_value:.4f}
- 95% CI: [{ci_low:.2%}, {ci_high:.2%}]
- Recommendation: {rec}
""")

What happened

You built a reusable decision function and a stakeholder-friendly summary. This is how product teams communicate experiment results: metric, numbers, and a clear recommendation.
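
As a quick check that the framework behaves sensibly, feed it hypothetical results that exercise each branch:

Python
# Hypothetical inputs, one per branch of recommend()
print(recommend(0.001, 0.004, 0.021))    # significant, CI above 0 -> SHIP
print(recommend(0.999, -0.025, -0.006))  # CI entirely below 0 -> KEEP
print(recommend(0.200, -0.005, 0.012))   # CI straddles 0 -> INCONCLUSIVE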
