Learn how companies like Facebook, Netflix, and Amazon test new features to make data-driven decisions!
Imagine you're selling lemonade and want to know which sign attracts more customers:
You show Sign A to half the people walking by, and Sign B to the other half.
After counting who bought more, you know which sign is better!
That's A/B Testing!
A/B testing means randomly showing one version (A = control) to some users and another version (B = variant) to others, then comparing a metric (e.g. conversion rate) and using a statistical test (e.g. t-test) to decide if the difference is real or just luck. If p < 0.05, we say the result is significant and we can choose the winner.
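In practice, the random split is usually done with a stable hash of the user ID rather than a fresh coin flip, so the same person always sees the same version. Here's a minimal sketch of that idea (the user IDs and the md5-based bucketing are just illustrative, not any particular company's implementation):

```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Deterministically assign a user to group 'A' or 'B' (a 50/50 split)."""
    # Hash the user ID so the same person always lands in the same group
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100           # a number from 0 to 99
    return "A" if bucket < 50 else "B"       # 0-49 -> A, 50-99 -> B

# The assignment is stable: the same user gets the same answer on every visit
for uid in ["user_001", "user_002", "user_003"]:
    print(uid, "->", assign_variant(uid))
```

Hash-based bucketing keeps each user's experience consistent across visits, which matters for the "randomize properly" rule later in this lesson.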
Here's the whole flow at a glance:

                    100% of Users
                          |
           +--------------+--------------+
           |                             |
           v                             v
       Version A                     Version B
    (Current Design)               (New Design)
     [Blue Button]                [Green Button]
           |                             |
        50 Users                      50 Users
           |                             |
      5 Purchases                   12 Purchases
     (10% Convert)                 (24% Convert)
           |                             |
           +--------------+--------------+
                          |
                          v
                  VERSION B WINS!
"Will changing the button color from blue to green increase sign-ups?"
- Version A (Control): the current design (blue button)
- Version B (Variant): the new design (green button)
- 50% of users see Version A, 50% see Version B (randomly assigned)
- Count conversions (sign-ups, purchases, clicks) for each version (see the sketch just below)
- Use a t-test to check if the difference is REAL or just random chance
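Here's a rough sketch of the "split and count conversions" part on a made-up per-user log (the column names and values are hypothetical); the t-test itself comes later in the lesson:

```python
import pandas as pd

# Hypothetical per-user log: one row per visitor, recording which version
# they saw and whether they converted (all values are made up)
log = pd.DataFrame({
    "user_id":   ["u1", "u2", "u3", "u4", "u5", "u6"],
    "variant":   ["A",  "B",  "A",  "B",  "A",  "B"],
    "converted": [0,     1,    0,    1,    1,    0],
})

# Count users and conversions per version, then compute each conversion rate
summary = log.groupby("variant")["converted"].agg(users="count", conversions="sum")
summary["conversion_rate"] = summary["conversions"] / summary["users"]
print(summary)
```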
Version A mockup: the current website design with a blue "Sign Up" button (this is what users see now).
Version B mockup: the new design with a green "Sign Up" button (this is what we're testing).
Suppose Version A had a 10% conversion rate and Version B had 12%.
But wait! Is that 2-point difference REAL, or just random luck?
Maybe if we tested again, A would do better?
That's why we use statistics!
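A quick simulation shows why. In the sketch below both versions have exactly the same true 10% conversion rate (all numbers are made up), yet the observed rates still drift apart from sample to sample:

```python
import numpy as np

rng = np.random.default_rng(42)

true_rate = 0.10    # pretend BOTH versions really convert at exactly 10%
n_users = 500       # users shown each version

# Repeat the same "experiment" a few times with NO real difference between A and B
for trial in range(5):
    conv_A = rng.binomial(n_users, true_rate) / n_users
    conv_B = rng.binomial(n_users, true_rate) / n_users
    print(f"Trial {trial + 1}: A = {conv_A:.1%}, B = {conv_B:.1%}, gap = {conv_B - conv_A:+.1%}")

# The gaps wander around zero even though nothing changed --
# which is exactly why we need a statistical test before picking a winner.
```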
The T-Test answers: "Is the difference between two groups REAL or just coincidence?"
It gives us a p-value:
| p-value | Meaning | Decision |
|---|---|---|
| p < 0.01 | Very strong evidence | Definitely implement B! |
| p < 0.05 | Strong evidence | Safe to implement B |
| p < 0.10 | Weak evidence | Maybe test longer |
| p ≥ 0.10 | No evidence | Probably no real difference |
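If you'd like the table as code, here's a tiny helper that maps a p-value to the same rule-of-thumb decisions (the thresholds are conventions, not laws of nature):

```python
def interpret_p_value(p: float) -> str:
    """Turn a p-value into the rule-of-thumb decision from the table above."""
    if p < 0.01:
        return "Very strong evidence: definitely implement B!"
    elif p < 0.05:
        return "Strong evidence: safe to implement B."
    elif p < 0.10:
        return "Weak evidence: maybe test longer."
    else:
        return "No evidence: probably no real difference."

print(interpret_p_value(0.000347))   # the p-value we'll get later in this lesson
print(interpret_p_value(0.20))
```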
Download this CSV to follow along with the code examples below.
import pandas as pd
from scipy import stats

# Load A/B test data (download from the link above!)
# This data has conversion rates for 35 days
data = pd.read_csv("AB_testing_data.csv")

# Let's look at the data
print(data.head(10))

# Output:
#    Day  Conversion fraction A  Conversion fraction B
# 0    1                  0.102                  0.189
# 1    2                  0.095                  0.178
# 2    3                  0.108                  0.192
# ...
- `import pandas as pd` → lets us use DataFrames and read the CSV.
- `from scipy import stats` → gives us the statistical test (the t-test) we'll run later.
- `pd.read_csv("AB_testing_data.csv")` → loads the A/B test file; its columns are Day, Conversion fraction A, and Conversion fraction B.
- `print(data.head(10))` → shows the first 10 rows so you can see the conversion rates for each day.
# Calculate the average conversion rate for each version
avg_A = data['Conversion fraction A'].mean()
avg_B = data['Conversion fraction B'].mean()

print(f"Version A average conversion: {avg_A:.1%}")
print(f"Version B average conversion: {avg_B:.1%}")
print(f"Difference: {(avg_B - avg_A):.1%}")

# Output:
# Version A average conversion: 10.2%
# Version B average conversion: 18.5%
# Difference: 8.3%

# Wow! B looks much better! But is it STATISTICALLY significant?
# Get the conversion rates for each version
group_A = data['Conversion fraction A']
group_B = data['Conversion fraction B']

# Run the t-test!
t_stat, p_value = stats.ttest_ind(group_A, group_B)

print("=" * 50)
print("  A/B TEST RESULTS")
print("=" * 50)
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value:     {p_value:.6f}")
print("=" * 50)

# Output:
# ==================================================
#   A/B TEST RESULTS
# ==================================================
# T-statistic: -3.74
# P-value:     0.000347
# ==================================================
# Make a decision based on the p-value
alpha = 0.05  # Significance threshold (5%)

if p_value < alpha:
    print("STATISTICALLY SIGNIFICANT!")
    print("The difference is REAL, not random luck.")
    # Note: 1 - p isn't literally the probability that B is better, but a
    # p-value this tiny means the gap is very unlikely to be pure chance.
    print(f"We are {(1 - p_value) * 100:.2f}% confident Version B is better!")
    print("\nRECOMMENDATION: Implement Version B!")
else:
    print("NOT statistically significant.")
    print("The difference might just be random chance.")
    print("\nRECOMMENDATION: Keep Version A or test longer.")

# Output:
# STATISTICALLY SIGNIFICANT!
# The difference is REAL, not random luck.
# We are 99.97% confident Version B is better!
#
# RECOMMENDATION: Implement Version B!
With p-value = 0.000347 (much less than 0.05), we can confidently say:
"Version B truly performs better - it's not just luck!"
| Mistake | Why It's Bad | What to Do Instead |
|---|---|---|
| Stopping too early | Small sample = unreliable results | Wait for enough data (usually 1,000+ users per version; see the sample-size sketch below) |
| Testing too many things | Can't tell which change made the difference | Change ONE thing at a time |
| Peeking at results | Leads to false positives | Set a fixed end date before starting |
| Not randomizing properly | Biased groups | Use proper random assignment |
| Ignoring seasonality | Weekend vs weekday behavior differs | Test for at least 1-2 full weeks |
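How much data is "enough"? A common rule of thumb is the two-proportion sample-size formula sketched below; the baseline rate, the lift you want to detect, the 5% significance level, and 80% power are all numbers you choose, and the ones here are purely illustrative:

```python
from scipy.stats import norm

def users_per_version(p_baseline, p_target, alpha=0.05, power=0.80):
    """Rough sample size per group for comparing two conversion rates."""
    z_alpha = norm.ppf(1 - alpha / 2)    # two-sided 5% significance by default
    z_power = norm.ppf(power)            # 80% chance of detecting a real lift
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return int((z_alpha + z_power) ** 2 * variance / effect ** 2) + 1

# Illustrative numbers: a 10% baseline, hoping to detect a lift to 12%
print(users_per_version(0.10, 0.12))   # roughly 3,800 users per version
```

Detecting a small lift reliably usually takes far more users than people expect, which is why stopping too early is such a common trap.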
The course source uses ab_testing_data.csv (or similar), with control vs. variant groups and a conversion metric. The key steps: split by group, compute conversion rates, and run a t-test or z-test for significance. Download ab_testing_data.csv from the datasets page, and see AB testing and Market Basket Analysis.pdf in the course source for the slides.
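The notebook runs a t-test on daily conversion fractions, but when you have raw counts (conversions and visitors per group), a two-proportion z-test is the more direct option. Here's a sketch using statsmodels (assuming it is installed), fed with the counts from the diagram near the top of this lesson:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [5, 12]    # purchases in Version A and Version B (diagram numbers)
visitors    = [50, 50]   # users who saw each version

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z-statistic: {z_stat:.2f}")
print(f"p-value:     {p_value:.4f}")

# With only 50 users per version, even a 10% vs 24% gap may not clear the
# 0.05 bar -- one more reason to collect enough data before declaring a winner.
```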
Every line of code from the course notebook, verbatim:
# --- Code cell 1 ---
from IPython.core.display import HTML
HTML("""
<style>
h2 { color: blue !important; }
h3 { color: green !important; }
</style>
""")
# --- Code cell 4 ---
import pandas as pd
from scipy import stats
# --- Code cell 5 ---
data = pd.read_csv("AB_testing_data.csv")
# --- Code cell 6 ---
len(data)
# --- Code cell 7 ---
data.head(10)
# --- Code cell 8 ---
data.info()
# --- Code cell 9 ---
data.describe()
# --- Code cell 11 ---
samples_set1 = data['Conversion fraction A']
samples_set2 = data['Conversion fraction B']
stat, p = stats.ttest_ind(samples_set1, samples_set2,equal_var = True)
print("AB test results: ")
print("p-value : ", p)
print("")
print("")
# --- Code cell 12 ---
1-0.00034704350989135126
# --- Code cell 13 ---
1-0.05
# --- Code cell 14 ---
# p value < 0.05 so two versions of website have different means for conversion rate - more than 95% confidence
In one sentence: why is it important to run an A/B test for at least one full week (or more) before deciding a winner?
| Concept | Simple Explanation |
|---|---|
| A/B Test | Comparing two versions to see which performs better |
| Control (A) | The current version (what we're comparing against) |
| Variant (B) | The new version we're testing |
| Conversion Rate | % of users who took the desired action |
| p-value | How often we'd see a difference this big by luck alone, if there were no real difference |
| Significance (p < 0.05) | A gap this big would show up by luck less than 5% of the time, so we treat the difference as real |
You now understand A/B testing - a skill used by data scientists at top tech companies!