Market Basket Analysis | Fakhruddin Khambaty's Learning Hub

What is Market Basket Analysis?

👶 Explain Like I'm 5

Market Basket Analysis is used by retailers to understand customer purchasing patterns. It looks at what people buy together – like "people who buy milk often buy bread too" – so stores can place products nearby, run bundles, or recommend "others also bought."

In one sentence: Market Basket Analysis = Finding which products customers buy together.

It involves analyzing large datasets (e.g. purchase history) to identify products that are likely to be purchased together, both confirming associations you already expect and uncovering ones you did not.

Key Metrics (Support, Confidence, Lift)

Metric	Meaning (Layman)
Support	How often a set of items appears together in all transactions (e.g. "bread + butter" in 30% of baskets).
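Confidence	When someone buys A, how often do they also buy B? (e.g. the rule "bread → butter" has confidence 65% if 65% of baskets containing bread also contain butter).
Lift	How much more likely B is bought when A is bought, compared with how often B is bought overall (confidence ÷ support of B). Lift > 1 = positive association; < 1 = negative; = 1 = no association.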
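
To make the three metrics concrete, here is a minimal sketch that computes them by hand. The five toy baskets and the helper names (support, confidence, lift) are invented for illustration and are not taken from market_baskets_data.csv.

```python
# Toy baskets invented for illustration (not from the lesson dataset)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by how often the consequent appears overall."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

s = support({"bread", "butter"}, baskets)       # 3 of 5 baskets
c = confidence({"bread"}, {"butter"}, baskets)  # 3 of the 4 bread baskets
l = lift({"bread"}, {"butter"}, baskets)        # 0.75 / 0.60
print(f"support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
# prints: support=0.60, confidence=0.75, lift=1.25
```

Here butter appears in 60% of all baskets but in 75% of bread baskets, so the lift is above 1.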

Dataset for This Lesson

We use a grocery-style transaction dataset: each row is one transaction (one basket), and each cell in that row holds the name of one item in the basket. Empty/NaN cells just mean the basket had fewer items than there are columns.

📥 Download the datasets: market_baskets_data.csv, Apriori Algorithm.xlsx

Save the CSV in the same folder as your Python script or notebook so pd.read_csv("market_baskets_data.csv") works.

Apriori Algorithm (Super Simple)

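Apriori is a classic algorithm to find "frequent itemsets" (sets of items that appear together often enough) and then turn them into rules (e.g. "if milk then bread"). We set a minimum support: only itemsets at or above that threshold are kept. The algorithm stays fast by using one fact: an itemset can only be frequent if all of its smaller subsets are frequent, so it never needs to count larger sets built on a rare item.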
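
The level-by-level search can be sketched in plain Python. This is a simplified illustration, not the code that apyori or mlxtend actually run; the function name frequent_itemsets and the toy baskets are made up for this example.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Return {frozenset(items): support} for every itemset meeting min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: keep single items that are frequent on their own
    all_items = {item for b in baskets for item in b}
    current = {}
    for item in all_items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune: drop candidates that have any infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count: keep only candidates that meet the support threshold
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets invented for illustration
toy = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
found = frequent_itemsets(toy, min_support=0.4)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With min_support=0.4, "jam" is dropped at level 1 (it is in only 1 of 5 baskets), so no pair containing jam is ever counted. That pruning is the whole point of Apriori.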

Step 1: Load Data and Convert to List of Lists

The Apriori library (apyori) expects transactions as a list of lists: each inner list is one basket (list of item names). We load the CSV and convert each row into a list of non-null items.

# Install: pip install apyori pandas
import pandas as pd

# Load the dataset (keep the CSV in the same folder as your script)
data = pd.read_csv("market_baskets_data.csv")

# Convert each row to a list of items (drop NaN)
# Each row = one transaction (one shopping basket)
baskets = []
for index, row in data.iterrows():
    items = [str(x).strip() for x in row if pd.notna(x) and str(x).strip() != '']
    baskets.append(items)

# baskets is now a list of lists, e.g. [['shrimp','almonds',...], ['burgers','meatballs',...], ...]
print("Number of transactions:", len(baskets))
print("First 3 baskets:", baskets[:3])

What each line does (in simple words)

pd.read_csv("market_baskets_data.csv") loads the CSV; each row is one basket (transaction), and each non-empty cell in the row holds one item name.

baskets = [] — Empty list we will fill with one list per transaction.

for index, row in data.iterrows(): — Loops over each row of the DataFrame.

items = [str(x).strip() for x in row if ...] — Takes non-empty, non-NaN values in that row and puts them in a list (one basket).

baskets.append(items) — Adds that basket to the list of all baskets.

len(baskets) — Number of transactions; baskets[:3] — First 3 baskets.

Step 2: Run Apriori and Get Association Rules

We pass the list of baskets to apriori with min_support, min_confidence, and min_lift. The function returns relation records we can loop over.

from apyori import apriori

# Run Apriori: find frequent itemsets and rules
# min_support: keep itemsets that appear in at least 1% of transactions
# min_confidence: rule confidence at least 25%
# min_lift: at least 2 (stronger than random)
rules = apriori(baskets, min_support=0.01, min_confidence=0.25, min_lift=2)
rules_list = list(rules)

print("Number of rules found:", len(rules_list))

Step 3: Inspect the Rules

Each element in rules_list is a RelationRecord: it has items (the itemset), support, and ordered_statistics (which give confidence and lift for each rule like "item A → item B").

for rule in rules_list[:10]:
    support = rule.support  # support of the whole itemset
    for ord_stat in rule.ordered_statistics:
        # sorted() turns each frozenset into a list with a stable display order
        items_base = sorted(ord_stat.items_base)  # left-hand side of the rule
        items_add = sorted(ord_stat.items_add)    # right-hand side of the rule
        conf = ord_stat.confidence
        lift = ord_stat.lift
        print(f"Rule: {items_base} -> {items_add} | support={support:.3f}, confidence={conf:.3f}, lift={lift:.3f}")

Alternative: Using mlxtend (apriori + association_rules)

Another popular library is mlxtend. It needs the data in one-hot encoded form: one column per item, with True/False (or 1/0) marking whether that item is in each basket. We can build that from the same baskets list using TransactionEncoder, then call apriori and association_rules to get the rules as a DataFrame.

# Install: pip install mlxtend
# Note: this import rebinds the name `apriori`, replacing the apyori function imported in Step 2
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# baskets = list of lists (same as above)
te = TransactionEncoder()
te_ary = te.fit(baskets).transform(baskets)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent itemsets
frequent = apriori(df_encoded, min_support=0.01, use_colnames=True)
# Association rules from those itemsets
rules_df = association_rules(frequent, metric="confidence", min_threshold=0.25)
print(rules_df.head(10))
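
To see what "one-hot encoded" means before handing the data to mlxtend, here is a plain-Python sketch of roughly the table that TransactionEncoder builds. The helper name one_hot and the three toy baskets are hypothetical, chosen only for illustration.

```python
def one_hot(baskets):
    """Return (sorted item names, one list of booleans per basket)."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = [[item in basket for item in columns] for basket in baskets]
    return columns, rows

# Toy baskets invented for illustration
columns, rows = one_hot([["milk", "bread"], ["bread", "butter"], ["milk"]])
print(columns)
for row in rows:
    print(row)
# prints:
# ['bread', 'butter', 'milk']
# [True, False, True]
# [True, True, False]
# [False, False, True]
```

Each row is still one basket; the only change is that every possible item now has its own column.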

🚫 Common Mistakes in Market Basket Analysis

Using only confidence — High confidence can be misleading if B is very common; use lift to see if the association is stronger than random.
Setting min_support too low — You get too many rules and slow runtimes; set it high enough that itemsets are meaningful (e.g. 0.01–0.05 for large datasets).
Treating rules as causation — "If A then B" is an association, not "A causes B"; use for recommendations, not for claiming cause and effect.
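
The first mistake is easy to see with numbers. The counts below are made up for illustration, and rule_metrics is a hypothetical helper: bread is in 80% of all baskets, so the rule "chips → bread" gets a high confidence even though chips buyers are slightly less likely than average to buy bread.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift for the rule A -> B, computed from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_a
    lift = confidence / (n_b / n_total)
    return support, confidence, lift

# Invented counts: 1000 baskets, 100 with chips (A), 800 with bread (B), 75 with both
support, confidence, lift = rule_metrics(n_total=1000, n_a=100, n_b=800, n_both=75)
print(f"chips -> bread: support={support:.3f}, confidence={confidence:.2f}, lift={lift:.4f}")
```

Confidence is 0.75, which looks strong, but lift comes out just under 1 because 80% of all customers buy bread anyway.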

📘 From the course notebook (Market Basket Analysis)

The course source uses market_baskets_data.csv, where each row is one transaction. The notebook below runs apyori's apriori with min_support, min_confidence, and min_lift; mlxtend.frequent_patterns.apriori and association_rules (with min_support and min_threshold) are an alternative that works on a one-hot table. Either way, interpret the results through support, confidence, and lift. Download market_baskets_data.csv from the datasets page. See AB testing and Market Basket Analysis.pdf in the course source for slides.

Complete code from course notebook: market_basket_analysis.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>

h2 { color: blue !important; }
h3 { color: green !important; }
</style>
""")

# --- Code cell 4 ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# --- Code cell 5 ---
data = pd.read_csv("market_baskets_data.csv")

# --- Code cell 6 ---
data.head(10)

# --- Code cell 7 ---
data.info()

# --- Code cell 8 ---
#[a,b,c]
#[e,f]
#[a,e,f,k,l ....j]

# --- Code cell 9 ---
len(data.columns)

# --- Code cell 12 ---
# Lets get the data in the form of lists of lists for the apriori algorithm
baskets = [[str(data.values[i, j]) for j in range(0, 20) if str(data.values[i, j])!='nan'] for i in range(0, len(data))]

# --- Code cell 13 ---
#overall_data - should be list
# within list - a list each for every basket

# --- Code cell 14 ---
print(baskets)

# --- Code cell 15 ---
type(baskets)

# --- Code cell 16 ---
len(baskets)

# --- Code cell 17 ---
baskets[0]

# --- Code cell 18 ---
baskets[1]

# --- Code cell 19 ---
from apyori import apriori # pip install apyori ( Go to Anaconda Prompt)
help(apriori)

# --- Code cell 20 ---
rules = apriori(baskets, min_support=0.01, min_confidence=0.25, min_lift=2)
result = list(rules)

# --- Code cell 21 ---
# A frozenset is just an immutable version of a Python set object.
# While a set can have elements added or removed at any time, a frozenset cannot change after creation.
# Because it is hashable, a frozenset can be used as a key in a dictionary or as an element of another set.
result[0]

# --- Code cell 22 ---
print("Number of rules found with given thresholds :", len(result))

# --- Code cell 23 ---

items_rules = result[0]
items_rules

# --- Code cell 24 ---
#RelationRecord is data format in which apriori algorithm returns results
type(items_rules)

# --- Code cell 25 ---
items_rules[0]

# --- Code cell 26 ---
items_rules[1]

# --- Code cell 27 ---
items_rules[2][0][2]

# --- Code cell 28 ---
items_rules[2][0][3]

# --- Code cell 29 ---
results_data = pd.DataFrame(0.0,columns = ['association rule','support','confidence','lift'],
                            index=range(0,20))

# --- Code cell 30 ---
results_data

# --- Code cell 31 ---
for z in range(len(result)):

    items_rules = result[z]
    results_data['association rule'][z] = list(items_rules[0])
    results_data['support'][z] = items_rules[1]
    results_data['confidence'][z] = items_rules[2][0][2]
    results_data['lift'][z] = items_rules[2][0][3]

# --- Code cell 32 ---
results_data
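
Cells 29 to 31 above pre-allocate exactly 20 rows, fill them with chained indexing, and keep only the first rule of each itemset, which can break or lose rules when the thresholds change. A possible alternative, sketched under assumptions: collect one dictionary per rule and build the table at the end. The two namedtuples below are stand-ins that mirror apyori's field names so the sketch runs without the library; records_to_rows is a hypothetical helper, and the demo numbers are invented, not results from the dataset.

```python
from collections import namedtuple

# Stand-ins mirroring apyori's field names (so this runs without apyori installed)
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])

def records_to_rows(records):
    """One dict per rule (not per itemset), sorted by lift, strongest first."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": sorted(stat.items_base),
                "consequent": sorted(stat.items_add),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return sorted(rows, key=lambda r: r["lift"], reverse=True)

# Invented demo record (the numbers are NOT real results from the dataset)
demo = [
    RelationRecord(
        items=frozenset({"pasta", "escalope"}),
        support=0.02,
        ordered_statistics=[
            OrderedStatistic(frozenset({"escalope"}), frozenset({"pasta"}), 0.30, 2.5),
            OrderedStatistic(frozenset({"pasta"}), frozenset({"escalope"}), 0.40, 3.1),
        ],
    )
]
rows = records_to_rows(demo)
for row in rows:
    print(row)
```

With the real output from cell 20, pd.DataFrame(records_to_rows(result)) should give a table with one row per rule, however many rules the thresholds produce.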

💭 Short reflection

In one sentence: why is “lift” a better measure than “confidence” when deciding which product to recommend next to a customer?

✅ CORE (Must know)

Market basket analysis: find items bought together (frequent itemsets, association rules).
Support: how often itemset appears; confidence: P(B|A); lift: strength vs random.
Apriori: mine frequent itemsets with min_support; then derive rules (confidence/lift).
Data format: transactions as list of items or one-hot matrix; use mlxtend/apyori.

📚 NON-CORE (Good to know)

FP-Growth as alternative; multiple metrics (conviction, leverage).

Summary

Market Basket Analysis finds which products are bought together.
Support = how often; Confidence = when A then B; Lift = strength vs random.
Use apyori (list of lists) or mlxtend (one-hot + apriori + association_rules).
Download market_baskets_data.csv and Apriori Algorithm.xlsx and run the code step by step!
