🛒 RETAIL & E-COMMERCE

🛒 Market Basket Analysis

Find which products customers buy together – like "bread + butter" – using association rules and the Apriori algorithm!

What is Market Basket Analysis?

👶 Explain Like I'm 5

Market Basket Analysis is used by retailers to understand customer purchasing patterns. It looks at what people buy together – like "people who buy milk often buy bread too" – so stores can place products nearby, run bundles, or recommend "others also bought."

In one sentence: Market Basket Analysis = Finding which products customers buy together.

It involves analyzing large transaction datasets (e.g. purchase history) to identify products that are likely to be purchased together, both confirming associations the retailer already suspects and uncovering ones it did not know about.

Key Metrics (Support, Confidence, Lift)

Metric | Meaning (Layman)
Support | How often a set of items appears together across all transactions (e.g. "bread + butter" in 30% of baskets).
Confidence | When someone buys A, how often do they also buy B? (e.g. "if bread → butter" holds in 65% of bread baskets).
Lift | How much more likely B is bought when A is bought, compared with how often B is bought overall. Lift > 1 means a positive association, lift < 1 a negative one, and lift = 1 no association.
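To make the three metrics concrete, here is a minimal sketch that computes them by hand. The toy baskets and the helper names (support, confidence, lift) are assumptions made up for illustration; they are not the course dataset or part of any library. It uses the usual definitions: support(A, B) = baskets containing both / all baskets, confidence(A → B) = support(A, B) / support(A), and lift(A → B) = confidence(A → B) / support(B).

```python
# Toy baskets, invented for illustration only (not the course dataset).
toy_baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence of the rule divided by the overall support of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


sup = support({"bread", "butter"}, toy_baskets)
conf = confidence({"bread"}, {"butter"}, toy_baskets)
lift_value = lift({"bread"}, {"butter"}, toy_baskets)
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift_value:.2f}")
```

In this toy data, bread and butter appear together in 3 of 5 baskets, and the lift above 1 says butter is more common in bread baskets than in baskets overall.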

Dataset for This Lesson

We use a grocery-style transaction dataset: each row is one transaction (one basket), and each cell in that row holds the name of one item. Empty/NaN cells simply pad out baskets that have fewer items than there are columns.

📥 Download the datasets: market_baskets_data.csv, Apriori Algorithm.xlsx

Save the CSV in the same folder as your Python script or notebook so pd.read_csv("market_baskets_data.csv") works.

Apriori Algorithm (Super Simple)

Apriori is a classic algorithm to find "frequent itemsets" (sets of items that appear together often enough) and then turn them into rules (e.g. "if milk then bread"). We set a minimum support: only itemsets above that threshold are kept.
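The level-by-level idea can be sketched in plain Python. This is a simplified teaching version under stated assumptions (toy baskets invented for illustration, a hypothetical helper named frequent_itemsets), not the implementation used by apyori or mlxtend. It relies on the Apriori property: an itemset can only be frequent if all of its smaller subsets are frequent.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    n = len(baskets)
    basket_sets = [set(b) for b in baskets]

    def support_of(itemset):
        return sum(itemset <= b for b in basket_sets) / n

    # Level 1: single items that meet the threshold.
    all_items = {item for b in basket_sets for item in b}
    current = {}
    for item in all_items:
        candidate = frozenset([item])
        s = support_of(candidate)
        if s >= min_support:
            current[candidate] = s

    result = dict(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        previous = list(current)
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support and keep only the candidates above the threshold.
        current = {}
        for c in candidates:
            s = support_of(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


# Toy baskets, invented for illustration only.
toy_baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
found = frequent_itemsets(toy_baskets, min_support=0.4)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With a 40% threshold, "jam" is dropped at level 1, so no pair containing it is ever counted; that pruning is what keeps the search manageable on real data.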

Step 1: Load Data and Convert to List of Lists

The Apriori library (apyori) expects transactions as a list of lists: each inner list is one basket (list of item names). We load the CSV and convert each row into a list of non-null items.

# Install: pip install apyori pandas
import pandas as pd

# Load the dataset (keep the CSV in the same folder as your script)
# Note: if your copy of the file has no header row, add header=None so the
# first basket is not read as column names
data = pd.read_csv("market_baskets_data.csv")

# Convert each row to a list of items (drop NaN)
# Each row = one transaction (one shopping basket)
baskets = []
for index, row in data.iterrows():
    items = [str(x).strip() for x in row if pd.notna(x) and str(x).strip() != '']
    baskets.append(items)

# baskets is now a list of lists, e.g. [['shrimp','almonds',...], ['burgers','meatballs',...], ...]
print("Number of transactions:", len(baskets))
print("First 3 baskets:", baskets[:3])

What each line does (in simple words)

pd.read_csv("market_baskets_data.csv") - Loads the CSV; each row is one basket (transaction), and each cell holds an item name.

baskets = [] - Empty list we will fill with one list per transaction.

for index, row in data.iterrows(): - Loops over each row of the DataFrame.

items = [str(x).strip() for x in row if ...] - Takes the non-empty, non-NaN values in that row and puts them in a list (one basket).

baskets.append(items) - Adds that basket to the list of all baskets.

len(baskets) - Number of transactions; baskets[:3] - First 3 baskets.
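To see the list-of-lists shape without needing the real file, here is a small sketch of the same row-to-basket conversion using only Python's standard csv module on an in-memory sample. The sample rows are invented for illustration and assume a file with no header row; the pandas version above remains the one used in this lesson.

```python
import csv
import io

# In-memory stand-in for a transactions CSV (contents invented for illustration).
# Trailing empty cells pad out the shorter baskets.
sample_csv = io.StringIO(
    "shrimp,almonds,avocado\n"
    "burgers,meatballs,\n"
    "chutney,,\n"
)

sample_baskets = []
for row in csv.reader(sample_csv):
    # Keep only non-empty cells, with surrounding whitespace removed.
    items = [cell.strip() for cell in row if cell.strip()]
    sample_baskets.append(items)

print(sample_baskets)
```

Each inner list has a different length, which is exactly what a list of baskets should look like: the padding cells are gone and only real items remain.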

Step 2: Run Apriori and Get Association Rules

We pass the list of baskets to apriori with min_support, min_confidence, and min_lift. The function returns a generator of relation records, so we convert it to a list to count the results and loop over them more than once.

from apyori import apriori

# Run Apriori: find frequent itemsets and rules
# min_support: keep itemsets that appear in at least 1% of transactions
# min_confidence: rule confidence at least 25%
# min_lift: at least 2 (stronger than random)
rules = apriori(baskets, min_support=0.01, min_confidence=0.25, min_lift=2)
rules_list = list(rules)

print("Number of rules found:", len(rules_list))

Step 3: Inspect the Rules

Each element in rules_list is a RelationRecord: it has items (the itemset), support, and ordered_statistics (which give confidence and lift for each rule like "item A → item B").

for rule in rules_list[:10]:
    support = rule.support  # support of the whole itemset
    for ord_stat in rule.ordered_statistics:
        items_base = list(ord_stat.items_base)  # left-hand side of the rule
        items_add = list(ord_stat.items_add)    # right-hand side of the rule
        conf = ord_stat.confidence
        lift = ord_stat.lift
        print(f"Rule: {items_base} -> {items_add} | support={support:.3f}, confidence={conf:.3f}, lift={lift:.3f}")
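One itemset can produce several rules, which is why each record carries a list of ordered statistics. The sketch below illustrates this on toy data: it splits an itemset into every "base → add" pair and computes confidence and lift for each. The baskets and the helper rules_from_itemset are hypothetical names invented for this example, and the code is a simplified illustration rather than apyori's own logic.

```python
from itertools import combinations

# Toy baskets, invented for illustration only.
toy_baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rules_from_itemset(itemset, baskets):
    """Split an itemset into every non-empty base -> add pair and score each rule."""
    itemset = frozenset(itemset)
    itemset_support = support(itemset, baskets)
    rules = []
    for size in range(1, len(itemset)):
        for base in combinations(sorted(itemset), size):
            base = frozenset(base)
            add = itemset - base
            conf = itemset_support / support(base, baskets)
            lift = conf / support(add, baskets)
            rules.append((sorted(base), sorted(add), conf, lift))
    return rules


toy_rules = rules_from_itemset({"bread", "butter"}, toy_baskets)
for base, add, conf, lift in toy_rules:
    print(f"{base} -> {add} | confidence={conf:.2f}, lift={lift:.2f}")
```

Both directions share the same support and lift, but confidence differs: lift is symmetric, while confidence depends on which side of the rule is the starting point.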

Alternative: Using mlxtend (apriori + association_rules)

Another popular library is mlxtend. It needs data in one-hot encoded form (one column per item, True/False for whether the basket contains it). We can build that from the same list of baskets using a transaction encoder, then call apriori and association_rules to get the rules as a DataFrame.

# Install: pip install mlxtend
# Note: this apriori is mlxtend's and shadows the apyori function imported earlier
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# baskets = list of lists (same as above)
te = TransactionEncoder()
te_ary = te.fit(baskets).transform(baskets)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

# Frequent itemsets (support >= 1%)
frequent = apriori(df_encoded, min_support=0.01, use_colnames=True)
# Association rules from those itemsets (confidence >= 25%)
rules_df = association_rules(frequent, metric="confidence", min_threshold=0.25)
# Keep only rules with lift >= 2, matching the apyori thresholds used earlier
rules_df = rules_df[rules_df["lift"] >= 2]
print(rules_df.head(10))
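If "one-hot encoded" is unfamiliar, this plain-Python sketch shows the shape of the table a transaction encoder produces: one column per distinct item and one True/False row per basket. The toy baskets are invented for illustration, and the code is only a conceptual stand-in, not mlxtend's implementation.

```python
# Toy baskets, invented for illustration only.
toy_baskets = [
    ["bread", "butter"],
    ["milk"],
    ["bread", "milk", "eggs"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in toy_baskets for item in basket})

# One row per basket: True where the basket contains that column's item.
one_hot = [[item in basket for item in columns] for basket in toy_baskets]

print(columns)
for row in one_hot:
    print(row)
```

In this layout, the support of a single item is simply the fraction of True values in its column.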

🚫 Common Mistakes in Market Basket Analysis

📘 From the course notebook (Market Basket Analysis)

The course source uses market_baskets_data.csv, where each row is one transaction and the columns hold its items (or one-hot encoded columns when using mlxtend). Use mlxtend.frequent_patterns.apriori and association_rules with min_support and min_threshold, then interpret support, confidence, and lift. Download market_baskets_data.csv from the datasets page. For slides, see "AB testing and Market Basket Analysis.pdf" in the course source.

Complete code from course notebook: market_basket_analysis.ipynb

All code cells from the notebook, in order.

# --- Code cell 1 ---
from IPython.display import HTML

HTML("""
<style>

h2 { color: blue !important; }
h3 { color: green !important; }
</style>
""")

# --- Code cell 4 ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# --- Code cell 5 ---
data = pd.read_csv("market_baskets_data.csv")

# --- Code cell 6 ---
data.head(10)

# --- Code cell 7 ---
data.info()

# --- Code cell 8 ---
#[a,b,c]
#[e,f]
#[a,e,f,k,l ....j]

# --- Code cell 9 ---
len(data.columns)

# --- Code cell 12 ---
# Let's get the data as a list of lists for the apriori algorithm
# (one inner list per row, skipping NaN cells, for however many columns the file has)
baskets = [[str(value) for value in row if pd.notna(value)] for row in data.values]

# --- Code cell 13 ---
#overall_data - should be list
# within list - a list each for every basket

# --- Code cell 14 ---
print(baskets)

# --- Code cell 15 ---
type(baskets)

# --- Code cell 16 ---
len(baskets)

# --- Code cell 17 ---
baskets[0]

# --- Code cell 18 ---
baskets[1]

# --- Code cell 19 ---
from apyori import apriori # pip install apyori ( Go to Anaconda Prompt)
help(apriori)

# --- Code cell 20 ---
rules = apriori(baskets, min_support=0.01, min_confidence=0.25, min_lift=2)
result = list(rules)

# --- Code cell 21 ---
# Frozen set is just an immutable version of a Python set object.
# While elements of a set can be modified at any time, elements of the frozen set remain the same after creation. 
# Due to this, frozen sets can be used as keys in Dictionary or as elements of another set.
result[0]

# --- Code cell 22 ---
print("Number of rules found with given thresholds :", len(result))

# --- Code cell 23 ---

items_rules = result[0]
items_rules

# --- Code cell 24 ---
#RelationRecord is data format in which apriori algorithm returns results
type(items_rules)

# --- Code cell 25 ---
items_rules[0]

# --- Code cell 26 ---
items_rules[1]

# --- Code cell 27 ---
items_rules[2][0][2]

# --- Code cell 28 ---
items_rules[2][0][3]

# --- Code cell 29 ---
results_data = pd.DataFrame(0.0,columns = ['association rule','support','confidence','lift'],
                            index=range(0,20))

# --- Code cell 30 ---
results_data

# --- Code cell 31 ---
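# Collect one row per RelationRecord (first ordered statistic only), then build
# the DataFrame in one step so it works for any number of rules
rows = []
for items_rules in result:
    rows.append({'association rule': list(items_rules[0]),
                 'support': items_rules[1],
                 'confidence': items_rules[2][0][2],
                 'lift': items_rules[2][0][3]})
results_data = pd.DataFrame(rows, columns=['association rule','support','confidence','lift'])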

# --- Code cell 32 ---
results_data
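The notebook table keeps only the first rule of each record. As a hedged alternative, the sketch below flattens every ordered statistic into its own row, using attribute names instead of numeric indexes. The two namedtuple classes are stand-ins defined here to mirror the fields described in this lesson (items, support, ordered_statistics, items_base, items_add, confidence, lift), and the numbers are invented; with real output you would pass the list returned by apriori instead.

```python
from collections import namedtuple

# Stand-in record types, assumed for illustration so the example runs without apyori.
RelationRecordStub = namedtuple(
    "RelationRecordStub", ["items", "support", "ordered_statistics"]
)
OrderedStatisticStub = namedtuple(
    "OrderedStatisticStub", ["items_base", "items_add", "confidence", "lift"]
)

# Invented numbers, not results from the course dataset.
fake_result = [
    RelationRecordStub(
        items=frozenset({"bread", "butter"}),
        support=0.6,
        ordered_statistics=[
            OrderedStatisticStub(frozenset({"bread"}), frozenset({"butter"}), 0.75, 1.25),
            OrderedStatisticStub(frozenset({"butter"}), frozenset({"bread"}), 1.0, 1.25),
        ],
    ),
]


def flatten(records):
    """Return one dict per rule (base -> add), not one per itemset."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "rule": f"{sorted(stat.items_base)} -> {sorted(stat.items_add)}",
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows


flat_rows = flatten(fake_result)
for row in flat_rows:
    print(row)
```

The resulting list of dicts can be handed to pd.DataFrame(flat_rows) to get a table that can be sorted by lift or confidence.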

💭 Short reflection

In one sentence: why is "lift" a better measure than "confidence" when deciding which product to recommend to a customer next?

✅ CORE (Must know)

📚 NON-CORE (Good to know)

Summary