
Customer Segmentation Model

Segment customers by behavior using RFM (recency, frequency, monetary) and K-Means clustering. Build a production-ready model in Python and SQL.

Python · SQL · Data Science

πŸ“š Before you start, learn these courses:

Python β†’ Data Science β†’

The Scenario

Imagine this…

Your company has thousands of customers. Marketing can't treat everyone the same β€” some buy often and spend a lot, others browse and rarely convert, and some used to buy but stopped. Sending the same generic email or discount to everyone wastes money and annoys your best customers.

You need to segment customers by behavior. Enter RFM: Recency (how recently they bought), Frequency (how often they buy), and Monetary (how much they spend). This project teaches you to compute RFM in SQL, run K-Means clustering in Python, and turn raw data into actionable segments (Champions, At Risk, New, Lost) so marketing can run targeted campaigns.

What You'll Build

SQL β†’ RFM Features β†’ K-Means β†’ Segments β†’ Dashboard
  • SQL layer: A query that computes per-customer RFM: days_since_last_order (recency), count_orders (frequency), sum_amount (monetary).
  • Python feature prep: Load the RFM table, normalize with StandardScaler (so distance-based K-Means isn't dominated by one feature).
  • K-Means clustering: Choose K with the elbow method, fit the model, assign segment labels to each customer.
  • Segment profiling: Profile each cluster (Champions, At Risk, New, Lost) with mean RFM and plain-language descriptions.
  • Save and visualize: Export to CSV or database. Optional: bar chart of segment sizes, scatter plot of recency vs monetary colored by segment.

In plain English

We turn each customer into three numbers: how recently they bought, how often they buy, and how much they spend. We scale them so no single number dominates, then K-Means groups similar customers into segments. Each customer gets a label (Champions, At Risk, New, Lost) you can use for targeting.
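If the SQL below feels abstract, here is the same computation sketched in pandas on a toy orders table. The customer IDs, amounts, and the fixed "as of" date are made up for illustration; the column names mirror the SQL used later:

```python
import pandas as pd

# Toy orders table (assumed columns: customer_id, amount, order_date)
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'amount': [89.99, 45.50, 199.99, 78.50, 55.00, 25.99],
    'order_date': pd.to_datetime(
        ['2025-03-10', '2025-02-15', '2025-03-12', '2025-02-28', '2025-01-10', '2024-06-01']),
})

today = pd.Timestamp('2025-03-15')  # fixed "as of" date so the example is reproducible

rfm = orders.groupby('customer_id').agg(
    count_orders=('order_date', 'count'),  # Frequency: how often they buy
    sum_amount=('amount', 'sum'),          # Monetary: how much they spend
    last_order=('order_date', 'max'),
)
rfm['days_since_last_order'] = (today - rfm['last_order']).dt.days  # Recency

print(rfm[['days_since_last_order', 'count_orders', 'sum_amount']])
```

Customer 1 comes out as 2 orders, 135.49 spent, 5 days since the last order; customer 3 as 1 order from 287 days ago. Three numbers per customer, ready for clustering.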

Prerequisites

Before you start, make sure you're comfortable with these. We link to the exact courses below.

  • SQL: SELECT, WHERE, JOIN, GROUP BY, aggregate functions (COUNT, SUM). You should be able to write a query that joins customers and orders and groups by customer. (PostgreSQL Course)
  • Python (pandas, sklearn): DataFrames, read_sql or psycopg2, StandardScaler, KMeans. Basic familiarity with numpy/pandas. (Python Course)
  • Clustering intro: what clustering is, how K-Means works (grouping by distance), why we need to choose K. (Clustering Course)

Step-by-Step Plan

Here's the high-level roadmap. The Build It tab walks you through every step with copy-paste code, sample data, and explanations.

  1. Build RFM features in SQL. Write a SQL query that computes per customer: days_since_last_order (recency), count_orders (frequency), sum_amount (monetary). Start with sample customers and orders tables.
  2. Load into Python. Connect to Postgres with psycopg2 or use pd.read_sql. Load the RFM dataframe. Inspect with df.head() and df.describe().
  3. Normalize features. K-Means uses distance, so big numbers (e.g. monetary in thousands) would dominate. Use StandardScaler to put recency, frequency, and monetary on the same scale.
  4. Choose K with the elbow method. Loop K=2 to 8, compute inertia, plot an elbow curve. Use the "elbow" point to pick the number of segments.
  5. Run K-Means. Fit KMeans(n_clusters=4) on the scaled features. Assign cluster labels to each customer. Add a segment column to your DataFrame.
  6. Profile segments. groupby('segment').mean() to see average RFM per cluster. Name each segment (Champions, At Risk, New, Lost) and write 2–3 sentences describing who they are and what action to take.
  7. Save and visualize. Export to CSV or back to the database. Optional: matplotlib bar chart of segment sizes; seaborn scatter of recency vs monetary colored by segment.

Step 1: Build RFM Features in SQL

Create sample data and compute recency, frequency, monetary per customer

What we're doing and why

RFM stands for Recency (when did they last buy?), Frequency (how often do they buy?), and Monetary (how much do they spend?). These three numbers capture customer behavior and are the input for K-Means. We compute them in SQL because it's fast on large datasets and keeps logic in one place. First we create sample tables (customers + orders), then write the RFM query.

1a. Create sample tables and insert data

Create a customers table and an orders table with 50 customers and roughly 125 orders. Run this in your Postgres database:

SQL
-- Create tables
CREATE TABLE customers (
    customer_id   SERIAL PRIMARY KEY,
    name          TEXT,
    email         TEXT,
    created_at    DATE DEFAULT CURRENT_DATE
);

CREATE TABLE orders (
    order_id      SERIAL PRIMARY KEY,
    customer_id   INTEGER REFERENCES customers(customer_id),
    amount        NUMERIC(10,2),
    order_date    DATE
);

-- Insert 50 customers
INSERT INTO customers (name, email) VALUES
('Alice Johnson', 'alice@email.com'),
('Bob Smith', 'bob@email.com'),
('Charlie Davis', 'charlie@email.com'),
('Diana Lee', 'diana@email.com'),
('Eve Martinez', 'eve@email.com'),
('Frank Brown', 'frank@email.com'),
('Grace Wilson', 'grace@email.com'),
('Hank Taylor', 'hank@email.com'),
('Ivy Chen', 'ivy@email.com'),
('Jack White', 'jack@email.com'),
('Kate Green', 'kate@email.com'),
('Leo Adams', 'leo@email.com'),
('Mia Clark', 'mia@email.com'),
('Noah Hall', 'noah@email.com'),
('Olivia King', 'olivia@email.com'),
('Paul Scott', 'paul@email.com'),
('Quinn Turner', 'quinn@email.com'),
('Rachel Moore', 'rachel@email.com'),
('Sam Wright', 'sam@email.com'),
('Tina Brooks', 'tina@email.com'),
('Uma Patel', 'uma@email.com'),
('Victor Singh', 'victor@email.com'),
('Wendy Gupta', 'wendy@email.com'),
('Xander Kumar', 'xander@email.com'),
('Yuki Tanaka', 'yuki@email.com'),
('Zara Khan', 'zara@email.com'),
('Aaron Bell', 'aaron@email.com'),
('Beth Cole', 'beth@email.com'),
('Chris Foster', 'chris@email.com'),
('Dana Hill', 'dana@email.com'),
('Eric Mitchell', 'eric@email.com'),
('Fiona Russell', 'fiona@email.com'),
('George Hayes', 'george@email.com'),
('Hannah Barnes', 'hannah@email.com'),
('Ian Fisher', 'ian@email.com'),
('Jill Powell', 'jill@email.com'),
('Kevin Price', 'kevin@email.com'),
('Laura Reed', 'laura@email.com'),
('Mark Sanders', 'mark@email.com'),
('Nina Watson', 'nina@email.com'),
('Oscar Brooks', 'oscar@email.com'),
('Paula Cooper', 'paula@email.com'),
('Ray Richardson', 'ray@email.com'),
('Sue Cox', 'sue@email.com'),
('Tom Howard', 'tom@email.com'),
('Ursula Ward', 'ursula@email.com'),
('Vince Torres', 'vince@email.com'),
('Wanda Peterson', 'wanda@email.com'),
('Xavier Gray', 'xavier@email.com'),
('Yolanda Ramirez', 'yolanda@email.com');

Now insert ~125 orders spanning the last 12 months (a mix of recency, frequency, and monetary values):

SQL
-- Insert ~125 orders (vary customer_id, amount, order_date for RFM diversity)
INSERT INTO orders (customer_id, amount, order_date) VALUES
(1, 89.99, '2025-03-10'), (1, 45.50, '2025-02-15'), (1, 120.00, '2025-01-20'), (1, 34.99, '2024-12-05'),
(2, 199.99, '2025-03-12'), (2, 78.50, '2025-02-28'), (2, 55.00, '2025-01-10'),
(3, 25.99, '2024-06-01'), (3, 18.50, '2024-05-15'),
(4, 450.00, '2025-03-14'), (4, 320.00, '2025-02-20'), (4, 180.00, '2025-01-25'), (4, 95.00, '2024-12-10'), (4, 210.00, '2024-11-01'),
(5, 67.99, '2025-02-01'), (5, 42.00, '2024-11-15'),
(6, 89.00, '2024-08-20'), (6, 56.00, '2024-07-01'),
(7, 120.50, '2025-03-08'), (7, 88.00, '2025-02-14'), (7, 155.00, '2025-01-05'),
(8, 35.00, '2024-05-10'),
(9, 280.00, '2025-03-15'), (9, 190.00, '2025-02-22'), (9, 95.00, '2025-01-18'), (9, 120.00, '2024-12-01'),
(10, 45.99, '2025-02-25'), (10, 32.00, '2025-01-12'),
(11, 78.50, '2025-03-05'), (11, 92.00, '2025-02-10'), (11, 65.00, '2025-01-08'),
(12, 150.00, '2024-09-15'), (12, 88.00, '2024-08-01'),
(13, 199.99, '2025-03-11'), (13, 145.00, '2025-02-18'), (13, 78.00, '2025-01-22'),
(14, 22.50, '2024-04-20'),
(15, 310.00, '2025-03-13'), (15, 225.00, '2025-02-25'), (15, 180.00, '2025-01-15'), (15, 95.00, '2024-12-20'),
(16, 55.00, '2024-10-01'),
(17, 98.99, '2025-03-09'), (17, 72.00, '2025-02-12'),
(18, 42.50, '2024-07-15'), (18, 28.00, '2024-06-20'),
(19, 165.00, '2025-02-28'), (19, 110.00, '2025-01-20'),
(20, 38.00, '2024-05-25'),
(21, 420.00, '2025-03-14'), (21, 380.00, '2025-02-20'), (21, 290.00, '2025-01-25'), (21, 195.00, '2024-12-15'), (21, 250.00, '2024-11-10'),
(22, 67.50, '2025-02-05'), (22, 54.00, '2024-12-01'),
(23, 135.00, '2025-03-07'), (23, 89.00, '2025-02-18'), (23, 120.00, '2025-01-10'),
(24, 48.99, '2024-08-15'),
(25, 89.00, '2025-03-02'), (25, 62.00, '2025-01-28'),
(26, 178.00, '2025-03-12'), (26, 145.00, '2025-02-22'), (26, 98.00, '2025-01-15'),
(27, 35.50, '2024-06-10'),
(28, 95.00, '2025-02-14'), (28, 72.00, '2024-12-20'),
(29, 210.00, '2025-03-11'), (29, 165.00, '2025-02-25'), (29, 130.00, '2025-01-18'), (29, 88.00, '2024-12-05'),
(30, 52.00, '2024-09-01'),
(31, 145.50, '2025-03-08'), (31, 112.00, '2025-02-15'), (31, 85.00, '2025-01-12'),
(32, 28.99, '2024-05-15'),
(33, 198.00, '2025-03-13'), (33, 156.00, '2025-02-28'), (33, 95.00, '2025-01-20'),
(34, 42.00, '2024-07-20'),
(35, 265.00, '2025-03-15'), (35, 195.00, '2025-02-20'), (35, 150.00, '2025-01-25'), (35, 120.00, '2024-12-10'),
(36, 68.50, '2025-02-08'), (36, 45.00, '2024-11-01'),
(37, 88.00, '2024-08-10'), (37, 55.00, '2024-06-15'),
(38, 112.99, '2025-03-06'), (38, 78.00, '2025-02-12'), (38, 92.00, '2025-01-08'),
(39, 32.50, '2024-04-25'),
(40, 175.00, '2025-03-10'), (40, 130.00, '2025-02-18'), (40, 95.00, '2025-01-15'),
(41, 48.00, '2024-10-15'),
(42, 385.00, '2025-03-14'), (42, 320.00, '2025-02-22'), (42, 245.00, '2025-01-20'), (42, 180.00, '2024-12-01'), (42, 210.00, '2024-11-15'),
(43, 72.50, '2025-02-25'), (43, 58.00, '2025-01-10'),
(44, 125.00, '2025-03-05'), (44, 98.00, '2025-02-14'), (44, 135.00, '2025-01-18'),
(45, 38.99, '2024-06-20'),
(46, 155.00, '2025-03-09'), (46, 118.00, '2025-02-20'), (46, 85.00, '2025-01-12'),
(47, 55.50, '2024-09-10'),
(48, 92.00, '2025-02-28'), (48, 67.00, '2024-12-15'),
(49, 225.00, '2025-03-12'), (49, 178.00, '2025-02-25'), (49, 142.00, '2025-01-22'), (49, 95.00, '2024-12-05'),
(50, 45.00, '2024-08-01');

-- Add more orders for some customers to increase frequency (e.g. champions)
INSERT INTO orders (customer_id, amount, order_date) VALUES
(1, 56.00, '2024-11-15'), (1, 78.50, '2024-10-01'), (4, 165.00, '2024-10-20'),
(21, 320.00, '2024-10-01'), (21, 195.00, '2024-09-15'), (42, 145.00, '2024-10-10');

1b. Write the RFM query

Now write the SQL that computes per-customer RFM. We need: days_since_last_order (recency), count_orders (frequency), sum_amount (monetary).

SQL
-- RFM: Recency, Frequency, Monetary per customer
WITH order_summary AS (
    SELECT
        customer_id,
        COUNT(*)                    AS count_orders,
        SUM(amount)                 AS sum_amount,
        MAX(order_date)             AS last_order_date
    FROM orders
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.name,
    c.email,
    COALESCE(o.count_orders, 0)     AS count_orders,
    COALESCE(o.sum_amount, 0)       AS sum_amount,
    CURRENT_DATE - COALESCE(o.last_order_date, c.created_at)::DATE AS days_since_last_order
FROM customers c
LEFT JOIN order_summary o ON c.customer_id = o.customer_id
ORDER BY c.customer_id;

Tip

COALESCE handles customers with no orders: they get 0 for count and sum, and recency is measured from their signup date. In real data you might filter them out or treat them as their own "New" segment.

What happened

  • We created customers and orders tables with 50 customers and ~125 orders.
  • The RFM query joins orders, aggregates per customer, and computes recency as days since last order. Each customer gets one row with count_orders, sum_amount, days_since_last_order.

Step 2: Load into Python

Connect to Postgres and load the RFM dataframe

What we're doing and why

We need the RFM data in Python to run clustering. You can use psycopg2 (raw connection + pandas) or pandas read_sql with SQLAlchemy. Either way, we end up with a DataFrame that has customer_id, count_orders, sum_amount, days_since_last_order.

2a. Connect and load with psycopg2 + pandas

Python
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

# Option A: Using psycopg2 directly
conn = psycopg2.connect(
    host='localhost',
    port=5432,
    dbname='your_database',
    user='your_user',
    password='your_password'
)

rfm_query = """
SELECT
    c.customer_id,
    c.name,
    COALESCE(o.count_orders, 0) AS count_orders,
    COALESCE(o.sum_amount, 0) AS sum_amount,
    (CURRENT_DATE - COALESCE(o.last_order_date, c.created_at)::DATE)::INT AS days_since_last_order
FROM customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS count_orders, SUM(amount) AS sum_amount, MAX(order_date) AS last_order_date
    FROM orders GROUP BY customer_id
) o ON c.customer_id = o.customer_id
"""

df = pd.read_sql(rfm_query, conn)  # pandas warns on raw DBAPI connections; Option B (SQLAlchemy) avoids it
conn.close()

# Option B: Using SQLAlchemy (alternative)
# engine = create_engine('postgresql://user:password@localhost:5432/your_database')
# df = pd.read_sql(rfm_query, engine)

2b. Inspect the dataframe

Python
print(df.head())
#    customer_id           name  count_orders  sum_amount  days_since_last_order
# 0            1  Alice Johnson             6      424.98                      5
# 1            2      Bob Smith             3      333.49                      3
# ...
# (recency here assumes the query ran on 2025-03-15)

print(df.describe())
# Shows count, mean, std, min, quartiles, and max for each numeric column.
# Exact values depend on the day you run the query (recency shifts daily).
# The thing to notice is the scale gap: sum_amount runs into the hundreds
# while count_orders stays in single digits, which is why Step 3 scales them.

No database? Use CSV

Export the SQL result to CSV and load with df = pd.read_csv('rfm_data.csv'). The rest of the pipeline stays the same.
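No CSV handy either? A hedged fallback is to fabricate a plausible RFM table directly in pandas. The column names match the SQL output; the value ranges below are assumptions chosen to roughly match the sample data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the synthetic data is reproducible
n = 50

# Synthetic stand-in for the SQL result: same columns, made-up values
df = pd.DataFrame({
    'customer_id': np.arange(1, n + 1),
    'count_orders': rng.integers(0, 13, size=n),            # 0-12 orders
    'sum_amount': rng.uniform(0, 500, size=n).round(2),     # total spend in dollars
    'days_since_last_order': rng.integers(0, 340, size=n),  # recency in days
})

print(df.shape)   # (50, 4)
print(df.dtypes)
```

The rest of the pipeline (scaling, elbow, K-Means, profiling) runs on this df unchanged; only the segment profiles will look different from the SQL-backed version.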

What happened

  • We loaded the RFM query result into a pandas DataFrame.
  • df.head() shows the first 5 rows; df.describe() shows mean, std, min, max for each numeric column β€” useful to spot scale differences (monetary in hundreds vs recency in days).

Step 3: Normalize Features

Use StandardScaler so K-Means distance isn't dominated by one feature

Why normalize?

K-Means uses Euclidean distance between points. If recency is in days (0–365), monetary in dollars (0–2000), and frequency in counts (0–15), the monetary values will dominate: a 500-dollar difference is huge compared to a 30-day difference. StandardScaler puts all features on the same scale (mean=0, std=1) so each dimension contributes fairly to the distance.
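To see the dominance problem concretely, compare the distance between two made-up customers before and after z-scoring (all numbers here are illustrative, not from the sample dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: (recency_days, count_orders, monetary_dollars)
a = np.array([10.0, 5.0, 1500.0])
b = np.array([40.0, 4.0, 1000.0])

# Raw Euclidean distance: the $500 gap swamps the 30-day and 1-order gaps
raw_dist = np.linalg.norm(a - b)
print(round(raw_dist, 1))  # 500.9 -- almost entirely the monetary difference

# Z-score a tiny 4-customer sample, then measure the same pair again
X = np.array([[10, 5, 1500], [40, 4, 1000], [200, 1, 60], [350, 0, 0]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(round(scaled_dist, 2))  # 0.95 -- all three features now contribute comparably
```

Before scaling, the squared distance is 250,901, of which 250,000 comes from the dollar gap alone; after scaling, recency, frequency, and monetary each contribute the same order of magnitude.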

3a. StandardScaler code

Python
from sklearn.preprocessing import StandardScaler

# Features for clustering (exclude customer_id and name)
feature_cols = ['days_since_last_order', 'count_orders', 'sum_amount']
X = df[feature_cols].copy()

# Handle any missing values (e.g. fill with 0 for customers with no orders)
X = X.fillna(0)

# Fit and transform
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Optional: put back into a DataFrame for readability
X_scaled_df = pd.DataFrame(X_scaled, columns=feature_cols, index=df.index)

print("Before scaling (sample):")
print(X.head())
#    days_since_last_order  count_orders  sum_amount
# 0                      5             6      424.98
# 1                      3             3      333.49
# ...

print("After scaling (sample):")
print(X_scaled_df.head())
# Values are now z-scores centered on 0 (exact numbers depend on your data):
# recent buyers get negative recency scores, and frequent, high-spending
# customers get positive frequency and monetary scores.

3b. Before/after comparison

Run X.describe() vs X_scaled_df.describe(). After scaling, mean β‰ˆ 0 and std β‰ˆ 1 for each column. That's exactly what we want.

What happened

  • StandardScaler subtracted the mean and divided by the standard deviation for each feature.
  • All three columns now have similar scale, so K-Means won't be biased toward the feature with the largest raw values.
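If you want to convince yourself what StandardScaler actually did, you can reproduce it by hand. This quick check (with made-up RFM values) is not part of the pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up RFM sample
X = pd.DataFrame({
    'days_since_last_order': [5, 30, 120, 300],
    'count_orders': [7, 4, 2, 1],
    'sum_amount': [550.0, 300.0, 110.0, 40.0],
})

scaled = StandardScaler().fit_transform(X)

# Same formula by hand: (x - mean) / population std (ddof=0, which sklearn uses)
manual = (X - X.mean()) / X.std(ddof=0)

print(np.allclose(scaled, manual.values))  # True
```

Note the ddof=0: StandardScaler divides by the population standard deviation, so pandas' default sample std (ddof=1) would not match exactly.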

Step 4: Choose K with the Elbow Method

Loop K=2 to 8, compute inertia, plot the elbow

Why choose K?

K-Means needs you to specify the number of clusters (K). Too few: segments are too broad. Too many: overfitting, hard to act on. The elbow method plots inertia (within-cluster sum of squares) vs K. Where the curve "bends" (elbow), adding more clusters doesn't help much β€” that's a good K.

4a. Elbow plot code

Python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 9)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of clusters (K)', fontsize=12)
plt.ylabel('Inertia (within-cluster sum of squares)', fontsize=12)
plt.title('Elbow Method for Optimal K', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xticks(K_range)
plt.tight_layout()
plt.savefig('elbow_plot.png', dpi=150, bbox_inches='tight')
plt.show()

How to read the elbow plot

Inertia always decreases as K increases (more clusters = tighter groups). Look for the "elbow": the point where the curve flattens. Often it's around K=4 or 5 for RFM. Before the elbow: big drops (adding a cluster helps). After the elbow: small drops (diminishing returns). Pick the elbow K (e.g. 4) for your final model.
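Eyeballing the bend works, but you can also pick K programmatically. One simple heuristic, sketched below on synthetic blobs standing in for the scaled RFM matrix, is to stop at the first K where adding a cluster improves inertia by less than some threshold; the 15% cutoff is an assumption to tune:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated synthetic groups stand in for the scaled RFM features
centers = [(0, 0), (8, 0), (0, 8), (8, 8)]
X_demo, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.6, random_state=42)

inertias = {}
for k in range(2, 9):
    inertias[k] = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo).inertia_

def pick_k(inertias, min_rel_drop=0.15):
    """Return the smallest K after which the relative inertia drop falls below the threshold."""
    ks = sorted(inertias)
    for k in ks[:-1]:
        rel_drop = (inertias[k] - inertias[k + 1]) / inertias[k]
        if rel_drop < min_rel_drop:
            return k
    return ks[-1]

print(pick_k(inertias))  # 4 for these well-separated blobs
```

Treat this as a sanity check on your visual read, not a replacement for it; on messier real data the silhouette score is a common complementary criterion.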

What happened

  • We ran K-Means for K=2,3,…,8 and stored inertia for each.
  • The plot shows where adding clusters stops helping. Use that K (often 4) for the next step.

Step 5: Run K-Means

Fit the model and assign segment labels

What we're doing

We fit KMeans with the chosen K (e.g. 4), get cluster labels for each row, and add them to our DataFrame. Labels are 0, 1, 2, 3 β€” we'll give them business names in the next step.

5a. Fit and assign labels

Python
# Use K=4 (adjust if your elbow suggested differently)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Assign segment (cluster) labels to each customer
df['segment'] = kmeans.labels_

# Show first 10 rows with segment
print(df[['customer_id', 'name', 'count_orders', 'sum_amount', 'days_since_last_order', 'segment']].head(10))
#    customer_id           name  count_orders  sum_amount  days_since_last_order  segment
# 0            1  Alice Johnson             6      424.98                      5        0
# 1            2      Bob Smith             3      333.49                      3        0
# 2            3  Charlie Davis             2       44.49                    287        1
# 3            4      Diana Lee             6     1420.00                      1        0
# 4            5   Eve Martinez             2      109.99                     42        3
# ...
# (recency assumes a 2025-03-15 run date; your cluster IDs may differ)

Tip

random_state=42 ensures reproducible results. n_init=10 runs 10 different initializations and picks the best β€” helps avoid bad local minima.
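A quick way to see what these arguments buy you: two fits with the same random_state produce identical labels and inertia. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled RFM matrix
X_demo, _ = make_blobs(n_samples=150, centers=4, random_state=0)

km1 = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_demo)
km2 = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_demo)

# Same seed -> same initializations -> identical results
print(np.array_equal(km1.labels_, km2.labels_))   # True
print(np.isclose(km1.inertia_, km2.inertia_))     # True
```

Without a fixed random_state, the cluster IDs (and occasionally the partition itself) can change between runs, which matters if downstream code hard-codes segment numbers.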

What happened

  • KMeans partitioned customers into 4 groups based on their scaled RFM values.
  • Each customer now has a segment (0–3). Next we'll profile each segment and give them names.

Step 6: Profile Segments

Name each cluster and describe who they are

What we're doing

We compute mean recency, frequency, monetary per segment. Then we map raw cluster IDs (0,1,2,3) to business names (Champions, At Risk, New, Lost) and write 2–3 sentences describing each segment so marketing knows how to treat them.

6a. Group by segment and compute means

Python
# Mean RFM per segment
profile = df.groupby('segment')[['days_since_last_order', 'count_orders', 'sum_amount']].mean()

print(profile)
#          days_since_last_order  count_orders  sum_amount
# segment
# 0                          8.2           5.1       285.3   <- recent, frequent, high spend = Champions
# 1                        245.3           1.2        62.1   <- long gone, rare, low spend = Lost
# 2                        105.6           2.5       112.8   <- medium recency, slipping = At Risk
# 3                         42.1           1.8        85.4   <- recent but few orders = New

6b. Map segments to names and describe

Based on the profile, map segment IDs to names. The mapping depends on your data β€” adjust the order if your clusters differ.

Python
# Map segment IDs to names (adjust 0,1,2,3 based on your profile output)
# Typical interpretation: low recency + high freq + high monetary = Champions
segment_names = {
    0: 'Champions',   # Best customers: recent, frequent, high spend
    1: 'Lost',        # Haven't bought in a long time, low engagement
    2: 'At Risk',     # Used to buy, slipping away (medium recency, declining)
    3: 'New'         # Recent but few orders β€” potential to nurture
}

df['segment_name'] = df['segment'].map(segment_names)

# Segment descriptions for documentation
descriptions = {
    'Champions': 'Your best customers. They buy recently, often, and spend a lot. Reward them with loyalty perks, early access, and exclusive offers. Don\'t over-discount β€” they\'re already engaged.',
    'At Risk': 'Previously active customers who are slipping away. Their days since last purchase keep climbing. Win them back with personalized win-back campaigns, surveys, or special incentives.',
    'New': 'Recent buyers with few orders so far. Great opportunity to nurture. Onboarding emails, cross-sell, and second-purchase incentives can turn them into Champions.',
    'Lost': 'Haven\'t purchased in a long time and show low engagement. Either try a re-engagement campaign (last chance offer) or deprioritize β€” cost per acquisition may not be worth it.'
}

for name, desc in descriptions.items():
    print(f"\n{name}: {desc}")

What happened

  • groupby('segment').mean() gave us average RFM per cluster.
  • We mapped numeric labels to business names and wrote action-oriented descriptions. Marketing can now use segment_name for campaign targeting.
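One caveat: hard-coding the ID-to-name mapping breaks if a re-run shuffles cluster IDs. A hedged alternative is to derive names from the profile itself; the rule thresholds below are assumptions to tune for your data:

```python
import pandas as pd

# Example profile (mean RFM per cluster) -- numbers are illustrative
profile = pd.DataFrame({
    'days_since_last_order': [8.2, 245.3, 105.6, 42.1],
    'count_orders': [5.1, 1.2, 2.5, 1.8],
    'sum_amount': [285.3, 62.1, 112.8, 85.4],
}, index=pd.Index([0, 1, 2, 3], name='segment'))

def name_segment(row, recent_days=60, stale_days=180, high_freq=3):
    """Rule-of-thumb naming from mean RFM; thresholds are assumptions to adjust."""
    if row['days_since_last_order'] <= recent_days and row['count_orders'] >= high_freq:
        return 'Champions'     # recent AND frequent
    if row['days_since_last_order'] > stale_days:
        return 'Lost'          # long gone regardless of past spend
    if row['days_since_last_order'] <= recent_days:
        return 'New'           # recent but not yet frequent
    return 'At Risk'           # in between: slipping away

segment_names = profile.apply(name_segment, axis=1).to_dict()
print(segment_names)  # {0: 'Champions', 1: 'Lost', 2: 'At Risk', 3: 'New'}
```

With rules like these, re-running K-Means (or refreshing the data monthly) keeps the business names stable even when the raw cluster IDs move around.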

Step 7: Save and Visualize

Export to CSV or database, optional charts

What we're doing

Persist the segmented data so downstream tools (Metabase, CRM, email platforms) can use it. We'll save to CSV or insert into a database table. Optionally add visualizations: a bar chart of segment sizes and a scatter plot of recency vs monetary colored by segment.

7a. Save to CSV or database

Python
# Save to CSV
df.to_csv('customer_segments.csv', index=False)

# Or save back to PostgreSQL
# from sqlalchemy import create_engine
# engine = create_engine('postgresql://user:pass@localhost:5432/db')
# df[['customer_id', 'segment', 'segment_name']].to_sql('customer_segments', engine, if_exists='replace', index=False)

7b. Bar chart of segment sizes

Python
import matplotlib.pyplot as plt

# Segment counts
seg_counts = df['segment_name'].value_counts()
plt.figure(figsize=(8, 5))
plt.bar(seg_counts.index, seg_counts.values, color=['#4f46e5', '#7c3aed', '#a78bfa', '#c4b5fd'])
plt.xlabel('Segment', fontsize=12)
plt.ylabel('Number of customers', fontsize=12)
plt.title('Customer Segment Sizes', fontsize=14)
plt.tight_layout()
plt.savefig('segment_sizes.png', dpi=150, bbox_inches='tight')
plt.show()

7c. Seaborn scatter: recency vs monetary colored by segment

Python
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df,
    x='days_since_last_order',
    y='sum_amount',
    hue='segment_name',
    palette={'Champions': '#4f46e5', 'At Risk': '#f59e0b', 'New': '#10b981', 'Lost': '#64748b'},
    s=80,
    alpha=0.8
)
plt.xlabel('Days since last order (Recency)', fontsize=12)
plt.ylabel('Total spend (Monetary)', fontsize=12)
plt.title('Recency vs Monetary by Segment', fontsize=14)
plt.legend(title='Segment', loc='upper right')
plt.tight_layout()
plt.savefig('rfm_scatter.png', dpi=150, bbox_inches='tight')
plt.show()

Reading the scatter plot

Champions cluster toward the upper left (few days since last order, high total spend). Lost customers sit toward the lower right (long gaps since purchase, low spend). At Risk and New fall in between.

Project complete!

You built a full customer segmentation pipeline:

  • RFM features computed in SQL
  • Data loaded, normalized, and clustered with K-Means
  • Segments profiled and named (Champions, At Risk, New, Lost)
  • Results saved to CSV and visualized

What's next?

  • Connect Metabase to your customer_segments table and build a dashboard
  • Add more features (tenure, product category preference) and re-run clustering
  • Try Marketing Analytics Pipeline to automate campaign ROI
  • Schedule a monthly refresh: cron job or Airflow DAG that runs the SQL β†’ Python pipeline and updates the segment table
