E-Commerce Data Warehouse
Build a medallion architecture (Bronze → Silver → Gold) for orders, customers, and products. Learn how raw data becomes analytics-ready tables.
The Scenario
Imagine this…
You just got hired as a junior data engineer at "ShopEasy", a small e-commerce company. They sell electronics, clothing, and home goods online. Every day, thousands of orders come in. Customer data lives in one system, product info in another, and order records in a third. Your boss asks: "How much did we sell last month by region?" and nobody can answer quickly because the data is scattered and messy.
Your job: build a data warehouse, one central place where all this data is cleaned, organized, and ready for dashboards and reports. That's exactly what we're going to build together, step by step.
What You'll Build
- Bronze layer: Raw tables for orders, customers, and products, exactly as the data comes from source systems (CSV files).
- Silver layer: Cleaned versions with correct data types, no duplicates, and validated foreign keys.
- Gold layer: Business aggregates like sales_by_region, top_products, customer_lifetime_value.
- DBT models: SQL transformations with documentation and tests (uniqueness, not-null).
- Airflow DAG: Orchestration that runs the pipeline daily on autopilot.
In plain English
Think of it like a kitchen. Bronze = raw groceries straight from the store (uncut, unwashed). Silver = washed, chopped, prepped ingredients. Gold = finished dishes ready to serve. The "warehouse" is the kitchen where this all happens, DBT is the recipe book, and Airflow is the timer that tells you when to cook.
Prerequisites
Before you start, make sure you're comfortable with these. If not, we link to the exact courses below; finish those first, then come back.
| Skill | What you should know | Course |
| --- | --- | --- |
| Basic SQL | SELECT, WHERE, JOIN, GROUP BY, ORDER BY. You should be able to write a query that joins two tables and counts rows. | PostgreSQL Course |
| Command line basics | Open a terminal, navigate folders (cd), run commands. Nothing fancy. | - |
| DBT intro | What DBT is, how to run dbt run and dbt test. You'll learn the rest here. | DBT Course |
| Airflow intro | What a DAG is, how to define tasks. You don't need to be an expert; we'll walk through it. | Airflow Course |
Step-by-Step Plan
Here's the high-level roadmap. The Build It tab walks you through every single step with copy-paste code, screenshots, and explanations.
1. Set up the warehouse. Install PostgreSQL (or connect to Redshift), create schemas for bronze, silver, and gold.
2. Load raw data (Bronze). Download sample CSVs (orders, customers, products) and load them into bronze tables using SQL COPY or a Python script.
3. Initialize DBT. Create a new DBT project, configure the connection to your database, and verify it works.
4. Build Silver models. Write DBT SQL models that clean, deduplicate, and type-cast each bronze table into silver.
5. Build Gold models. Create business-level aggregates (sales_by_region, customer_lifetime_value) from silver tables.
6. Add DBT tests. Write schema tests (unique, not_null) and a custom test to ensure referential integrity.
7. Orchestrate with Airflow. Create a DAG that runs dbt run then dbt test on a daily schedule.
8. Verify & celebrate. Query your Gold tables, check the pipeline ran, and admire what you built!
Set Up the Warehouse
Install PostgreSQL and create schemas
What we're doing and why
A warehouse needs rooms. We create three "schemas" (think: rooms) inside one database: bronze (raw data dump), silver (cleaned data), and gold (business-ready tables). This separation keeps things organized so you always know what's raw and what's ready.
1a. Install PostgreSQL
If you already have Postgres, skip ahead. Otherwise:
- Mac: Open a terminal and run brew install postgresql@15, then brew services start postgresql@15.
- Windows: Download the installer from postgresql.org/download and follow the wizard. Use the default port 5432 and remember your password.
- Linux: sudo apt install postgresql postgresql-contrib.
Tip
Don't have admin access? Use a free cloud database like Neon or Supabase; both give you a Postgres database in seconds with zero install.
1b. Create the database and schemas
Open a terminal (or pgAdmin, DBeaver, etc.) and connect to Postgres. Then run:
-- Create a dedicated database for our project
CREATE DATABASE shopeasy_warehouse;

-- Connect to it (in psql: \c shopeasy_warehouse)

-- Create three schemas: one per layer
CREATE SCHEMA bronze;
CREATE SCHEMA silver;
CREATE SCHEMA gold;
What just happened
We made a new database called shopeasy_warehouse. Inside it, we made three "folders" (schemas). bronze will hold raw CSVs. silver will hold cleaned versions. gold will hold the final aggregated tables that dashboards read from. This is the medallion pattern used at companies like Netflix and Airbnb.
1c. Verify
Run this to see your schemas:
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name IN ('bronze', 'silver', 'gold');
You should see three rows: bronze, silver, gold. If yes, step 1 is done!
Step 1 complete!
- PostgreSQL installed and running
- Database shopeasy_warehouse created
- Three schemas: bronze, silver, gold
Load Raw Data (Bronze Layer)
Download sample CSVs and load into bronze tables
What we're doing and why
We need data to work with! In real life, data comes from apps, APIs, or file drops. For this project, we'll use three sample CSV files that mimic an e-commerce system: orders, customers, and products. We load them "as-is" into bronze, with no cleaning yet. Bronze = raw groceries.
2a. Download the sample CSVs
Create a folder on your computer (e.g. ~/shopeasy_data/) and save these three files there. You can create them yourself or copy-paste the content below.
customers.csv
customer_id,name,email,region,created_at
1,Alice Johnson,alice@email.com,North,2024-01-15
2,Bob Smith,bob@email.com,South,2024-02-20
3,Charlie Davis,charlie@email.com,East,2024-01-10
4,Diana Lee,diana@email.com,West,2024-03-05
5,Eve Martinez,eve@email.com,North,2024-04-12
6,Frank Brown,frank@email.com,South,2024-05-01
7,Grace Wilson,grace@email.com,East,2024-02-28
8,Hank Taylor,hank@email.com,West,2024-06-15
9,Ivy Chen,ivy@email.com,North,2024-07-20
10,Jack White,jack@email.com,South,2024-08-01
products.csv
product_id,product_name,category,price
101,Wireless Mouse,Electronics,29.99
102,USB-C Hub,Electronics,49.99
103,Cotton T-Shirt,Clothing,19.99
104,Running Shoes,Clothing,89.99
105,Desk Lamp,Home,34.99
106,Coffee Maker,Home,79.99
107,Bluetooth Speaker,Electronics,59.99
108,Yoga Mat,Home,24.99
orders.csv
order_id,customer_id,product_id,quantity,amount,order_date,status
1001,1,101,2,59.98,2024-06-01,completed
1002,2,103,1,19.99,2024-06-02,completed
1003,3,106,1,79.99,2024-06-03,completed
1004,1,105,1,34.99,2024-06-05,completed
1005,4,102,3,149.97,2024-06-07,pending
1006,5,107,1,59.99,2024-06-10,completed
1007,6,104,2,179.98,2024-06-12,completed
1008,7,108,1,24.99,2024-06-15,returned
1009,8,101,1,29.99,2024-06-18,completed
1010,9,103,4,79.96,2024-06-20,completed
1011,10,106,1,79.99,2024-06-22,completed
1012,1,107,1,59.99,2024-06-25,completed
1013,3,102,2,99.98,2024-06-28,pending
1014,5,105,1,34.99,2024-07-01,completed
1015,2,104,1,89.99,2024-07-03,completed
2b. Create bronze tables
-- Bronze tables: store raw data exactly as it comes
CREATE TABLE bronze.customers (
customer_id INTEGER,
name TEXT,
email TEXT,
region TEXT,
created_at TEXT -- text on purpose: raw = no type assumptions
);
CREATE TABLE bronze.products (
product_id INTEGER,
product_name TEXT,
category TEXT,
price TEXT -- raw price as text
);
CREATE TABLE bronze.orders (
order_id INTEGER,
customer_id INTEGER,
product_id INTEGER,
quantity INTEGER,
amount TEXT, -- raw amount as text
order_date TEXT, -- raw date as text
status TEXT
);
Why TEXT for everything?
In the bronze layer we keep data as raw as possible β even numbers like price come in as text. Why? Because source systems can send weird stuff: "$29.99" with a dollar sign, "N/A" instead of a number, dates in different formats. By storing as TEXT first, nothing breaks on load. We fix types later in silver. This is a best practice used in production pipelines.
2c. Load CSV files into bronze
Using psql (replace the path with where you saved your CSVs):
\COPY bronze.customers FROM '/path/to/customers.csv' WITH (FORMAT csv, HEADER true);
\COPY bronze.products FROM '/path/to/products.csv' WITH (FORMAT csv, HEADER true);
\COPY bronze.orders FROM '/path/to/orders.csv' WITH (FORMAT csv, HEADER true);
Tip: Using DBeaver or pgAdmin?
Right-click the table, choose Import Data, and pick your CSV. Make sure "Header" is checked. Same result, just with a GUI instead of commands.
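Prefer Python over psql? The plan in step 2 mentioned a script as an alternative to \COPY. Here's a minimal sketch using psycopg2; the paths and connection details are placeholders, so adjust them to your setup:

import psycopg2  # pip install psycopg2-binary

# Placeholder connection details; match them to your database
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="shopeasy_warehouse",
    user="your_username", password="your_password",
)

with conn, conn.cursor() as cur:
    for table in ["customers", "products", "orders"]:
        with open(f"/path/to/{table}.csv") as f:  # your CSV folder
            # copy_expert streams the file through COPY, like \COPY in psql
            cur.copy_expert(
                f"COPY bronze.{table} FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
conn.close()
print("Bronze load complete")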
2d. Verify the load
SELECT 'customers' AS table_name, COUNT(*) AS rows FROM bronze.customers
UNION ALL
SELECT 'products', COUNT(*) FROM bronze.products
UNION ALL
SELECT 'orders', COUNT(*) FROM bronze.orders;
Expected: customers = 10, products = 8, orders = 15. If your counts match, the bronze layer is done!
Step 2 complete!
- Three CSV files created / downloaded
- Bronze tables created with TEXT columns
- Data loaded and verified (10 + 8 + 15 rows)
Initialize DBT Project
Set up DBT and connect it to your database
What is DBT?
DBT (Data Build Tool) lets you write SQL files, and it runs them in the right order against your database. Think of it as a recipe book for your data: each SQL file is a recipe that says "take this table, transform it, and save the result as a new table." DBT also lets you add tests (e.g. "order_id must be unique") and documentation.
3a. Install DBT
pip install dbt-postgres
This installs DBT with the Postgres adapter. If using Redshift, run pip install dbt-redshift instead.
3b. Create a new DBT project
dbt init shopeasy_dbt
cd shopeasy_dbt
DBT will ask you which database you want to use; pick postgres. It creates a folder structure like this:
shopeasy_dbt/
├── dbt_project.yml    <-- project config
├── models/            <-- your SQL files go here
│   └── example/
├── tests/
└── macros/
profiles.yml           <-- database connection (lives in ~/.dbt/, not in the project)
3c. Configure the connection
Open ~/.dbt/profiles.yml (DBT creates it during init) and make sure it looks like this:
shopeasy_dbt:
target: dev
outputs:
dev:
type: postgres
host: localhost
port: 5432
user: your_username
password: your_password
dbname: shopeasy_warehouse
schema: silver # DBT writes to silver by default
threads: 4
Common mistake
If you get "connection refused", make sure PostgreSQL is running (brew services list on Mac, sudo systemctl status postgresql on Linux). Also double-check the username and password.
3d. Verify the connection
dbt debug
If everything is green (all checks passed), your DBT project is connected to the database!
3e. Set up the models folder
Delete the example folder and create our structure:
rm -rf models/example
mkdir -p models/silver
mkdir -p models/gold
Step 3 complete!
- DBT installed (dbt-postgres)
- Project initialized (shopeasy_dbt)
- Connection verified with dbt debug
- Folder structure ready: models/silver, models/gold
Build Silver Models (Clean Layer)
Write DBT models to clean and type-cast bronze data
What's a "model" in DBT?
A model is just a .sql file. When you run dbt run, DBT executes each file and saves the result as a table (or view) in your database. The file name becomes the table name. So silver_customers.sql creates a table called silver_customers.
4a. silver_customers.sql
Create the file models/silver/silver_customers.sql:
{{ config(materialized='table', schema='silver') }}
SELECT
customer_id,
TRIM(name) AS customer_name,
LOWER(TRIM(email)) AS email,
INITCAP(TRIM(region)) AS region,
CAST(created_at AS DATE) AS created_at
FROM {{ source('bronze', 'customers') }}
WHERE customer_id IS NOT NULL
Line-by-line explanation
- config(materialized='table'): tells DBT to save the result as a real table (not just a view). schema='silver': save it in the silver schema.
- TRIM(name): removes spaces from the start and end (e.g. " Alice " → "Alice").
- LOWER(email): makes the email lowercase (e.g. "Bob@Email.COM" → "bob@email.com").
- INITCAP(region): capitalizes the first letter (e.g. "north" → "North").
- CAST(… AS DATE): converts the text date to a proper DATE type.
- WHERE customer_id IS NOT NULL: drops any rows where customer_id is missing; bad data shouldn't reach silver.
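One gotcha before you run this: by default DBT combines the target schema from profiles.yml with a model's custom schema, so schema='silver' on top of a target schema of silver creates tables in silver_silver (and the gold models would land in silver_gold). The standard fix is to override the generate_schema_name macro so a custom schema is used as-is. Save this as macros/generate_schema_name.sql in your project:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- Use the custom schema verbatim; fall back to the target schema -#}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

With this in place, silver and gold models land in the schemas you created in step 1.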
4b. silver_products.sql
{{ config(materialized='table', schema='silver') }}
SELECT
product_id,
TRIM(product_name) AS product_name,
INITCAP(TRIM(category)) AS category,
CAST(price AS DECIMAL(10,2)) AS price
FROM {{ source('bronze', 'products') }}
WHERE product_id IS NOT NULL
AND price IS NOT NULL
4c. silver_orders.sql
{{ config(materialized='table', schema='silver') }}
SELECT
order_id,
customer_id,
product_id,
quantity,
CAST(amount AS DECIMAL(10,2)) AS amount,
CAST(order_date AS DATE) AS order_date,
LOWER(TRIM(status)) AS status
FROM {{ source('bronze', 'orders') }}
WHERE order_id IS NOT NULL
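Step 4's plan promised deduplication, but the models above don't dedupe yet, and bronze can easily pick up duplicates (for example, if you run \COPY twice). Here's a sketch of the standard pattern, using row_number() to keep one row per order_id, which you could fold into silver_orders:

-- Keep exactly one row per order_id; which duplicate survives is arbitrary here
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_id) AS rn
    FROM {{ source('bronze', 'orders') }}
)
SELECT order_id, customer_id, product_id, quantity, amount, order_date, status
FROM ranked
WHERE rn = 1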
4d. Define the bronze source
DBT needs to know where bronze tables live. Create models/silver/sources.yml:
version: 2
sources:
- name: bronze
schema: bronze
tables:
- name: customers
- name: products
- name: orders
4e. Run the Silver models
dbt run --select silver
You should see three models pass. Check in psql:
SELECT * FROM silver.silver_customers LIMIT 5;
SELECT * FROM silver.silver_orders LIMIT 5;
Step 4 complete!
- Three silver models created and executed
- Data types are now correct (DATE, DECIMAL)
- Whitespace trimmed, emails lowercased, regions capitalized
- Null key rows filtered out
Build Gold Models (Business Layer)
Create aggregated tables for dashboards and reports
What are Gold models?
Gold tables answer business questions directly. Instead of "give me all orders and I'll figure it out", Gold gives you "here's total revenue by region, already calculated." Dashboards and analysts read from Gold; they never touch Bronze or Silver.
5a. gold_sales_by_region.sql
{{ config(materialized='table', schema='gold') }}
SELECT
c.region,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.amount) AS total_revenue,
ROUND(AVG(o.amount), 2) AS avg_order_value
FROM {{ ref('silver_orders') }} o
JOIN {{ ref('silver_customers') }} c
ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.region
ORDER BY total_revenue DESC
What this gives you
One row per region (North, South, East, West) with: how many orders, how many unique customers, total revenue, and average order value. This is exactly what a "Sales by Region" dashboard card would read from.
5b. gold_top_products.sql
{{ config(materialized='table', schema='gold') }}
SELECT
p.product_id,
p.product_name,
p.category,
COUNT(o.order_id) AS times_ordered,
SUM(o.quantity) AS total_units_sold,
SUM(o.amount) AS total_revenue
FROM {{ ref('silver_orders') }} o
JOIN {{ ref('silver_products') }} p
ON o.product_id = p.product_id
WHERE o.status = 'completed'
GROUP BY p.product_id, p.product_name, p.category
ORDER BY total_revenue DESC
5c. gold_customer_lifetime_value.sql
{{ config(materialized='table', schema='gold') }}
SELECT
c.customer_id,
c.customer_name,
c.region,
COUNT(o.order_id) AS total_orders,
SUM(o.amount) AS lifetime_value,
MIN(o.order_date) AS first_order,
MAX(o.order_date) AS last_order
FROM {{ ref('silver_customers') }} c
LEFT JOIN {{ ref('silver_orders') }} o
ON c.customer_id = o.customer_id
AND o.status = 'completed'
GROUP BY c.customer_id, c.customer_name, c.region
ORDER BY lifetime_value DESC NULLS LAST
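One nuance here: because of the LEFT JOIN, a customer with no completed orders (like Diana, whose only order is pending) gets a NULL lifetime_value, which is why the ORDER BY says NULLS LAST. If you'd rather report 0, wrap the sum in COALESCE:

-- Optional: show 0 instead of NULL for customers with no completed orders
COALESCE(SUM(o.amount), 0) AS lifetime_value,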
5d. Run Gold models
dbt run --select gold
Step 5 complete!
- gold.gold_sales_by_region: revenue, orders, and average order value by region
- gold.gold_top_products: best-selling products by revenue
- gold.gold_customer_lifetime_value: each customer's total spend
Add DBT Tests
Make sure your data is correct with automated tests
Why test data?
Imagine you shipped a dashboard and the CEO sees duplicate orders inflating revenue. Embarrassing! Tests catch that before it happens. DBT tests run SQL checks: "is order_id unique?", "is customer_id never null?", etc. If a test fails, the pipeline stops and alerts you.
6a. Schema tests
Create models/silver/schema.yml:
version: 2
models:
- name: silver_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: email
tests:
- unique
- not_null
- name: silver_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('silver_customers')
field: customer_id
- name: silver_products
columns:
- name: product_id
tests:
- unique
- not_null
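The plan also called for a custom test. In DBT, a "singular" test is just a SQL file in the tests/ folder; it passes when the query returns zero rows. Here's a sketch that checks every order points at a real product (the file name is our choice), saved as tests/assert_orders_have_valid_products.sql:

-- Fails if any order references a product_id missing from silver_products
SELECT o.order_id, o.product_id
FROM {{ ref('silver_orders') }} o
LEFT JOIN {{ ref('silver_products') }} p
    ON o.product_id = p.product_id
WHERE p.product_id IS NULL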
6b. Run the tests
dbt test
All tests should pass. If one fails, DBT tells you exactly which test and which rows failed; fix the data or the model and re-run.
Step 6 complete!
- Uniqueness tests on primary keys
- Not-null tests on critical columns
- Referential integrity test (orders → customers)
Orchestrate with Airflow
Automate the pipeline to run daily
Why Airflow?
Right now you run dbt run by hand. In production, the pipeline should run automatically every day at 6 AM (before anyone checks dashboards). Airflow is the tool that schedules and monitors these runs. If something fails, it retries and sends you an alert.
7a. Create the DAG file
Save this as dags/shopeasy_pipeline.py in your Airflow home:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
with DAG(
dag_id='shopeasy_warehouse',
start_date=datetime(2024, 1, 1),
schedule_interval='@daily',
catchup=False,
tags=['dbt', 'warehouse'],
) as dag:
dbt_run = BashOperator(
task_id='dbt_run',
bash_command='cd /path/to/shopeasy_dbt && dbt run',
)
dbt_test = BashOperator(
task_id='dbt_test',
bash_command='cd /path/to/shopeasy_dbt && dbt test',
)
dbt_run >> dbt_test
What this DAG does
schedule_interval='@daily': Runs once per day. dbt_run: Executes all your models (silver + gold). dbt_test: Runs all tests after the models finish. The >> means "run dbt_test only after dbt_run succeeds." If dbt_run fails, dbt_test doesn't run.
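The intro above mentioned retries and alerts, but the DAG as written has neither. A sketch of how you could add them with default_args; the email address is a placeholder, and email alerts only work once SMTP is configured in Airflow:

from datetime import timedelta

default_args = {
    'retries': 2,                         # re-run a failed task up to twice
    'retry_delay': timedelta(minutes=5),  # wait 5 minutes between attempts
    'email_on_failure': True,             # requires [smtp] settings in airflow.cfg
    'email': ['data-team@example.com'],   # placeholder address
}

# then pass it in: with DAG(..., default_args=default_args) as dag: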
Don't have Airflow installed?
For learning purposes, you can skip this step and run dbt run && dbt test manually. When you're ready, check our Airflow course for full setup instructions.
Step 7 complete!
- Airflow DAG created with two tasks
- dbt_run → dbt_test (sequential)
- Runs daily automatically
Verify & Celebrate!
Query your Gold tables and see the finished product
Let's query the Gold tables and see what we built:
Sales by region
SELECT * FROM gold.gold_sales_by_region;
Top products
SELECT * FROM gold.gold_top_products;
Customer lifetime value
SELECT * FROM gold.gold_customer_lifetime_value;
Project complete!
You just built a full data warehouse with the medallion pattern! Here's what you can put on your resume:
- Designed and implemented a 3-layer medallion data warehouse (Bronze/Silver/Gold)
- Built DBT models with automated tests for data quality
- Created an Airflow DAG for daily pipeline orchestration
- Produced business-ready analytics tables (sales by region, CLV, top products)
What's next?
- Connect Metabase to your Gold schema and build a dashboard
- Add more source tables (e.g. website clicks, inventory)
- Try the Real-Time Sales Dashboard project next
- Add incremental models in DBT (process only new rows each day)
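To get you started on that last item, here's a minimal sketch of an incremental version of silver_orders. The first run builds the whole table; after that, the is_incremental() block only processes rows newer than what's already loaded (assuming order_date is a reliable watermark):

{{ config(materialized='incremental', unique_key='order_id', schema='silver') }}

SELECT
    order_id,
    customer_id,
    product_id,
    quantity,
    CAST(amount AS DECIMAL(10,2)) AS amount,
    CAST(order_date AS DATE) AS order_date,
    LOWER(TRIM(status)) AS status
FROM {{ source('bronze', 'orders') }}
WHERE order_id IS NOT NULL
{% if is_incremental() %}
  -- only rows newer than the latest order_date already in this table
  AND CAST(order_date AS DATE) > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}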