E-Commerce Data Warehouse
Build a medallion architecture (Bronze → Silver → Gold) for orders, customers, and products. Learn how raw data becomes analytics-ready tables.
The Scenario
Imagine this…
You just got hired as a junior data engineer at "ShopEasy", a small e-commerce company. They sell electronics, clothing, and home goods online. Every day, thousands of orders come in. Customer data lives in one system, product info in another, and order records in a third. Your boss asks: "How much did we sell last month by region?" and nobody can answer quickly because the data is scattered and messy.
Your job: build a data warehouse, one central place where all this data is cleaned, organized, and ready for dashboards and reports. That's exactly what we're going to build together, step by step.
What You'll Build
- Bronze layer: Raw tables for orders, customers, and products, exactly as the data comes from source systems (CSV files).
- Silver layer: Cleaned versions with correct data types, no duplicates, and validated foreign keys.
- Gold layer: Business aggregates like sales_by_region, top_products, customer_lifetime_value.
- DBT models: SQL transformations with documentation and tests (uniqueness, not-null).
- Airflow DAG: Orchestration that runs the pipeline daily on autopilot.
In plain English
Think of it like a kitchen. Bronze = raw groceries straight from the store (uncut, unwashed). Silver = washed, chopped, prepped ingredients. Gold = finished dishes ready to serve. The "warehouse" is the kitchen where this all happens, DBT is the recipe book, and Airflow is the timer that tells you when to cook.
Prerequisites
Before you start, make sure you're comfortable with these. If not, we link to the exact courses below; finish those first, then come back.
| Skill | What you should know | Course |
| --- | --- | --- |
| Basic SQL | SELECT, WHERE, JOIN, GROUP BY, ORDER BY. You should be able to write a query that joins two tables and counts rows. | PostgreSQL Course |
| Command line basics | Open a terminal, navigate folders (cd), run commands. Nothing fancy. | - |
| DBT intro | What DBT is, how to run dbt run and dbt test. You'll learn the rest here. | DBT Course |
| Airflow intro | What a DAG is, how to define tasks. You don't need to be an expert; we'll walk through it. | Airflow Course |
Step-by-Step Plan
Here's the high-level roadmap. The Build It tab walks you through every single step with copy-paste code, screenshots, and explanations.
1. Set up the warehouse. Install PostgreSQL (or connect to Redshift), create schemas for bronze, silver, and gold.
2. Load raw data (Bronze). Download sample CSVs (orders, customers, products) and load them into bronze tables using SQL COPY or a Python script.
3. Initialize DBT. Create a new DBT project, configure the connection to your database, and verify it works.
4. Build Silver models. Write DBT SQL models that clean, deduplicate, and type-cast each bronze table into silver.
5. Build Gold models. Create business-level aggregates (sales_by_region, customer_lifetime_value) from silver tables.
6. Add DBT tests. Write schema tests (unique, not_null) and a custom test to ensure referential integrity.
7. Orchestrate with Airflow. Create a DAG that runs dbt run then dbt test on a daily schedule.
8. Verify & celebrate. Query your Gold tables, check the pipeline ran, and admire what you built!
Set Up the Warehouse
Install PostgreSQL and create schemas
What we're doing and why
A warehouse needs rooms. We create three "schemas" (think: rooms) inside one database: bronze (raw data dump), silver (cleaned data), and gold (business-ready tables). This separation keeps things organized so you always know what's raw and what's ready.
1a. Install PostgreSQL
If you already have Postgres, skip ahead. Otherwise:
- Mac: Open a terminal and run brew install postgresql@15, then brew services start postgresql@15.
- Windows: Download the installer from postgresql.org/download and follow the wizard. Use the default port 5432 and remember your password.
- Linux: sudo apt install postgresql postgresql-contrib.
Tip
Don't have admin access? Use a free cloud database like Neon or Supabase; both give you a Postgres database in seconds with zero install.
1b. Create the database and schemas
Open a terminal (or pgAdmin, DBeaver, etc.) and connect to Postgres. Then run:
-- Create a dedicated database for our project
CREATE DATABASE shopeasy_warehouse;

-- Connect to it (in psql: \c shopeasy_warehouse)

-- Create three schemas: one per layer
CREATE SCHEMA bronze;
CREATE SCHEMA silver;
CREATE SCHEMA gold;
What just happened
We made a new database called shopeasy_warehouse. Inside it, we made three "folders" (schemas). bronze will hold raw CSVs. silver will hold cleaned versions. gold will hold the final aggregated tables that dashboards read from. This is the medallion pattern used at companies like Netflix and Airbnb.
1c. Verify
Run this to see your schemas:
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name IN ('bronze', 'silver', 'gold');
You should see three rows: bronze, silver, gold. If yes, step 1 is done!
Step 1 complete!
- PostgreSQL installed and running
- Database shopeasy_warehouse created
- Three schemas: bronze, silver, gold
Load Raw Data (Bronze Layer)
Download sample CSVs and load into bronze tables
What we're doing and why
We need data to work with! In real life, data comes from apps, APIs, or file drops. For this project, we'll use three sample CSV files that mimic an e-commerce system: orders, customers, and products. We load them "as-is" into bronze, with no cleaning yet. Bronze = raw groceries.
2a. Download the sample CSVs
Create a folder on your computer (e.g. ~/shopeasy_data/) and save these three files there. You can create them yourself or copy-paste the content below.
customers.csv
customer_id,name,email,region,created_at
1,Alice Johnson,alice@email.com,North,2024-01-15
2,Bob Smith,bob@email.com,South,2024-02-20
3,Charlie Davis,charlie@email.com,East,2024-01-10
4,Diana Lee,diana@email.com,West,2024-03-05
5,Eve Martinez,eve@email.com,North,2024-04-12
6,Frank Brown,frank@email.com,South,2024-05-01
7,Grace Wilson,grace@email.com,East,2024-02-28
8,Hank Taylor,hank@email.com,West,2024-06-15
9,Ivy Chen,ivy@email.com,North,2024-07-20
10,Jack White,jack@email.com,South,2024-08-01
products.csv
product_id,product_name,category,price
101,Wireless Mouse,Electronics,29.99
102,USB-C Hub,Electronics,49.99
103,Cotton T-Shirt,Clothing,19.99
104,Running Shoes,Clothing,89.99
105,Desk Lamp,Home,34.99
106,Coffee Maker,Home,79.99
107,Bluetooth Speaker,Electronics,59.99
108,Yoga Mat,Home,24.99
orders.csv
order_id,customer_id,product_id,quantity,amount,order_date,status
1001,1,101,2,59.98,2024-06-01,completed
1002,2,103,1,19.99,2024-06-02,completed
1003,3,106,1,79.99,2024-06-03,completed
1004,1,105,1,34.99,2024-06-05,completed
1005,4,102,3,149.97,2024-06-07,pending
1006,5,107,1,59.99,2024-06-10,completed
1007,6,104,2,179.98,2024-06-12,completed
1008,7,108,1,24.99,2024-06-15,returned
1009,8,101,1,29.99,2024-06-18,completed
1010,9,103,4,79.96,2024-06-20,completed
1011,10,106,1,79.99,2024-06-22,completed
1012,1,107,1,59.99,2024-06-25,completed
1013,3,102,2,99.98,2024-06-28,pending
1014,5,105,1,34.99,2024-07-01,completed
1015,2,104,1,89.99,2024-07-03,completed
2b. Create bronze tables
-- Bronze tables: store raw data exactly as it comes
CREATE TABLE bronze.customers (
customer_id INTEGER,
name TEXT,
email TEXT,
region TEXT,
created_at TEXT -- text on purpose: raw = no type assumptions
);
CREATE TABLE bronze.products (
product_id INTEGER,
product_name TEXT,
category TEXT,
price TEXT -- raw price as text
);
CREATE TABLE bronze.orders (
order_id INTEGER,
customer_id INTEGER,
product_id INTEGER,
quantity INTEGER,
amount TEXT, -- raw amount as text
order_date TEXT, -- raw date as text
status TEXT
);
Why TEXT for everything?
In the bronze layer we keep data as raw as possible β even numbers like price come in as text. Why? Because source systems can send weird stuff: "$29.99" with a dollar sign, "N/A" instead of a number, dates in different formats. By storing as TEXT first, nothing breaks on load. We fix types later in silver. This is a best practice used in production pipelines.
2c. Load CSV files into bronze
Using psql (replace the path with where you saved your CSVs):
\COPY bronze.customers FROM '/path/to/customers.csv' WITH (FORMAT csv, HEADER true);
\COPY bronze.products FROM '/path/to/products.csv' WITH (FORMAT csv, HEADER true);
\COPY bronze.orders FROM '/path/to/orders.csv' WITH (FORMAT csv, HEADER true);
Tip: Using DBeaver or pgAdmin?
Right-click the table, choose Import Data, and pick your CSV. Make sure "Header" is checked. Same result, just with a GUI instead of commands.
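Prefer Python over psql? The plan in step 2 mentioned a script as an alternative to \COPY. Here's a minimal sketch using psycopg2; the paths and connection details are placeholders, so adjust them to your setup:

import psycopg2  # pip install psycopg2-binary

# Placeholder connection details; match them to your database
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="shopeasy_warehouse",
    user="your_username", password="your_password",
)

with conn, conn.cursor() as cur:
    for table in ["customers", "products", "orders"]:
        with open(f"/path/to/{table}.csv") as f:  # your CSV folder
            # copy_expert streams the file through COPY, like \COPY in psql
            cur.copy_expert(
                f"COPY bronze.{table} FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
conn.close()
print("Bronze load complete")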
2d. Verify the load
SELECT 'customers' AS table_name, COUNT(*) AS rows FROM bronze.customers
UNION ALL
SELECT 'products', COUNT(*) FROM bronze.products
UNION ALL
SELECT 'orders', COUNT(*) FROM bronze.orders;
Expected: customers = 10, products = 8, orders = 15. If your counts match, the bronze layer is done!
Step 2 complete!
- Three CSV files created / downloaded
- Bronze tables created with TEXT columns
- Data loaded and verified (10 + 8 + 15 rows)
Initialize DBT Project
Set up DBT and connect it to your database
What is DBT?
DBT (Data Build Tool) lets you write SQL files, and it runs them in the right order against your database. Think of it as a recipe book for your data: each SQL file is a recipe that says "take this table, transform it, and save the result as a new table." DBT also lets you add tests (e.g. "order_id must be unique") and documentation.
3a. Install DBT
pip install dbt-postgres
This installs DBT with the Postgres adapter. If using Redshift, run pip install dbt-redshift instead.
3b. Create a new DBT project
dbt init shopeasy_dbt
cd shopeasy_dbt
DBT will ask you which database you want to use; pick postgres. It creates a folder structure like this:
shopeasy_dbt/
├── dbt_project.yml    <-- project config
├── models/            <-- your SQL files go here
│   └── example/
├── tests/
└── macros/
profiles.yml           <-- database connection (lives in ~/.dbt/, not in the project)
3c. Configure the connection
Open ~/.dbt/profiles.yml (DBT creates it during init) and make sure it looks like this:
shopeasy_dbt:
target: dev
outputs:
dev:
type: postgres
host: localhost
port: 5432
user: your_username
password: your_password
dbname: shopeasy_warehouse
schema: silver # DBT writes to silver by default
threads: 4
Common mistake
If you get "connection refused", make sure PostgreSQL is running (brew services list on Mac, sudo systemctl status postgresql on Linux). Also double-check the username and password.
3d. Verify the connection
dbt debug
If everything is green (all checks passed), your DBT project is connected to the database!
3e. Set up the models folder
Delete the example folder and create our structure:
rm -rf models/example
mkdir -p models/silver
mkdir -p models/gold
Step 3 complete!
- DBT installed (dbt-postgres)
- Project initialized (shopeasy_dbt)
- Connection verified with dbt debug
- Folder structure ready: models/silver, models/gold
Build Silver Models (Clean Layer)
Write DBT models to clean and type-cast bronze data
What's a "model" in DBT?
A model is just a .sql file. When you run dbt run, DBT executes each file and saves the result as a table (or view) in your database. The file name becomes the table name. So silver_customers.sql creates a table called silver_customers.
4a. silver_customers.sql
Create the file models/silver/silver_customers.sql:
{{ config(materialized='table', schema='silver') }}
SELECT
customer_id,
TRIM(name) AS customer_name,
LOWER(TRIM(email)) AS email,
INITCAP(TRIM(region)) AS region,
CAST(created_at AS DATE) AS created_at
FROM {{ source('bronze', 'customers') }}
WHERE customer_id IS NOT NULL
Line-by-line explanation
- config(materialized='table'): tells DBT to save the result as a real table (not just a view). schema='silver': save it in the silver schema.
- TRIM(name): removes spaces from the start and end (e.g. " Alice " → "Alice").
- LOWER(email): makes the email lowercase (e.g. "Bob@Email.COM" → "bob@email.com").
- INITCAP(region): capitalizes the first letter (e.g. "north" → "North").
- CAST(… AS DATE): converts the text date to a proper DATE type.
- WHERE customer_id IS NOT NULL: drops any rows where customer_id is missing; bad data shouldn't reach silver.
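One gotcha before you run this: by default DBT combines the target schema from profiles.yml with a model's custom schema, so schema='silver' on top of a target schema of silver creates tables in silver_silver (and the gold models would land in silver_gold). The standard fix is to override the generate_schema_name macro so a custom schema is used as-is. Save this as macros/generate_schema_name.sql in your project:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- Use the custom schema verbatim; fall back to the target schema -#}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

With this in place, silver and gold models land in the schemas you created in step 1.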
4b. silver_products.sql
{{ config(materialized='table', schema='silver') }}
SELECT
product_id,
TRIM(product_name) AS product_name,
INITCAP(TRIM(category)) AS category,
CAST(price AS DECIMAL(10,2)) AS price
FROM {{ source('bronze', 'products') }}
WHERE product_id IS NOT NULL
AND price IS NOT NULL
4c. silver_orders.sql
{{ config(materialized='table', schema='silver') }}
SELECT
order_id,
customer_id,
product_id,
quantity,
CAST(amount AS DECIMAL(10,2)) AS amount,
CAST(order_date AS DATE) AS order_date,
LOWER(TRIM(status)) AS status
FROM {{ source('bronze', 'orders') }}
WHERE order_id IS NOT NULL
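Step 4's plan promised deduplication, but the models above don't dedupe yet, and bronze can easily pick up duplicates (for example, if you run \COPY twice). Here's a sketch of the standard pattern, using row_number() to keep one row per order_id, which you could fold into silver_orders:

-- Keep exactly one row per order_id; which duplicate survives is arbitrary here
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_id) AS rn
    FROM {{ source('bronze', 'orders') }}
)
SELECT order_id, customer_id, product_id, quantity, amount, order_date, status
FROM ranked
WHERE rn = 1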
4d. Define the bronze source
DBT needs to know where bronze tables live. Create models/silver/sources.yml:
version: 2
sources:
- name: bronze
schema: bronze
tables:
- name: customers
- name: products
- name: orders
4e. Run the Silver models
dbt run --select silver
You should see three models pass. Check in psql:
SELECT * FROM silver.silver_customers LIMIT 5;
SELECT * FROM silver.silver_orders LIMIT 5;
Step 4 complete!
- Three silver models created and executed
- Data types are now correct (DATE, DECIMAL)
- Whitespace trimmed, emails lowercased, regions capitalized
- Null key rows filtered out
Build Gold Models (Business Layer)
Create aggregated tables for dashboards and reports
What are Gold models?
Gold tables answer business questions directly. Instead of "give me all orders and I'll figure it out", Gold gives you "here's total revenue by region, already calculated." Dashboards and analysts read from Gold; they never touch Bronze or Silver.
5a. gold_sales_by_region.sql
{{ config(materialized='table', schema='gold') }}
SELECT
c.region,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.amount) AS total_revenue,
ROUND(AVG(o.amount), 2) AS avg_order_value
FROM {{ ref('silver_orders') }} o
JOIN {{ ref('silver_customers') }} c
ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.region
ORDER BY total_revenue DESC
What this gives you
One row per region (North, South, East, West) with: how many orders, how many unique customers, total revenue, and average order value. This is exactly what a "Sales by Region" dashboard card would read from.
5b. gold_top_products.sql
{{ config(materialized='table', schema='gold') }}
SELECT
p.product_id,
p.product_name,
p.category,
COUNT(o.order_id) AS times_ordered,
SUM(o.quantity) AS total_units_sold,
SUM(o.amount) AS total_revenue
FROM {{ ref('silver_orders') }} o
JOIN {{ ref('silver_products') }} p
ON o.product_id = p.product_id
WHERE o.status = 'completed'
GROUP BY p.product_id, p.product_name, p.category
ORDER BY total_revenue DESC
5c. gold_customer_lifetime_value.sql
{{ config(materialized='table', schema='gold') }}
SELECT
c.customer_id,
c.customer_name,
c.region,
COUNT(o.order_id) AS total_orders,
SUM(o.amount) AS lifetime_value,
MIN(o.order_date) AS first_order,
MAX(o.order_date) AS last_order
FROM {{ ref('silver_customers') }} c
LEFT JOIN {{ ref('silver_orders') }} o
ON c.customer_id = o.customer_id
AND o.status = 'completed'
GROUP BY c.customer_id, c.customer_name, c.region
ORDER BY lifetime_value DESC NULLS LAST
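One nuance here: because of the LEFT JOIN, a customer with no completed orders (like Diana, whose only order is pending) gets a NULL lifetime_value, which is why the ORDER BY says NULLS LAST. If you'd rather report 0, wrap the sum in COALESCE:

-- Optional: show 0 instead of NULL for customers with no completed orders
COALESCE(SUM(o.amount), 0) AS lifetime_value,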
5d. Run Gold models
dbt run --select gold
Step 5 complete!
- gold.gold_sales_by_region: revenue, orders, and average order value by region
- gold.gold_top_products: best-selling products by revenue
- gold.gold_customer_lifetime_value: each customer's total spend
Add DBT Tests
Make sure your data is correct with automated tests
Why test data?
Imagine you shipped a dashboard and the CEO sees duplicate orders inflating revenue. Embarrassing! Tests catch that before it happens. DBT tests run SQL checks: "is order_id unique?", "is customer_id never null?", etc. If a test fails, the pipeline stops and alerts you.
6a. Schema tests
Create models/silver/schema.yml:
version: 2
models:
- name: silver_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: email
tests:
- unique
- not_null
- name: silver_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('silver_customers')
field: customer_id
- name: silver_products
columns:
- name: product_id
tests:
- unique
- not_null
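The plan also called for a custom test. In DBT, a "singular" test is just a SQL file in the tests/ folder; it passes when the query returns zero rows. Here's a sketch that checks every order points at a real product (the file name is our choice), saved as tests/assert_orders_have_valid_products.sql:

-- Fails if any order references a product_id missing from silver_products
SELECT o.order_id, o.product_id
FROM {{ ref('silver_orders') }} o
LEFT JOIN {{ ref('silver_products') }} p
    ON o.product_id = p.product_id
WHERE p.product_id IS NULL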
6b. Run the tests
dbt test
All tests should pass. If one fails, DBT tells you exactly which test and which rows failed; fix the data or the model and re-run.
Step 6 complete!
- Uniqueness tests on primary keys
- Not-null tests on critical columns
- Referential integrity test (orders → customers)
Orchestrate with Airflow
Automate the pipeline to run daily
Why Airflow?
Right now you run dbt run by hand. In production, the pipeline should run automatically every day at 6 AM (before anyone checks dashboards). Airflow is the tool that schedules and monitors these runs. If something fails, it retries and sends you an alert.
7a. Create the DAG file
Save this as dags/shopeasy_pipeline.py in your Airflow home:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
with DAG(
dag_id='shopeasy_warehouse',
start_date=datetime(2024, 1, 1),
schedule_interval='@daily',
catchup=False,
tags=['dbt', 'warehouse'],
) as dag:
dbt_run = BashOperator(
task_id='dbt_run',
bash_command='cd /path/to/shopeasy_dbt && dbt run',
)
dbt_test = BashOperator(
task_id='dbt_test',
bash_command='cd /path/to/shopeasy_dbt && dbt test',
)
dbt_run >> dbt_test
What this DAG does
schedule_interval='@daily': Runs once per day. dbt_run: Executes all your models (silver + gold). dbt_test: Runs all tests after the models finish. The >> means "run dbt_test only after dbt_run succeeds." If dbt_run fails, dbt_test doesn't run.
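The intro above mentioned retries and alerts, but the DAG as written has neither. A sketch of how you could add them with default_args; the email address is a placeholder, and email alerts only work once SMTP is configured in Airflow:

from datetime import timedelta

default_args = {
    'retries': 2,                         # re-run a failed task up to twice
    'retry_delay': timedelta(minutes=5),  # wait 5 minutes between attempts
    'email_on_failure': True,             # requires [smtp] settings in airflow.cfg
    'email': ['data-team@example.com'],   # placeholder address
}

# then pass it in: with DAG(..., default_args=default_args) as dag: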
Don't have Airflow installed?
For learning purposes, you can skip this step and run dbt run && dbt test manually. When you're ready, check our Airflow course for full setup instructions.
Step 7 complete!
- Airflow DAG created with two tasks
- dbt_run → dbt_test (sequential)
- Runs daily automatically
Verify & Celebrate!
Query your Gold tables and see the finished product
Let's query the Gold tables and see what we built:
Sales by region
SELECT * FROM gold.gold_sales_by_region;
Top products
SELECT * FROM gold.gold_top_products;
Customer lifetime value
SELECT * FROM gold.gold_customer_lifetime_value;
Project complete!
You just built a full data warehouse with the medallion pattern! Here's what you can put on your resume:
- Designed and implemented a 3-layer medallion data warehouse (Bronze/Silver/Gold)
- Built DBT models with automated tests for data quality
- Created an Airflow DAG for daily pipeline orchestration
- Produced business-ready analytics tables (sales by region, CLV, top products)
What's next?
- Connect Metabase to your Gold schema and build a dashboard
- Add more source tables (e.g. website clicks, inventory)
- Try the Real-Time Sales Dashboard project next
- Add incremental models in DBT (process only new rows each day)
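To get you started on that last item, here's a minimal sketch of an incremental version of silver_orders. The first run builds the whole table; after that, the is_incremental() block only processes rows newer than what's already loaded (assuming order_date is a reliable watermark):

{{ config(materialized='incremental', unique_key='order_id', schema='silver') }}

SELECT
    order_id,
    customer_id,
    product_id,
    quantity,
    CAST(amount AS DECIMAL(10,2)) AS amount,
    CAST(order_date AS DATE) AS order_date,
    LOWER(TRIM(status)) AS status
FROM {{ source('bronze', 'orders') }}
WHERE order_id IS NOT NULL
{% if is_incremental() %}
  -- only rows newer than the latest order_date already in this table
  AND CAST(order_date AS DATE) > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}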