LESSON 5 · CORE

๐Ÿ—๏ธ Models & Materializations โ€” The Heart of dbt

Every SQL file is a model. Every model becomes a table or view. Let's master this.

📄 What is a Model?

A model is just a SQL SELECT statement saved in a file. That's it. Seriously, that's the whole thing.

dbt takes your SELECT and turns it into a real table or view in your warehouse. It's like writing a recipe on a card: dbt reads the card and cooks the dish for you. You write the instructions, dbt does the cooking!

Let's look at the simplest possible model. Create a file called my_first_model.sql inside your models/ folder:

models/my_first_model.sql
-- This entire file IS the model. Nothing else needed!

SELECT
    1 AS id,
    'Hello dbt!' AS greeting

When you run dbt run, dbt reads this file and executes something like:

What dbt actually runs in your warehouse
CREATE VIEW my_first_model AS (
    SELECT
        1 AS id,
        'Hello dbt!' AS greeting
);

The Golden Rule of dbt Models:

📄 One .sql file = One model = One table or view in your warehouse

If you have 50 SQL files in your models/ folder, running dbt run creates 50 tables/views. It's that simple. No config files, no registration, no boilerplate: just write SQL and go.

Here's a more realistic model, a staging model that cleans up raw customer data:

models/staging/stg_customers.sql
SELECT
    id            AS customer_id,
    first_name,
    last_name,
    email_addr    AS email,
    created_at::DATE AS signup_date,
    CASE
        WHEN status = 'A' THEN 'active'
        WHEN status = 'I' THEN 'inactive'
        ELSE 'unknown'
    END AS customer_status
FROM
    {{ source('raw', 'customers') }}

Notice there's no CREATE TABLE or CREATE VIEW in the file. You only write the SELECT. dbt wraps it in the appropriate DDL for you based on the materialization you choose. That's the magic!
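To make that wrapping concrete, here's roughly the DDL dbt would run if stg_customers were configured as a table instead of a view. This is a sketch: the exact syntax varies by warehouse, the dev schema name is an assumption, and dbt actually builds into a temporary relation and swaps it in atomically.

```sql
-- Rough sketch of what dbt runs for a table materialization.
-- Your SELECT is wrapped in CREATE TABLE ... AS:
CREATE TABLE dev.stg_customers AS (
    SELECT
        id            AS customer_id,
        first_name,
        last_name,
        email_addr    AS email,
        created_at::DATE AS signup_date
    FROM raw.customers
);
```

Swap CREATE TABLE for CREATE VIEW and you have the view materialization; the SELECT itself never changes.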

🧱 Materializations: How dbt Builds Your Models

A "materialization" is just a fancy word for HOW dbt creates your model in the warehouse. Think of it like choosing how to serve food:

🪟 View = Look through a window (query runs fresh every time)

📸 Table = Print a photo (data is stored physically)

📸+ Incremental = Add new photos to an album (only process new data)

📝 Ephemeral = A sticky note (exists only temporarily)

📷 Snapshot = A daily photo diary (tracks changes over time)

dbt has 5 materialization types. Choosing the right one is one of the most important decisions you'll make. Let's deep-dive into each one.

🪟 Materialization 1: View (Default)

A view is like a window into your data. Every time someone looks through the window, the query runs fresh. Nothing is stored; it's just a saved query. Imagine a window in your house that always shows the live street outside. You don't take a photo of the street; you just look through the window whenever you want to see what's happening right now.

models/staging/stg_orders.sql
{{ config(materialized='view') }}

-- This creates a VIEW in your warehouse
-- The query runs fresh every time someone queries stg_orders

SELECT
    id         AS order_id,
    user_id    AS customer_id,
    order_date,
    status,
    amount / 100.0 AS amount_dollars
FROM
    {{ source('raw', 'orders') }}

When to use views:

✅ Pros

  • No storage cost (nothing is physically stored)
  • Always shows the latest data
  • Builds instantly (just saves the query)
  • Perfect for staging layers

โŒ Cons

  • Slow to query (runs the full SQL every time)
  • Stacks up: if view A references view B which references view C, querying A runs all three
  • Not suitable for dashboards with many users

📸 Materialization 2: Table

A table is like printing a photo. The data is physically stored in your warehouse. It's fast to look at (no waiting for the camera to take a new picture), but you need to reprint the entire photo (rebuild the table) whenever the underlying data changes.

models/marts/fct_daily_revenue.sql
{{ config(materialized='table') }}

-- This creates a physical TABLE in your warehouse
-- Data is stored โ€” queries are fast!

SELECT
    DATE_TRUNC('day', order_date) AS revenue_date,
    COUNT(*) AS total_orders,
    SUM(amount_dollars) AS total_revenue,
    AVG(amount_dollars) AS avg_order_value
FROM
    {{ ref('stg_orders') }}
WHERE
    status = 'completed'
GROUP BY 1

When to use tables:

✅ Pros

  • Fast to query (data is pre-computed)
  • Great for dashboards and BI tools
  • No cascading query chains

โŒ Cons

  • Uses warehouse storage
  • Full rebuild every dbt run (drops and recreates)
  • Rebuild can be slow for large datasets

View vs Table: The Restaurant Analogy

🪟 View = A made-to-order dish. The chef cooks it fresh every time someone orders. Great for quality, but slow if 100 people order at once.

📸 Table = A buffet dish. The chef cooks a big batch in advance. Everyone grabs a plate instantly. But if the recipe changes, you need to cook a whole new batch.

📸+ Materialization 3: Incremental

Imagine you have a photo album with 10,000 photos. Every day, you take 50 new photos. Would you reprint ALL 10,000 photos every single day? Of course not! You'd only print the 50 new ones and add them to the album.

That's exactly what incremental models do. Instead of rebuilding the entire table every run, dbt only processes the new or changed rows and inserts/merges them. This saves massive amounts of time and compute cost.

models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

SELECT
    event_id,
    user_id,
    event_type,
    event_timestamp,
    page_url,
    device_type
FROM
    {{ ref('stg_events') }}

-- This WHERE clause is the magic!
-- It is SKIPPED on the first build (full initial load)
-- On every later run, it filters to only NEW rows
{% if is_incremental() %}
    WHERE event_timestamp > (
        SELECT MAX(event_timestamp)
        FROM {{ this }}
    )
{% endif %}

How It Works: Step by Step

Full Refresh vs Incremental: Visual Comparison

โŒ Table (Full Rebuild)

Day 1: Process 10,000 rows
Day 2: Process 10,050 rows
Day 3: Process 10,100 rows
Day 4: Process 10,150 rows
Total processed: 40,300 rows

✅ Incremental

Day 1: Process 10,000 rows (first run)
Day 2: Process 50 new rows
Day 3: Process 50 new rows
Day 4: Process 50 new rows
Total processed: 10,150 rows

That's a 75% reduction in processing! For tables with billions of rows, this is the difference between minutes and hours.

The 3 Incremental Strategies

Strategy      | How It Works                                                   | Best For
append        | Simply adds new rows to the table. No duplicate checking.      | Event logs, immutable data (rows never change)
merge         | Inserts new rows AND updates existing rows (using unique_key). | Data that can be updated (orders, user profiles)
delete+insert | Deletes matching rows first, then inserts the new version.     | When merge isn't supported or you need a clean swap
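As a quick sketch of the append strategy, here's what an append-only incremental model for immutable log data might look like. The model and column names (stg_logs, log_id, logged_at) are illustrative, not from this project:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

-- Append-only: no unique_key needed, because log rows never change
SELECT
    log_id,
    logged_at,
    message
FROM {{ ref('stg_logs') }}

{% if is_incremental() %}
    -- Only pick up rows newer than what's already in the table
    WHERE logged_at > (SELECT MAX(logged_at) FROM {{ this }})
{% endif %}
```

Because append never checks for duplicates, use it only when upstream rows are truly immutable; otherwise reach for merge with a unique_key.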

The Full-Refresh Escape Hatch: If your incremental model gets out of sync or you change the logic, you can force a complete rebuild with:

dbt run --full-refresh -s fct_events

This drops the existing table and rebuilds it from scratch, like reprinting the entire photo album. Use it when you change column definitions or fix bugs in your incremental logic.

The {{ this }} variable is special: it refers to the current model's table in the warehouse. So SELECT MAX(event_timestamp) FROM {{ this }} gets the latest timestamp already in your table, and you only process rows newer than that.
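To see how the pieces resolve, here's roughly what the fct_events model compiles to on a subsequent (incremental) run. The dev schema name is an assumption; your target schema may differ:

```sql
-- Sketch of fct_events compiled on an incremental run:
-- {{ ref('stg_events') }} and {{ this }} resolve to real relations
SELECT
    event_id,
    user_id,
    event_type,
    event_timestamp,
    page_url,
    device_type
FROM dev.stg_events
WHERE event_timestamp > (
    SELECT MAX(event_timestamp)
    FROM dev.fct_events   -- this is {{ this }}: the table being updated
)
```

On the very first build the is_incremental() block is skipped, so the WHERE clause disappears entirely and the full history loads.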

๐Ÿ“ Materialization 4: Ephemeral

An ephemeral model is like a sticky note you use while cooking but throw away after. You write "chop onions, mince garlic" on a sticky note as a reminder, but you don't frame it and hang it on the wall. It exists only to help you while you're cooking.

In dbt, an ephemeral model doesn't create anything in your warehouse. Instead, it gets injected as a CTE (Common Table Expression) inside whatever model references it. It's invisible to your warehouse; only dbt knows it exists.

models/intermediate/int_order_amounts.sql
{{ config(materialized='ephemeral') }}

-- This model won't create a table or view!
-- It becomes a CTE inside any model that ref()s it

SELECT
    order_id,
    customer_id,
    amount_dollars,
    amount_dollars * 0.1 AS tax_amount,
    amount_dollars * 1.1 AS total_with_tax
FROM
    {{ ref('stg_orders') }}

When another model references this ephemeral model, dbt compiles it as a CTE:

What dbt actually compiles (behind the scenes)
-- dbt injects the ephemeral model as a CTE
WITH int_order_amounts AS (
    SELECT
        order_id,
        customer_id,
        amount_dollars,
        amount_dollars * 0.1 AS tax_amount,
        amount_dollars * 1.1 AS total_with_tax
    FROM dev.stg_orders  -- what {{ ref('stg_orders') }} compiles to
)
SELECT * FROM int_order_amounts
WHERE total_with_tax > 100

When to use ephemeral: sparingly, and only for truly intermediate calculations.

Careful! You can't query an ephemeral model directly in your SQL editor (it doesn't exist in the warehouse), and you can't run dbt test on it directly.

📷 Materialization 5: Snapshot

Snapshots are like taking a photo of your fridge every day. Monday: 6 eggs, whole milk, cheddar cheese. Tuesday: 4 eggs, whole milk, swiss cheese. The milk brand didn't change, but the cheese did! By looking at all your daily photos, you can see exactly what changed and when.

In data terms, this is called Slowly Changing Dimension Type 2 (SCD Type 2). Don't let the fancy name scare you โ€” it just means "keep a history of how rows change over time."

snapshots/snapshot_customers.sql
{% snapshot snapshot_customers %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT
    customer_id,
    email,
    plan_type,
    updated_at
FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
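The timestamp strategy above relies on a trustworthy updated_at column. When a source lacks one, dbt also offers a check strategy that compares the listed columns between runs. A sketch with a hypothetical products source (product names and columns are illustrative):

```sql
{% snapshot snapshot_products %}

{{
    config(
        target_schema='snapshots',
        unique_key='product_id',
        strategy='check',
        check_cols=['price', 'category'],
    )
}}

-- A new history row is written whenever price or category changes
SELECT
    product_id,
    price,
    category
FROM {{ source('raw', 'products') }}

{% endsnapshot %}
```

Prefer timestamp when a reliable updated_at exists; it's cheaper because dbt compares one column instead of several.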

SCD Type 2 โ€” How It Tracks Changes

Imagine customer #42 changes their email and upgrades their plan. Here's what the snapshot table looks like over time:

customer_id | email         | plan_type | dbt_valid_from | dbt_valid_to
42          | old@email.com | free      | 2024-01-01     | 2024-03-15
42          | old@email.com | pro       | 2024-03-15     | 2024-06-20
42          | new@email.com | pro       | 2024-06-20     | null (current)

The row with dbt_valid_to = null is always the current version. Previous versions have an end date. It's like a time machine for your data: you can ask "What was customer 42's email on February 1st?" and get the answer!
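Answering that "time machine" question is just a range filter on the validity columns. A sketch, using the schema and snapshot name from the config above:

```sql
-- What was customer 42's email on 2024-02-01?
SELECT email
FROM snapshots.snapshot_customers
WHERE customer_id = 42
  AND dbt_valid_from <= '2024-02-01'
  AND (dbt_valid_to > '2024-02-01' OR dbt_valid_to IS NULL)
```

Against the example rows above, this returns old@email.com, because that version was valid from 2024-01-01 until 2024-03-15.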

When to use snapshots:

📊 Tracking customer plan changes for churn analysis

💰 Tracking product price changes over time

🏢 Tracking employee department transfers

📦 Tracking order status changes (pending → shipped → delivered)

🌳 The Materialization Decision Tree

Not sure which materialization to pick? Follow this flowchart:

Start Here: What kind of model?

🔍 Is it a staging model? (light cleaning, renaming) → View

📊 Is the table huge? (>1M rows: events, logs, transactions) → Incremental

📷 Do you need historical tracking? (what changed and when) → Snapshot

📝 Is it only used inside other models? (helper logic, no direct queries) → Ephemeral

🍽️ Otherwise: mart models and dashboards that need to be fast to query → Table

When in doubt, start with view (it's the default). If queries are too slow, upgrade to table. If the table is too big to rebuild every run, upgrade to incremental. You can always change the materialization later; it's just one line of config!

๐Ÿ›๏ธ The 3-Layer Architecture

Professional dbt projects organize models into three layers, like a restaurant kitchen with three stations. Raw ingredients come in, get prepped, get cooked, and come out as a beautiful dish. Every model belongs to one of these layers.

🥬 Staging (Prep Station)
Clean, rename, cast
materialized = view
→
🍳 Intermediate (Cooking Station)
Join, combine, enrich
materialized = ephemeral / view
→
🍽️ Marts (Plating Station)
Final tables for dashboards
materialized = table / incremental
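In practice you rarely repeat the config() line in every file. You can set layer-wide materialization defaults in dbt_project.yml and override per model only when needed. A sketch (the project name my_project is a placeholder for your own):

```yaml
# dbt_project.yml (excerpt): folder-level materialization defaults.
# Folder keys mirror subdirectories of models/.
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

A {{ config(materialized='...') }} line inside a model always wins over these folder defaults.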

🥬 Layer 1: Staging (stg_)

Purpose: One staging model per source table. Clean up the raw data: rename ugly columns, cast data types, filter out junk. No joins, no aggregations, no business logic.

Think of it like washing vegetables. You don't cook them yet. You just wash off the dirt, peel the skin, and chop them into uniform pieces so they're ready for the next station.

models/staging/stg_customers.sql
{{ config(materialized='view') }}

SELECT
    id            AS customer_id,
    first_name,
    last_name,
    LOWER(email_addr)  AS email,
    created_at::DATE AS signup_date,
    CASE
        WHEN status = 'A' THEN 'active'
        WHEN status = 'I' THEN 'inactive'
        ELSE 'unknown'
    END AS customer_status
FROM
    {{ source('raw', 'customers') }}

๐Ÿณ Layer 2: Intermediate (int_)

Purpose: Combine staging models. Join tables, apply business logic, create calculated fields. This is where the real cooking happens.

models/intermediate/int_customer_orders.sql
{{ config(materialized='ephemeral') }}

SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    c.signup_date,
    o.order_id,
    o.order_date,
    o.amount_dollars,
    o.status AS order_status
FROM
    {{ ref('stg_customers') }} c
LEFT JOIN
    {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id

๐Ÿฝ๏ธ Layer 3: Marts (fct_ and dim_)

Purpose: The final, polished tables that your dashboards and analysts query. Marts come in two flavors:

📊 Fact Tables (fct_)

Events that happened: orders placed, payments made, clicks recorded. They have numbers you can count and sum.

Examples: fct_orders, fct_payments, fct_page_views

👤 Dimension Tables (dim_)

Descriptive attributes: customer details, product info, store locations. They describe the "who, what, where" of your facts.

Examples: dim_customers, dim_products, dim_stores

models/marts/fct_customer_summary.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    first_name,
    last_name,
    email,
    signup_date,
    COUNT(order_id) AS total_orders,
    SUM(amount_dollars) AS lifetime_revenue,
    AVG(amount_dollars) AS avg_order_value,
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date,
    CASE
        WHEN COUNT(order_id) >= 10 THEN 'vip'
        WHEN COUNT(order_id) >= 5  THEN 'regular'
        WHEN COUNT(order_id) >= 1  THEN 'new'
        ELSE 'prospect'
    END AS customer_tier
FROM
    {{ ref('int_customer_orders') }}
GROUP BY
    customer_id, first_name, last_name, email, signup_date

Naming Conventions Cheat Sheet

Prefix | Layer             | Materialization     | Example             | Purpose
stg_   | Staging           | View                | stg_customers       | Clean raw data, rename columns
int_   | Intermediate      | Ephemeral / View    | int_customer_orders | Join and combine staging models
fct_   | Marts (Fact)      | Table / Incremental | fct_orders          | Event/transaction tables with metrics
dim_   | Marts (Dimension) | Table               | dim_customers       | Descriptive lookup tables

Real-World Example: E-Commerce Data Flow

Let's trace how raw e-commerce data flows through all 3 layers:

📦 Raw Sources
→
🥬 stg_customers, stg_orders, stg_products
→
🍳 int_customer_orders, int_order_products
→
🍽️ fct_orders, dim_customers, dim_products
→
📊 Dashboards & Reports

๐Ÿ› ๏ธ Hands-On: Build Your First 3 Models

Time to get your hands dirty! Let's build a complete 3-layer pipeline from scratch. We'll create a staging model, an intermediate model, and a mart model.

Step 1: Create the Staging Model

Create the file models/staging/stg_customers.sql:

models/staging/stg_customers.sql
{{ config(materialized='view') }}

SELECT
    id            AS customer_id,
    first_name,
    last_name,
    LOWER(email)   AS email,
    created_at::DATE AS signup_date
FROM
    {{ source('raw', 'customers') }}

Step 2: Create the Intermediate Model

Create the file models/intermediate/int_customer_orders.sql:

models/intermediate/int_customer_orders.sql
{{ config(materialized='ephemeral') }}

SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    c.signup_date,
    o.order_id,
    o.order_date,
    o.amount_dollars
FROM
    {{ ref('stg_customers') }} c
LEFT JOIN
    {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id

Step 3: Create the Mart Model

Create the file models/marts/fct_customer_summary.sql:

models/marts/fct_customer_summary.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    first_name,
    last_name,
    email,
    signup_date,
    COUNT(order_id)       AS total_orders,
    COALESCE(SUM(amount_dollars), 0) AS lifetime_revenue,
    COALESCE(AVG(amount_dollars), 0) AS avg_order_value,
    MIN(order_date)       AS first_order_date,
    MAX(order_date)       AS last_order_date
FROM
    {{ ref('int_customer_orders') }}
GROUP BY
    customer_id, first_name, last_name, email, signup_date

Step 4: Run It!

Open your terminal and run:

Terminal
$ dbt run

Running with dbt=1.7.0
Found 3 models, 0 tests, 0 snapshots, 0 seeds

Concurrency: 4 threads (target='dev')

1 of 2 START view model dev.stg_customers .............. [RUN]
1 of 2 OK created view model dev.stg_customers .......... [CREATE VIEW in 0.12s]

2 of 2 START table model dev.fct_customer_summary ........ [RUN]
2 of 2 OK created table model dev.fct_customer_summary ... [SELECT 1,247 in 0.45s]

Finished running 1 view model, 1 table model in 0 hours 0 minutes and 0.89 seconds (0.89s).
Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Notice the output says TOTAL=2, not 3! That's because int_customer_orders is ephemeral: it doesn't create anything in the warehouse. It was silently injected as a CTE inside fct_customer_summary. Sneaky and efficient!

📊 Materialization Comparison Table

Here's the ultimate cheat sheet. Bookmark this โ€” you'll come back to it often!

Feature               | View                      | Table               | Incremental          | Ephemeral    | Snapshot
Creates in warehouse? | Yes (view)                | Yes (table)         | Yes (table)          | No           | Yes (table)
Stores data?          | No                        | Yes                 | Yes                  | No           | Yes
Build speed           | Instant                   | Slow (full rebuild) | Fast (new rows only) | N/A          | Fast (changed rows)
Query speed           | Slow                      | Fast                | Fast                 | N/A          | Fast
Storage cost          | None                      | High                | High                 | None         | High
Tracks history?       | No                        | No                  | No                   | No           | Yes
Best for              | Staging, light transforms | Marts, dashboards   | Large fact tables    | Helper CTEs  | SCD Type 2
Typical layer         | Staging                   | Marts               | Marts                | Intermediate | Snapshots folder

🧠 Quick Quiz: Test Your Understanding!

Let's see if you've mastered models and materializations. Pick the answer you think is correct:

Question 1: What is a dbt model?

A) A Python script that transforms data
B) A SQL SELECT statement saved in a .sql file
C) A YAML configuration file
D) A stored procedure in the warehouse

Question 2: Which materialization should you use for a staging model?

A) Table
B) View
C) Incremental
D) Snapshot

Question 3: You have a table with 500 million event rows that grows by 1 million per day. Which materialization saves the most compute?

A) View
B) Table
C) Incremental
D) Ephemeral

Question 4: What does an ephemeral model create in the warehouse?

A) A view
B) A table
C) A temporary table
D) Nothing (it becomes a CTE inside other models)

Can you explain the difference between a view and a table to a non-technical friend? Try it out loud! If you can explain it using the "window vs photo" analogy, you've truly understood it.

  • A model = a SQL SELECT in a .sql file → becomes a table or view
  • View = saved query, runs fresh every time (staging layer)
  • Table = physically stored data, fast to query (mart layer)
  • Incremental = only processes new/changed rows (large fact tables)
  • 3-Layer Architecture: staging (stg_) → intermediate (int_) → marts (fct_ / dim_)
  • {{ config(materialized='...') }} at the top of a model sets its type
  • Naming conventions: stg_, int_, fct_, dim_
  • Ephemeral = invisible CTE, doesn't create warehouse objects
  • Snapshot = SCD Type 2 history tracking (lives in snapshots/ folder)
  • Incremental strategies: append, merge, delete+insert
  • {{ this }} = refers to the current model's existing table
  • --full-refresh flag to force a complete rebuild of incremental models