LESSON 6 ยท CORE

๐Ÿ”— Sources, Refs & Lineage โ€” Connecting the Dots

How dbt knows which models depend on which, and builds them in the right order

๐Ÿšจ The Problem: Hardcoded Table Names

Imagine your recipe says "use the tomatoes from the blue bowl on the third shelf." What if someone moves the bowl to the second shelf? Or paints it green? Everything breaks!

That's exactly what happens when you hardcode table names in SQL. dbt solves this with two magic words: source() and ref().

Here's what bad SQL looks like โ€” the kind written before dbt existed:

BAD: Hardcoded table names
-- โŒ DON'T DO THIS โ€” hardcoded schema and table names!

SELECT
    c.id        AS customer_id,
    c.first_name,
    o.order_id,
    o.amount
FROM
    raw_production.public.customers c    -- What if the schema changes?
LEFT JOIN
    raw_production.public.orders o       -- What if we move to a new database?
    ON c.id = o.customer_id

What Goes Wrong with Hardcoding?

Scenario What Happens Pain Level
Schema name changesEvery query referencing it breaks๐Ÿ˜ฑ High
Move to a new warehouseRewrite every single SQL file๐Ÿคฏ Extreme
Dev vs Prod environmentsCan't easily switch between them๐Ÿ˜ค Medium
Someone renames a tableNo way to find all references๐Ÿ˜ต High
Build order unknowndbt has no idea which model to build first๐Ÿ’€ Critical

Hardcoding table names is the #1 mistake beginners make. It's like writing someone's street address on every letter instead of using their name โ€” if they move, all your letters go to the wrong place!

๐Ÿ“ก source() โ€” Declaring Your Raw Data

source() is like putting a label on each ingredient delivery: "These tomatoes came from Farm ABC, delivered today." Now dbt knows where every ingredient comes from, and it can check if the delivery is fresh!

A source is raw data that lands in your warehouse from external systems โ€” your app's database, Stripe payments, Google Analytics, etc. dbt doesn't create these tables; they already exist. You just need to tell dbt about them.

Step 1: Declare Sources in YAML

First, create a YAML file that lists all your raw data tables. This is like registering your ingredient suppliers:

models/staging/_sources.yml
version: 2

sources:
  - name: raw_shop              # A friendly name for this group
    database: raw_production     # The actual database name
    schema: public              # The actual schema name
    description: "Raw data from our online shop's PostgreSQL database"

    tables:
      - name: customers
        description: "One row per registered customer"
        columns:
          - name: id
            description: "Primary key"

      - name: orders
        description: "One row per order placed"
        loaded_at_field: _etl_loaded_at    # For freshness checks!
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

      - name: payments
        description: "One row per payment transaction"

Step 2: Use source() in Your Models

Now, instead of hardcoding raw_production.public.customers, you write:

models/staging/stg_customers.sql
SELECT
    id          AS customer_id,
    first_name,
    last_name,
    email_addr  AS email,
    created_at::DATE AS signup_date
FROM
    {{ source('raw_shop', 'customers') }}
    --  โ†‘ friendly name   โ†‘ table name

What happens under the hood: When dbt compiles this, it replaces {{ source('raw_shop', 'customers') }} with the actual table path raw_production.public.customers. If the schema ever changes, you update it in one place (the YAML file), and every model that uses this source is automatically fixed!

Source Freshness: Is My Data Stale?

Remember the freshness block in the YAML above? Run this command to check if your raw data is up-to-date:

Terminal
$ dbt source freshness

Running source freshness checks...
PASS  source raw_shop.customers   # Last loaded 2 hours ago โœ…
WARN  source raw_shop.orders      # Last loaded 14 hours ago โš ๏ธ
ERROR source raw_shop.payments    # Last loaded 30 hours ago โŒ

Source freshness is like checking the expiration date on your milk. If the data is too old, something is probably broken upstream (maybe your ETL pipeline stopped running). Catch it before your dashboards show stale numbers!

๐Ÿช„ ref() โ€” The Magic Glue

ref() is like saying "use the chopped tomatoes from the prep station" instead of "use the tomatoes from the blue bowl on shelf 3." If the prep station moves to a different counter, dbt automatically updates the reference. You never have to worry about where something is โ€” just what it's called.

While source() points to raw external data, ref() points to other dbt models. It's how models talk to each other.

How ref() Works

models/intermediate/int_customer_orders.sql
-- This model DEPENDS on stg_customers and stg_orders
-- ref() tells dbt about these dependencies!

SELECT
    c.customer_id,
    c.first_name,
    c.email,
    o.order_id,
    o.order_date,
    o.amount
FROM
    {{ ref('stg_customers') }} c
LEFT JOIN
    {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id

Why ref() Is Better Than Hardcoding

โŒ Hardcoded

FROM analytics.public.stg_customers

  • Breaks if schema changes
  • dbt can't detect dependencies
  • No build order awareness

โœ… Using ref()

FROM {{ ref('stg_customers') }}

  • Resolves automatically
  • dbt builds the dependency graph
  • Correct build order guaranteed

๐Ÿง  The Golden Rule

Every time you reference another dbt model, use ref(). Every time you reference raw external data, use source(). Never hardcode table names. Period.

What ref() Does Behind the Scenes

When dbt sees {{ ref('stg_customers') }}, it does two things:

1๏ธโƒฃ
Resolves the Name

Replaces it with the full table path (e.g., analytics.dbt_fk.stg_customers)

2๏ธโƒฃ
Records the Dependency

Adds an edge to the DAG: "this model depends on stg_customers"

This second part is the real magic. By using ref(), you're not just writing SQL โ€” you're drawing a map of how all your models connect. dbt uses this map to figure out the perfect build order. It's like telling a GPS every road connection so it can find the fastest route!

๐ŸŒณ The DAG (Directed Acyclic Graph)

The DAG is like a family tree for your data. It shows who is the parent of whom. dbt uses this tree to figure out the cooking order โ€” you can't make a cake before you've mixed the batter!

The word "Acyclic" means no loops โ€” data flows in one direction, like a river flowing downhill. It never flows back upstream.

What Does a DAG Look Like?

Every source() and ref() call creates a connection. Together, they form a beautiful graph:

raw_customers raw_orders raw_payments stg_customers stg_orders stg_payments int_customer _orders int_order _payments fct_customer _ltv SOURCES STAGING INTERMEDIATE MARTS

How dbt reads this DAG:

1. Build stg_customers, stg_orders, stg_payments first (they only depend on sources)

2. Then build int_customer_orders and int_order_payments (they depend on staging models)

3. Finally build fct_customer_ltv (it depends on intermediate models)

dbt figures this out automatically from your ref() calls. You never have to manually specify the order!

See Your DAG: dbt docs generate

dbt comes with a built-in tool to visualize your entire DAG as an interactive website:

Terminal
# Generate the documentation site (including the lineage graph)
$ dbt docs generate

# Open it in your browser
$ dbt docs serve

Running docs generate...
Catalog written to /target/catalog.json
Serving docs at http://localhost:8080
Press Ctrl+C to exit.

The lineage graph in dbt docs is interactive โ€” you can click on any node to see its SQL, description, columns, and tests. It's like Google Maps for your data warehouse. Every data team should have this running!

๐Ÿ—๏ธ Building a Real DAG โ€” Step by Step

Let's build a complete pipeline from scratch. We'll track customer lifetime value (LTV) โ€” how much money each customer has spent in total. Watch how source() and ref() connect everything:

๐Ÿ“ฆ
raw_customers
source
โ†’
๐Ÿฅฌ
stg_customers
staging
โ†’
๐Ÿณ
int_customer
_orders
intermediate
โ†’
๐Ÿฝ๏ธ
fct_customer
_ltv
mart

Step 1: Declare the Source

models/staging/_sources.yml
version: 2

sources:
  - name: raw_shop
    database: raw_production
    schema: public
    tables:
      - name: customers
      - name: orders

Step 2: Staging Model (uses source())

models/staging/stg_customers.sql
SELECT
    id              AS customer_id,
    first_name,
    last_name,
    email_addr      AS email,
    created_at::DATE AS signup_date
FROM
    {{ source('raw_shop', 'customers') }}
    -- โ†‘ Points to RAW data (external)
models/staging/stg_orders.sql
SELECT
    order_id,
    customer_id,
    order_date,
    amount,
    status
FROM
    {{ source('raw_shop', 'orders') }}
    -- โ†‘ Points to RAW data (external)

Step 3: Intermediate Model (uses ref())

models/intermediate/int_customer_orders.sql
SELECT
    c.customer_id,
    c.first_name,
    c.email,
    o.order_id,
    o.order_date,
    o.amount
FROM
    {{ ref('stg_customers') }} c
    -- โ†‘ Points to ANOTHER dbt model
LEFT JOIN
    {{ ref('stg_orders') }} o
    -- โ†‘ Points to ANOTHER dbt model
    ON c.customer_id = o.customer_id

Step 4: Mart Model (uses ref())

models/marts/fct_customer_ltv.sql
SELECT
    customer_id,
    first_name,
    email,
    COUNT(order_id)  AS total_orders,
    SUM(amount)      AS lifetime_value,
    MIN(order_date)  AS first_order_date,
    MAX(order_date)  AS most_recent_order
FROM
    {{ ref('int_customer_orders') }}
    -- โ†‘ Points to the intermediate model
GROUP BY
    customer_id, first_name, email

Notice the pattern:

๐Ÿ“ฆ Sources โ†’ use source('name', 'table') โ€” raw external data

๐Ÿฅฌ Staging โ†’ reads from sources, referenced by ref('stg_...')

๐Ÿณ Intermediate โ†’ reads from staging via ref()

๐Ÿฝ๏ธ Marts โ†’ reads from intermediate via ref()

Data flows one direction only โ€” downstream. Never upstream. That's what makes it a DAG!

๐ŸŒ Advanced: Cross-Project Refs (dbt Mesh)

Imagine you run a chain of restaurants. Each restaurant has its own kitchen (dbt project), but sometimes the Italian restaurant needs the bread recipe from the French bakery. Cross-project refs let one project reference models from another project โ€” like sharing recipes across restaurants!

When your company grows, you might split one giant dbt project into multiple smaller ones. dbt Mesh (introduced in dbt 1.6+) lets these projects talk to each other:

Cross-project ref (dbt Mesh)
-- In the "marketing" project, reference a model from the "core" project

SELECT
    customer_id,
    lifetime_value,
    marketing_channel
FROM
    {{ ref('core', 'fct_customer_ltv') }}
    -- โ†‘ project name  โ†‘ model name
LEFT JOIN
    {{ ref('marketing_attribution') }}
    ON ...

๐Ÿ“ฆ Single Project

All models in one repo. Simple, but gets unwieldy at 500+ models.

ref('model_name')

๐ŸŒ dbt Mesh (Multi-Project)

Split into domain-specific projects. Each team owns their models.

ref('project', 'model_name')

You don't need dbt Mesh until your project has hundreds of models and multiple teams. For now, just know it exists. It's like knowing that highways exist even though you're still learning to ride a bicycle!

๐Ÿšซ Common Mistakes

Even experienced dbt developers trip on these. Learn them now so you don't have to learn them the hard way!

Mistake #1: Circular References

Model A references Model B, and Model B references Model A. This creates an infinite loop โ€” dbt can't figure out which to build first!

model_a.sql: SELECT * FROM {{ ref('model_b') }}
model_b.sql: SELECT * FROM {{ ref('model_a') }}

ERROR: Found a cycle: model_a โ†’ model_b โ†’ model_a

Fix: Restructure so data flows in one direction. If A and B both need the same data, create a shared parent model C that both reference.

Mistake #2: Forgetting to Declare Sources

You use {{ source('raw_shop', 'customers') }} in a model but never defined raw_shop in a YAML file.

Compilation Error: source 'raw_shop' was not found

Fix: Always create a _sources.yml file in your staging folder that declares every source.

Mistake #3: Hardcoding Instead of Using ref()

Writing FROM analytics.dbt_fk.stg_customers instead of FROM {{ ref('stg_customers') }}. The query works, but dbt doesn't know about the dependency โ€” so it might try to build your model before the staging model exists!

Fix: Search your project for any FROM clause that doesn't use ref() or source(). Replace them all.

Quick self-check: Open any model in your project. Does every FROM and JOIN clause use either ref() or source()? If you see a raw table name like database.schema.table, that's a bug waiting to happen!

๐Ÿง  Quick Quiz โ€” Test Your Understanding!

Let's see if you can connect the dots (pun intended). Click the answer you think is correct:

Question 1: When should you use source() vs ref()?

A) source() for everything, ref() is deprecated
B) source() for raw external data, ref() for other dbt models
C) ref() for raw data, source() for dbt models
D) They're interchangeable โ€” use whichever you prefer

Question 2: What does the DAG help dbt figure out?

A) Which warehouse to connect to
B) How much each model costs to run
C) The correct order to build models (dependencies)
D) Which columns to include in each model

Question 3: What happens if Model A refs Model B, and Model B refs Model A?

A) dbt builds them simultaneously
B) dbt picks one randomly to build first
C) dbt throws a circular dependency error
D) dbt ignores one of the references

Without looking at your notes, can you explain the difference between source() and ref() to a friend? Can you draw a simple DAG with 4 nodes and explain the build order? If yes, you've nailed this lesson!

  • source() โ€” Declares and references raw external data; defined in YAML
  • ref() โ€” References other dbt models; creates the dependency graph
  • Never hardcode table names โ€” always use source() or ref()
  • DAG โ€” Directed Acyclic Graph; dbt's map of model dependencies
  • Build order โ€” dbt automatically determines the correct sequence from the DAG
  • Source freshness โ€” Check if raw data is stale with dbt source freshness
  • Cross-project refs โ€” dbt Mesh for multi-project setups (advanced)
  • dbt docs serve โ€” Interactive lineage graph visualization
  • Specific freshness configuration options (warn_after, error_after)
  • The two-argument form of ref() for cross-project references
{PW}