How dbt knows which models depend on which, and builds them in the right order
Imagine your recipe says "use the tomatoes from the blue bowl on the third shelf." What if someone moves the bowl to the second shelf? Or paints it green? Everything breaks!
That's exactly what happens when you hardcode table names in SQL. dbt solves this with two magic words: source() and ref().
Here's what bad SQL looks like โ the kind written before dbt existed:
-- โ DON'T DO THIS โ hardcoded schema and table names!
SELECT
c.id AS customer_id,
c.first_name,
o.order_id,
o.amount
FROM
raw_production.public.customers c -- What if the schema changes?
LEFT JOIN
raw_production.public.orders o -- What if we move to a new database?
ON c.id = o.customer_id
| Scenario | What Happens | Pain Level |
|---|---|---|
| Schema name changes | Every query referencing it breaks | ๐ฑ High |
| Move to a new warehouse | Rewrite every single SQL file | ๐คฏ Extreme |
| Dev vs Prod environments | Can't easily switch between them | ๐ค Medium |
| Someone renames a table | No way to find all references | ๐ต High |
| Build order unknown | dbt has no idea which model to build first | ๐ Critical |
Hardcoding table names is the #1 mistake beginners make. It's like writing someone's street address on every letter instead of using their name โ if they move, all your letters go to the wrong place!
source() is like putting a label on each ingredient delivery: "These tomatoes came from Farm ABC, delivered today." Now dbt knows where every ingredient comes from, and it can check if the delivery is fresh!
A source is raw data that lands in your warehouse from external systems โ your app's database, Stripe payments, Google Analytics, etc. dbt doesn't create these tables; they already exist. You just need to tell dbt about them.
First, create a YAML file that lists all your raw data tables. This is like registering your ingredient suppliers:
version: 2
sources:
- name: raw_shop # A friendly name for this group
database: raw_production # The actual database name
schema: public # The actual schema name
description: "Raw data from our online shop's PostgreSQL database"
tables:
- name: customers
description: "One row per registered customer"
columns:
- name: id
description: "Primary key"
- name: orders
description: "One row per order placed"
loaded_at_field: _etl_loaded_at # For freshness checks!
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
- name: payments
description: "One row per payment transaction"
Now, instead of hardcoding raw_production.public.customers, you write:
SELECT
id AS customer_id,
first_name,
last_name,
email_addr AS email,
created_at::DATE AS signup_date
FROM
{{ source('raw_shop', 'customers') }}
-- โ friendly name โ table name
What happens under the hood: When dbt compiles this, it replaces {{ source('raw_shop', 'customers') }} with the actual table path raw_production.public.customers. If the schema ever changes, you update it in one place (the YAML file), and every model that uses this source is automatically fixed!
Remember the freshness block in the YAML above? Run this command to check if your raw data is up-to-date:
$ dbt source freshness
Running source freshness checks...
PASS source raw_shop.customers # Last loaded 2 hours ago โ
WARN source raw_shop.orders # Last loaded 14 hours ago โ ๏ธ
ERROR source raw_shop.payments # Last loaded 30 hours ago โ
Source freshness is like checking the expiration date on your milk. If the data is too old, something is probably broken upstream (maybe your ETL pipeline stopped running). Catch it before your dashboards show stale numbers!
ref() is like saying "use the chopped tomatoes from the prep station" instead of "use the tomatoes from the blue bowl on shelf 3." If the prep station moves to a different counter, dbt automatically updates the reference. You never have to worry about where something is โ just what it's called.
While source() points to raw external data, ref() points to other dbt models. It's how models talk to each other.
-- This model DEPENDS on stg_customers and stg_orders
-- ref() tells dbt about these dependencies!
SELECT
c.customer_id,
c.first_name,
c.email,
o.order_id,
o.order_date,
o.amount
FROM
{{ ref('stg_customers') }} c
LEFT JOIN
{{ ref('stg_orders') }} o
ON c.customer_id = o.customer_id
FROM analytics.public.stg_customers
FROM {{ ref('stg_customers') }}
๐ง The Golden Rule
Every time you reference another dbt model, use ref(). Every time you reference raw external data, use source(). Never hardcode table names. Period.
When dbt sees {{ ref('stg_customers') }}, it does two things:
Replaces it with the full table path (e.g., analytics.dbt_fk.stg_customers)
Adds an edge to the DAG: "this model depends on stg_customers"
This second part is the real magic. By using ref(), you're not just writing SQL โ you're drawing a map of how all your models connect. dbt uses this map to figure out the perfect build order. It's like telling a GPS every road connection so it can find the fastest route!
The DAG is like a family tree for your data. It shows who is the parent of whom. dbt uses this tree to figure out the cooking order โ you can't make a cake before you've mixed the batter!
The word "Acyclic" means no loops โ data flows in one direction, like a river flowing downhill. It never flows back upstream.
Every source() and ref() call creates a connection. Together, they form a beautiful graph:
How dbt reads this DAG:
1. Build stg_customers, stg_orders, stg_payments first (they only depend on sources)
2. Then build int_customer_orders and int_order_payments (they depend on staging models)
3. Finally build fct_customer_ltv (it depends on intermediate models)
dbt figures this out automatically from your ref() calls. You never have to manually specify the order!
dbt comes with a built-in tool to visualize your entire DAG as an interactive website:
# Generate the documentation site (including the lineage graph)
$ dbt docs generate
# Open it in your browser
$ dbt docs serve
Running docs generate...
Catalog written to /target/catalog.json
Serving docs at http://localhost:8080
Press Ctrl+C to exit.
The lineage graph in dbt docs is interactive โ you can click on any node to see its SQL, description, columns, and tests. It's like Google Maps for your data warehouse. Every data team should have this running!
Let's build a complete pipeline from scratch. We'll track customer lifetime value (LTV) โ how much money each customer has spent in total. Watch how source() and ref() connect everything:
version: 2
sources:
- name: raw_shop
database: raw_production
schema: public
tables:
- name: customers
- name: orders
SELECT
id AS customer_id,
first_name,
last_name,
email_addr AS email,
created_at::DATE AS signup_date
FROM
{{ source('raw_shop', 'customers') }}
-- โ Points to RAW data (external)
SELECT
order_id,
customer_id,
order_date,
amount,
status
FROM
{{ source('raw_shop', 'orders') }}
-- โ Points to RAW data (external)
SELECT
c.customer_id,
c.first_name,
c.email,
o.order_id,
o.order_date,
o.amount
FROM
{{ ref('stg_customers') }} c
-- โ Points to ANOTHER dbt model
LEFT JOIN
{{ ref('stg_orders') }} o
-- โ Points to ANOTHER dbt model
ON c.customer_id = o.customer_id
SELECT
customer_id,
first_name,
email,
COUNT(order_id) AS total_orders,
SUM(amount) AS lifetime_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS most_recent_order
FROM
{{ ref('int_customer_orders') }}
-- โ Points to the intermediate model
GROUP BY
customer_id, first_name, email
Notice the pattern:
๐ฆ Sources โ use source('name', 'table') โ raw external data
๐ฅฌ Staging โ reads from sources, referenced by ref('stg_...')
๐ณ Intermediate โ reads from staging via ref()
๐ฝ๏ธ Marts โ reads from intermediate via ref()
Data flows one direction only โ downstream. Never upstream. That's what makes it a DAG!
Imagine you run a chain of restaurants. Each restaurant has its own kitchen (dbt project), but sometimes the Italian restaurant needs the bread recipe from the French bakery. Cross-project refs let one project reference models from another project โ like sharing recipes across restaurants!
When your company grows, you might split one giant dbt project into multiple smaller ones. dbt Mesh (introduced in dbt 1.6+) lets these projects talk to each other:
-- In the "marketing" project, reference a model from the "core" project
SELECT
customer_id,
lifetime_value,
marketing_channel
FROM
{{ ref('core', 'fct_customer_ltv') }}
-- โ project name โ model name
LEFT JOIN
{{ ref('marketing_attribution') }}
ON ...
All models in one repo. Simple, but gets unwieldy at 500+ models.
ref('model_name')
Split into domain-specific projects. Each team owns their models.
ref('project', 'model_name')
You don't need dbt Mesh until your project has hundreds of models and multiple teams. For now, just know it exists. It's like knowing that highways exist even though you're still learning to ride a bicycle!
Even experienced dbt developers trip on these. Learn them now so you don't have to learn them the hard way!
Mistake #1: Circular References
Model A references Model B, and Model B references Model A. This creates an infinite loop โ dbt can't figure out which to build first!
model_a.sql: SELECT * FROM {{ ref('model_b') }}
model_b.sql: SELECT * FROM {{ ref('model_a') }}
ERROR: Found a cycle: model_a โ model_b โ model_a
Fix: Restructure so data flows in one direction. If A and B both need the same data, create a shared parent model C that both reference.
Mistake #2: Forgetting to Declare Sources
You use {{ source('raw_shop', 'customers') }} in a model but never defined raw_shop in a YAML file.
Compilation Error: source 'raw_shop' was not found
Fix: Always create a _sources.yml file in your staging folder that declares every source.
Mistake #3: Hardcoding Instead of Using ref()
Writing FROM analytics.dbt_fk.stg_customers instead of FROM {{ ref('stg_customers') }}. The query works, but dbt doesn't know about the dependency โ so it might try to build your model before the staging model exists!
Fix: Search your project for any FROM clause that doesn't use ref() or source(). Replace them all.
Quick self-check: Open any model in your project. Does every FROM and JOIN clause use either ref() or source()? If you see a raw table name like database.schema.table, that's a bug waiting to happen!
Let's see if you can connect the dots (pun intended). Click the answer you think is correct:
Without looking at your notes, can you explain the difference between source() and ref() to a friend? Can you draw a simple DAG with 4 nodes and explain the build order? If yes, you've nailed this lesson!
dbt source freshness