Understanding your dbt project is like learning the layout of a new kitchen: once you know where everything is, cooking becomes effortless.
A dbt project is like a well-organized kitchen. Every drawer has a purpose, every shelf holds specific items. The spoons don't go in the fridge, and the milk doesn't go in the cutlery drawer!
Right now, you're standing at the kitchen door for the first time. Let's open every drawer and see what's inside. By the end of this lesson, you'll know exactly where everything goes, and why.
When you run dbt init my_project, dbt creates a folder structure that looks like this. Don't panic: we'll explain every single line.
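Here's a sketch of that layout (exact contents can vary slightly between dbt versions):

```text
my_project/
├── dbt_project.yml      # Project configuration (the "cover page")
├── models/              # SQL transformations (staging, intermediate, marts)
├── macros/              # Reusable Jinja/SQL snippets
├── seeds/               # Small CSV reference data
├── snapshots/           # History-tracking (SCD Type 2) definitions
├── tests/               # Custom SQL tests
└── analyses/            # Compiled-but-not-run ad-hoc queries
```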
Think of this folder tree as a map of your kitchen. Each folder is a different station, and each file is a specific tool or ingredient. Let's tour every station now.
This is like the cover page of your recipe book: it says what the book is called, who wrote it, and how the recipes are organized. Every dbt project must have this file. Without it, dbt doesn't even know it's looking at a dbt project!
This is the single most important file in your project. It lives at the root of your project folder and tells dbt everything it needs to know about your project. Let's look at a real example:
# -- What is this project called? --
name: 'my_online_store'        # Your project's unique name (no spaces!)
version: '1.0.0'               # Version number (like a book edition)

# -- Which warehouse are we cooking in? --
profile: 'my_online_store'     # Links to your profiles.yml (warehouse connection)

# -- Where are things stored? --
model-paths: ["models"]        # Where dbt looks for SQL models
analysis-paths: ["analyses"]   # Where ad-hoc queries live
test-paths: ["tests"]          # Where custom tests live
seed-paths: ["seeds"]          # Where CSV seed files live
macro-paths: ["macros"]        # Where reusable macros live
snapshot-paths: ["snapshots"]  # Where snapshot definitions live

# -- How should models be built by default? --
models:
  my_online_store:
    staging:
      +materialized: view      # Staging models = lightweight views
    intermediate:
      +materialized: view      # Intermediate = also views
    marts:
      +materialized: table     # Marts = full tables (fast for dashboards!)
| Setting | What It Does | Kitchen Analogy |
|---|---|---|
| name | Unique identifier for your project | The title on your recipe book cover |
| version | Tracks which edition of your project this is | "2nd Edition" on the book spine |
| profile | Links to your warehouse credentials in profiles.yml | Which kitchen (warehouse) you're cooking in |
| model-paths | Tells dbt where to find your SQL models | Which shelf has the recipe cards |
| +materialized | How dbt should build models (view, table, incremental) | Should we serve on a plate (table) or a tray (view)? |
The name field must match the top-level key under the models: section. If your project is called my_online_store, then the models config must also start with my_online_store:. Mismatch = errors!
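For reference, the profiles.yml file that the profile setting points to lives outside the project (typically in ~/.dbt/). A minimal sketch for a hypothetical Postgres warehouse might look like this; every credential value below is a placeholder:

```yaml
# ~/.dbt/profiles.yml (sketch; values are placeholders)
my_online_store:          # Must match the `profile:` in dbt_project.yml
  target: dev             # Which output to use by default
  outputs:
    dev:
      type: postgres
      host: localhost
      user: analytics
      password: "{{ env_var('DBT_PASSWORD') }}"  # Keep secrets out of the file
      port: 5432
      dbname: analytics
      schema: dbt_dev
      threads: 4
```

Keeping credentials in profiles.yml (outside the project) means you can commit dbt_project.yml to git without leaking secrets.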
This is the heart and soul of your dbt project. Every SQL file in this folder is a "model": a transformation that turns raw data into something useful. Most of your time as a dbt developer will be spent here.
Think of the models/ folder as the main cooking area of your restaurant kitchen. It has three stations, and every dish passes through them in order, just like an assembly line!
Professional dbt projects organize models into three layers. This isn't just a suggestion; it's the industry-standard best practice used by companies like Spotify, GitLab, and Shopify.
Restaurant analogy: making a Caesar salad.
Staging (Prep Station): Wash the lettuce, chop the croutons, grate the parmesan. You're just cleaning and renaming raw ingredients. No mixing yet!
Intermediate (Cooking Station): Toss the lettuce with dressing, mix in croutons and parmesan. You're combining prepped ingredients into something new.
Marts (Plating Station): Arrange everything beautifully on a plate, add a lemon wedge. This is the final dish that goes to the customer (your dashboard).
Staging models (prefix stg_) are your first touch on raw data. They do simple, boring-but-essential work: rename columns, cast data types, and apply light cleaning. One staging model per source table.
-- Staging model: clean up raw customer data
-- One staging model per source table. Keep it simple!
SELECT
id AS customer_id,
first_name,
last_name,
email_addr AS email, -- Rename for clarity
created_at::DATE AS signup_date -- Cast to date
FROM
{{ source('raw', 'customers') }}
Intermediate models (prefix int_) combine staging models. They join tables, aggregate data, and encode business logic. Think of them as the "cooking" step.
-- Intermediate: join customers with their orders
SELECT
o.order_id,
o.order_date,
o.amount,
c.customer_id,
c.first_name,
c.email
FROM
{{ ref('stg_orders') }} o
LEFT JOIN
{{ ref('stg_customers') }} c
ON o.customer_id = c.customer_id
Mart models (prefixes fct_ and dim_) are the final, polished tables that your dashboards and analysts query directly. They come in two flavors:
Fact tables (prefix fct_) record events that happened: orders placed, payments made, clicks recorded. They contain numbers you can count and sum.
Dimension tables (prefix dim_) hold descriptive attributes: customer details, product info, store locations. They describe the "who, what, where" of your facts.
-- Mart: final orders fact table for dashboards
SELECT
order_id,
customer_id,
order_date,
amount,
CASE
WHEN amount > 100 THEN 'high_value'
WHEN amount > 50 THEN 'medium_value'
ELSE 'low_value'
END AS order_tier
FROM
{{ ref('int_orders_joined') }}
| Prefix | Layer | Example | Purpose |
|---|---|---|---|
| stg_ | Staging | stg_customers | Clean raw data, rename columns |
| int_ | Intermediate | int_orders_joined | Join and combine staging models |
| fct_ | Marts (Fact) | fct_orders | Event/transaction tables with metrics |
| dim_ | Marts (Dimension) | dim_customers | Descriptive lookup tables |
Naming conventions aren't just cosmetic: they let anyone on your team instantly understand what a model does just by reading its name. stg_orders? That's a staging model for orders. fct_revenue? That's a final fact table for revenue. No guessing needed!
Macros are like those cooking shortcuts your grandma taught you: instead of writing the same 10 steps every time you make gravy, you just say "make gravy the grandma way" and everyone knows what to do. Write once, reuse everywhere!
A macro is a reusable piece of SQL written with Jinja (a templating language). You define it once in the macros/ folder and then call it from any model. This keeps your code DRY (Don't Repeat Yourself).
-- Macro: convert cents to dollars (reusable!)
{% macro cents_to_dollars(column_name) %}
(ROUND({{ column_name }} / 100.0, 2))
{% endmacro %}
Now you can use it in any model like this:
SELECT
order_id,
{{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM
{{ ref('stg_payments') }}
Without a macro: you write ROUND(amount_cents / 100.0, 2) in 15 different models. One day the boss says "use 4 decimal places," and you have to find and change all 15 files.
With a macro: you change the formula in one file (cents_to_dollars.sql), and all 15 models pick up the fix the next time they run.
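You can make that change even easier to absorb by giving the macro an argument with a default. This is a sketch, not part of the original macro; the decimal_places parameter is something we're adding for illustration:

```sql
-- Hypothetical variant: precision is now an argument with a default
{% macro cents_to_dollars(column_name, decimal_places=2) %}
    (ROUND({{ column_name }} / 100.0, {{ decimal_places }}))
{% endmacro %}
```

Existing calls like {{ cents_to_dollars('amount_cents') }} keep working unchanged, and when the boss wants 4 decimal places everywhere, you change the default in one place.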
Seeds are like the salt, pepper, and spices in your pantry: small, rarely changing reference data that you always keep handy. You don't buy new salt every day; it just sits there, ready when you need it.
Seeds are CSV files that dbt loads directly into your warehouse as tables. They're perfect for small, static reference data that doesn't come from a source system.
country_code,country_name,region
US,United States,North America
GB,United Kingdom,Europe
IN,India,Asia
AU,Australia,Oceania
BR,Brazil,South America
Run dbt seed and this CSV becomes a real table in your warehouse that you can join with any model using {{ ref('country_codes') }}.
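For example, a model could use the seed to enrich orders with region names. This is a sketch: it assumes a hypothetical stg_orders model that has a country_code column.

```sql
-- Hypothetical example: enrich orders with region info from the seed
SELECT
    o.order_id,
    o.amount,
    cc.country_name,
    cc.region
FROM
    {{ ref('stg_orders') }} o
LEFT JOIN
    {{ ref('country_codes') }} cc   -- The seed, referenced like any model
    ON o.country_code = cc.country_code
```

Note that seeds are referenced with ref(), exactly like models, so dbt tracks them in the dependency graph.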
Snapshots are like taking a photo of your fridge every day. Yesterday there were 6 eggs. Today there are 4. Tomorrow there might be 12 (you went shopping!). By looking at all the photos, you can see exactly what changed and when.
In the data world, this is called Slowly Changing Dimension Type 2 (SCD Type 2). Sounds scary, but it's simple: dbt keeps a history of how rows change over time.
{% snapshot snapshot_orders %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT * FROM {{ source('raw', 'orders') }}

{% endsnapshot %}
Imagine a customer changes their email address. Here's what the snapshot table looks like:
| customer_id | email | dbt_valid_from | dbt_valid_to |
|---|---|---|---|
| 42 | old@email.com | 2024-01-01 | 2024-06-15 |
| 42 | new@email.com | 2024-06-15 | null (current) |
The row with dbt_valid_to = null is always the current version. Previous versions have an end date, so you can see the full history. It's like a time machine for your data!
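A common downstream pattern (sketched here, not part of the original lesson) is to filter on that null end date whenever you only want the current version of each row:

```sql
-- Get only the current version of each order from the snapshot
SELECT *
FROM {{ ref('snapshot_orders') }}
WHERE dbt_valid_to IS NULL   -- null end date = the row that is valid right now
```

Drop the WHERE clause when you want the full history, e.g. to reconstruct what a record looked like on a given date.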
Tests are like the health inspector visiting your kitchen: they check that everything is safe, clean, and correct before any food reaches a customer. If something is wrong, the inspector shuts things down and tells you exactly what to fix.
The tests/ folder holds custom SQL tests: queries that should return zero rows if everything is correct. If any rows come back, the test fails (something is wrong!).
-- This test FAILS if any order has negative revenue
-- (A passing test returns zero rows)
SELECT
order_id,
amount
FROM
{{ ref('fct_orders') }}
WHERE
amount < 0
Two types of tests in dbt:
Schema tests (dbt also calls these generic tests): defined in YAML files (e.g., unique, not_null). Quick, declarative, no SQL needed.
Custom tests (singular tests): SQL files in the tests/ folder, for complex business logic that schema tests can't cover.
Run dbt test after every dbt run. Think of it as tasting your food before serving it. Never skip this step!
This is your notepad where you try new recipe ideas before adding them to the official recipe book. Maybe you want to experiment with a new sauce, so you scribble notes here first. If it works, you promote it to a real model. If not, no harm done!
Analysis files are SQL files that dbt compiles but does NOT run. They're perfect for one-off exploratory queries, audits, and sanity checks that you want version-controlled (with working ref() calls) but never materialized in the warehouse:
-- Quick check: monthly revenue trend
-- This compiles (so ref() works) but doesn't create a table
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count
FROM
{{ ref('fct_orders') }}
GROUP BY 1
ORDER BY 1 DESC
The key difference: models create tables/views in your warehouse. Analyses just compile SQL (resolving ref() and source()) so you can copy-paste the compiled query into your SQL editor. They're for humans, not for the warehouse.
Imagine you have a label maker in your kitchen. You stick labels on every jar: "Sugar", "Salt", "Flour (do NOT confuse with powdered sugar!)". The schema.yml file is your label maker: it describes every model, every column, and adds tests to make sure nothing goes wrong.
Schema YAML files (usually named _models.yml or schema.yml) live alongside your models and serve three purposes:
1. Describe what each model and column means
2. Define tests like unique, not_null, accepted_values
3. Define how models connect to each other
version: 2

models:
  - name: fct_orders
    description: "Final orders fact table. One row per order."
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique      # No duplicate order IDs!
          - not_null    # Every row must have an ID
      - name: amount
        description: "Order total in dollars"
        tests:
          - not_null
      - name: order_tier
        description: "Classification: high_value, medium_value, low_value"
        tests:
          - accepted_values:
              values: ['high_value', 'medium_value', 'low_value']

  - name: dim_customers
    description: "Customer dimension table. One row per customer."
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
How it all connects: When you run dbt test, dbt reads these YAML files and automatically generates SQL queries to check every test you defined. It's like having a health inspector who reads your labels and verifies every jar actually contains what it says.
Close your eyes and try to draw the dbt project folder tree from memory. Can you name all seven main folders and files and explain what each one does? Can you recall the four model prefixes: stg_ (staging), int_ (intermediate), fct_ (facts), and dim_ (dimensions)? If you can, you're ready for the next lesson!