LESSON 4 · BEGINNER

๐Ÿ“ dbt Project Structure โ€” Every File Explained

Understanding your dbt project is like learning the layout of a new kitchen — once you know where everything is, cooking becomes effortless.

🏠 The Big Picture

A dbt project is like a well-organized kitchen. Every drawer has a purpose, every shelf holds specific items. The spoons don't go in the fridge, and the milk doesn't go in the cutlery drawer!

Right now, you're standing at the kitchen door for the first time. Let's open every drawer and see what's inside. By the end of this lesson, you'll know exactly where everything goes — and why.

Your Entire dbt Project at a Glance

When you run dbt init my_project, dbt creates a folder structure that looks like this. Don't panic — we'll explain every single line:

๐Ÿ“ my_dbt_project/ ๐Ÿ“„ dbt_project.yml  โ† The recipe book cover ๐Ÿ“„ packages.yml  โ† Your shopping list for add-ons ๐Ÿ“ models/  โ† Where the magic happens (SQL files) ๐Ÿ“ staging/  โ† Prep station (wash & chop) ๐Ÿ“„ stg_customers.sql ๐Ÿ“„ stg_orders.sql ๐Ÿ“„ _stg_models.yml ๐Ÿ“ intermediate/  โ† Cooking station (combine) ๐Ÿ“„ int_orders_joined.sql ๐Ÿ“ marts/  โ† Plating station (serve!) ๐Ÿ“„ fct_orders.sql ๐Ÿ“„ dim_customers.sql ๐Ÿ“„ _mart_models.yml ๐Ÿ“ macros/  โ† Secret recipes (reusable SQL) ๐Ÿ“„ cents_to_dollars.sql ๐Ÿ“ seeds/  โ† Pantry staples (CSV data) ๐Ÿ“„ country_codes.csv ๐Ÿ“ snapshots/  โ† Food diary (point-in-time) ๐Ÿ“„ snapshot_orders.sql ๐Ÿ“ tests/  โ† Quality control (custom tests) ๐Ÿ“„ assert_positive_revenue.sql ๐Ÿ“ analyses/  โ† Scratch pad (ad-hoc queries) ๐Ÿ“„ monthly_revenue_check.sql

Spot the file-type pattern: .sql files hold transformations, .yml files hold configuration and tests, and .csv files hold data. You'll see this pattern everywhere in dbt!

Think of this folder tree as a map of your kitchen. Each folder is a different station, and each file is a specific tool or ingredient. Let's tour every station now.

📖 dbt_project.yml — The Recipe Book Cover

This is like the cover page of your recipe book — it says what the book is called, who wrote it, and how the recipes are organized. Every dbt project must have this file. Without it, dbt doesn't even know it's looking at a dbt project!

This is the single most important file in your project. It lives at the root of your project folder and tells dbt everything it needs to know about your project. Let's look at a real example:

dbt_project.yml
# ── What is this project called? ──
name: 'my_online_store'       # Your project's unique name (no spaces!)
version: '1.0.0'              # Version number (like a book edition)

# ── Which kitchen are we cooking in? ──
profile: 'my_online_store'    # Links to your profiles.yml (warehouse connection)

# ── Where are things stored? ──
model-paths: ["models"]       # Where dbt looks for SQL models
analysis-paths: ["analyses"]  # Where ad-hoc queries live
test-paths: ["tests"]         # Where custom tests live
seed-paths: ["seeds"]         # Where CSV seed files live
macro-paths: ["macros"]       # Where reusable macros live
snapshot-paths: ["snapshots"] # Where snapshot definitions live

# ── How should models be built by default? ──
models:
  my_online_store:
    staging:
      +materialized: view      # Staging models = lightweight views
    intermediate:
      +materialized: view      # Intermediate = also views
    marts:
      +materialized: table     # Marts = full tables (fast for dashboards!)

Key Settings Explained

| Setting       | What It Does                                             | Kitchen Analogy                                        |
|---------------|----------------------------------------------------------|--------------------------------------------------------|
| name          | Unique identifier for your project                       | The title on your recipe book cover                    |
| version       | Tracks which edition of your project this is             | "2nd Edition" on the book spine                        |
| profile       | Links to your warehouse credentials in profiles.yml      | Which kitchen (warehouse) you're cooking in            |
| model-paths   | Tells dbt where to find your SQL models                  | Which shelf has the recipe cards                       |
| +materialized | How dbt should build models (view, table, incremental)   | Should we serve on a plate (table) or a tray (view)?   |

The name field must match the top-level key under the models: section. If your project is called my_online_store, then the models: config must also start with my_online_store:. Mismatch = errors!
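These folder-level defaults are only defaults: any individual model can override them with a config() block at the top of its file, which takes precedence over dbt_project.yml. A minimal sketch (the model and source names here are hypothetical):

```sql
-- models/staging/stg_payments.sql (hypothetical model)
-- Built as a table even though the staging folder defaults to view
{{ config(materialized='table') }}

SELECT
    id AS payment_id,
    amount_cents
FROM
    {{ source('raw', 'payments') }}
```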

✨ models/ — Where the Magic Happens

This is the heart and soul of your dbt project. Every SQL file in this folder is a "model" — a transformation that turns raw data into something useful. Most of your time as a dbt developer will be spent here.

Think of the models/ folder as the main cooking area of your restaurant kitchen. It has three stations, and every dish passes through them in order — just like an assembly line!

The 3-Layer Architecture

Professional dbt projects organize models into three layers. This isn't just a suggestion — it's the industry-standard best practice used by companies like Spotify, GitLab, and Shopify.

🥬 Staging (Prep Station: wash, chop, rename)  →  🍳 Intermediate (Cooking Station: combine, join, enrich)  →  🍽️ Marts (Plating Station: ready to serve!)

Restaurant analogy — making a Caesar salad:

🥬 Staging (Prep Station) — Wash the lettuce, chop the croutons, grate the parmesan. You're just cleaning and renaming raw ingredients. No mixing yet!

🍳 Intermediate (Cooking Station) — Toss the lettuce with dressing, mix in croutons and parmesan. You're combining prepped ingredients into something new.

🍽️ Marts (Plating Station) — Arrange beautifully on a plate, add a lemon wedge. This is the final dish that goes to the customer (your dashboard).

🥬 Layer 1: Staging Models (stg_)

Staging models are your first touch on raw data. They do simple, boring-but-essential work: rename columns, cast data types, and apply light cleaning. One staging model per source table.

models/staging/stg_customers.sql
-- Staging model: clean up raw customer data
-- One staging model per source table. Keep it simple!

SELECT
    id          AS customer_id,
    first_name,
    last_name,
    email_addr  AS email,           -- Rename for clarity
    created_at::DATE AS signup_date  -- Cast to date
FROM
    {{ source('raw', 'customers') }}

๐Ÿณ Layer 2: Intermediate Models (int_)

Intermediate models combine staging models. They join tables, aggregate data, and create business logic. Think of them as the "cooking" step.

models/intermediate/int_orders_joined.sql
-- Intermediate: join customers with their orders

SELECT
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_id,
    c.first_name,
    c.email
FROM
    {{ ref('stg_orders') }} o
LEFT JOIN
    {{ ref('stg_customers') }} c
    ON o.customer_id = c.customer_id

๐Ÿฝ๏ธ Layer 3: Mart Models (fct_ and dim_)

Mart models are the final, polished tables that your dashboards and analysts query directly. They come in two flavors:

📊 Fact Tables (fct_)

Events that happened — orders placed, payments made, clicks recorded. They have numbers you can count and sum.

👤 Dimension Tables (dim_)

Descriptive attributes — customer details, product info, store locations. They describe the "who, what, where" of your facts.

models/marts/fct_orders.sql
-- Mart: final orders fact table for dashboards

SELECT
    order_id,
    customer_id,
    order_date,
    amount,
    CASE
        WHEN amount > 100 THEN 'high_value'
        WHEN amount > 50  THEN 'medium_value'
        ELSE 'low_value'
    END AS order_tier
FROM
    {{ ref('int_orders_joined') }}
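The folder tree also lists dim_customers.sql, which the lesson doesn't show. A minimal dimension model could be as simple as this sketch, reusing the columns from stg_customers above:

```sql
-- Mart: customer dimension table (illustrative sketch)
-- One row per customer, descriptive attributes only

SELECT
    customer_id,
    first_name,
    last_name,
    email,
    signup_date
FROM
    {{ ref('stg_customers') }}
```

Notice there's no aggregation or CASE logic here: dimensions describe things, while facts measure events.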

Naming Conventions Cheat Sheet

| Prefix | Layer             | Example           | Purpose                                |
|--------|-------------------|-------------------|----------------------------------------|
| stg_   | Staging           | stg_customers     | Clean raw data, rename columns         |
| int_   | Intermediate      | int_orders_joined | Join and combine staging models        |
| fct_   | Marts (Fact)      | fct_orders        | Event/transaction tables with metrics  |
| dim_   | Marts (Dimension) | dim_customers     | Descriptive lookup tables              |

Naming conventions aren't just cosmetic — they let anyone on your team instantly understand what a model does just by reading its name. stg_orders? That's a staging model for orders. fct_revenue? That's a final fact table for revenue. No guessing needed!

🧙 macros/ — Your Secret Recipes

Macros are like those cooking shortcuts your grandma taught you — instead of writing the same 10 steps every time you make gravy, you just say "make gravy the grandma way" and everyone knows what to do. Write once, reuse everywhere!

A macro is a reusable piece of SQL written with Jinja (a templating language). You define it once in the macros/ folder and then call it from any model. This keeps your code DRY (Don't Repeat Yourself).

macros/cents_to_dollars.sql
-- Macro: convert cents to dollars (reusable!)

{% macro cents_to_dollars(column_name) %}
    (ROUND({{ column_name }} / 100.0, 2))
{% endmacro %}

Now you can use it in any model like this:

Using the macro in a model
SELECT
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM
    {{ ref('stg_payments') }}

Without a macro: You write ROUND(amount_cents / 100.0, 2) in 15 different models. One day the boss says "use 4 decimal places." You have to find and change all 15 files. 😱

With a macro: You change the formula in one file (cents_to_dollars.sql), and all 15 models are instantly updated. 🎉
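If you're curious what the warehouse actually receives, dbt expands the macro at compile time. The compiled SQL for the model above looks roughly like this (the analytics schema name is illustrative; yours depends on your profile settings):

```sql
-- Compiled output (illustrative: schema/table names will differ)
SELECT
    order_id,
    (ROUND(amount_cents / 100.0, 2)) AS amount_dollars
FROM
    analytics.stg_payments
```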

🧂 seeds/ — Your Pantry Staples

Seeds are like the salt, pepper, and spices in your pantry — small, rarely changing reference data that you always keep handy. You don't buy new salt every day; it just sits there, ready when you need it.

Seeds are CSV files that dbt loads directly into your warehouse as tables. They're perfect for small, static reference data that doesn't come from a source system.

seeds/country_codes.csv
country_code,country_name,region
US,United States,North America
GB,United Kingdom,Europe
IN,India,Asia
AU,Australia,Oceania
BR,Brazil,South America

Run dbt seed and this CSV becomes a real table in your warehouse that you can join with any model using {{ ref('country_codes') }}.
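From there, the seed behaves like any other model. As a sketch, a model could enrich customers with region names, assuming a country_code column exists on stg_customers (that column is hypothetical, not part of the earlier example):

```sql
-- Illustrative: join customers to the country_codes seed
SELECT
    c.customer_id,
    c.first_name,
    cc.country_name,
    cc.region
FROM
    {{ ref('stg_customers') }} c
LEFT JOIN
    {{ ref('country_codes') }} cc
    ON c.country_code = cc.country_code   -- country_code column is assumed
```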

When to Use Seeds vs. Sources

🌱 Use Seeds When...

  • Data is small (< 1,000 rows)
  • Data rarely changes
  • Data doesn't come from a source system
  • Examples: country codes, status mappings, category labels

📡 Use Sources When...

  • Data is large or growing
  • Data changes frequently
  • Data comes from apps, APIs, or databases
  • Examples: orders, users, transactions, events

📸 snapshots/ — Your Food Diary

Snapshots are like taking a photo of your fridge every day. Yesterday there were 6 eggs. Today there are 4. Tomorrow there might be 12 (you went shopping!). By looking at all the photos, you can see exactly what changed and when.

In the data world, this is called Slowly Changing Dimension Type 2 (SCD Type 2). Sounds scary, but it's simple: dbt keeps a history of how rows change over time.

snapshots/snapshot_orders.sql
{% snapshot snapshot_orders %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT * FROM {{ source('raw', 'orders') }}

{% endsnapshot %}

How SCD Type 2 Works — Visually

Imagine a customer changes their email address. Here's what the snapshot table looks like:

| customer_id | email         | dbt_valid_from | dbt_valid_to   |
|-------------|---------------|----------------|----------------|
| 42          | old@email.com | 2024-01-01     | 2024-06-15     |
| 42          | new@email.com | 2024-06-15     | null (current) |

The row with dbt_valid_to = null is always the current version. Previous versions have an end date, so you can see the full history. It's like a time machine for your data!
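That null convention also makes "give me the current state" a one-line filter. A sketch, assuming the snapshot_orders snapshot above has been built:

```sql
-- Current version of every order: the row whose validity is still open
SELECT *
FROM {{ ref('snapshot_orders') }}
WHERE dbt_valid_to IS NULL
```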

🧪 tests/ — Quality Control

Tests are like the health inspector visiting your kitchen — they check that everything is safe, clean, and correct before any food reaches a customer. If something is wrong, the inspector shuts things down and tells you exactly what to fix.

The tests/ folder holds custom SQL tests — queries that should return zero rows if everything is correct. If any rows come back, the test fails (something is wrong!).

tests/assert_positive_revenue.sql
-- This test FAILS if any order has negative revenue
-- (A passing test returns zero rows)

SELECT
    order_id,
    amount
FROM
    {{ ref('fct_orders') }}
WHERE
    amount < 0

Two types of tests in dbt:

📋 Schema tests — Defined in YAML files (e.g., unique, not_null). Quick, declarative, no SQL needed.

📝 Custom tests — SQL files in the tests/ folder. For complex business logic that schema tests can't cover.

Run dbt test after every dbt run. Think of it as tasting your food before serving it. Never skip this step!

๐Ÿ“ analyses/ โ€” Your Scratch Pad

This is your notepad where you try new recipe ideas before adding them to the official recipe book. Maybe you want to experiment with a new sauce โ€” you scribble notes here first. If it works, you promote it to a real model. If not, no harm done!

Analysis files are SQL files that dbt compiles but does NOT run. They're perfect for one-off questions, sanity checks, and exploratory queries you want to keep in version control:

analyses/monthly_revenue_check.sql
-- Quick check: monthly revenue trend
-- This compiles (so ref() works) but doesn't create a table

SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS total_revenue,
    COUNT(*) AS order_count
FROM
    {{ ref('fct_orders') }}
GROUP BY 1
ORDER BY 1 DESC

The key difference: models create tables/views in your warehouse. Analyses just compile SQL (resolving ref() and source()) so you can copy-paste the compiled query into your SQL editor. They're for humans, not for the warehouse.

๐Ÿท๏ธ schema.yml โ€” The Label Maker

Imagine you have a label maker in your kitchen. You stick labels on every jar: "Sugar", "Salt", "Flour โ€” do NOT confuse with powdered sugar!" The schema.yml file is your label maker โ€” it describes every model, every column, and adds tests to make sure nothing goes wrong.

Schema YAML files (usually named _models.yml or schema.yml) live alongside your models and serve three purposes:

  • 📖 Documentation — Describe what each model and column means
  • 🧪 Testing — Define tests like unique, not_null, accepted_values
  • 🔗 Relationships — Define how models connect to each other

models/marts/_mart_models.yml
version: 2

models:
  - name: fct_orders
    description: "Final orders fact table. One row per order."
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique            # No duplicate order IDs!
          - not_null          # Every row must have an ID

      - name: amount
        description: "Order total in dollars"
        tests:
          - not_null

      - name: order_tier
        description: "Classification: high_value, medium_value, low_value"
        tests:
          - accepted_values:
              values: ['high_value', 'medium_value', 'low_value']

  - name: dim_customers
    description: "Customer dimension table. One row per customer."
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

How it all connects: When you run dbt test, dbt reads these YAML files and automatically generates SQL queries to check every test you defined. It's like having a health inspector who reads your labels and verifies every jar actually contains what it says.
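For example, the unique test on order_id compiles to a query in the same "zero rows = pass" style as custom tests. Roughly, under the caveat that the exact SQL dbt generates varies by version:

```sql
-- Approximately what dbt runs for `unique` on fct_orders.order_id:
-- any row returned is a duplicated key, so the test fails
SELECT
    order_id,
    COUNT(*) AS n_records
FROM {{ ref('fct_orders') }}
GROUP BY order_id
HAVING COUNT(*) > 1
```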

🧠 Quick Quiz — Test Your Understanding!

Let's see if you can navigate the dbt kitchen with your eyes closed. Pick the answer you think is correct:

Question 1: Where do your SQL transformation files live?

A) seeds/
B) models/
C) macros/
D) analyses/

Question 2: What naming prefix indicates a staging model?

A) fct_
B) dim_
C) stg_
D) raw_

Question 3: What is the purpose of the seeds/ folder?

A) Store your SQL transformation models
B) Store reusable macros
C) Store small, static CSV reference data
D) Store point-in-time snapshots

Close your eyes and try to draw the dbt project folder tree from memory. Can you name all six main folders and explain what each one does? If you can, you're ready for the next lesson!

  • dbt_project.yml — The main config file; every project must have one
  • models/ — SQL transformations organized in 3 layers: staging → intermediate → marts
  • Naming: stg_ (staging), int_ (intermediate), fct_ (facts), dim_ (dimensions)
  • macros/ — Reusable SQL snippets (write once, use everywhere)
  • seeds/ — Small CSV reference data loaded into the warehouse
  • tests/ — Custom SQL tests that should return zero rows when passing
  • schema.yml — Describes models, columns, and defines tests
  • snapshots/ — SCD Type 2 history tracking (important but advanced)
  • analyses/ — Ad-hoc queries that compile but don't create tables
  • packages.yml — Third-party package management (covered in a later lesson)
  • The specific materialization settings in dbt_project.yml (covered in Models lesson)