Auto-generated docs that never go stale. Plus the magical lineage graph.
Imagine starting a new job and finding 500 SQL files with no explanation. No comments. No descriptions. No clue what anything does. You'd feel like you walked into a library where every book cover is blank.
Documentation is like leaving a trail of breadcrumbs so the next person (or future you in 6 months) doesn't get lost in the forest of SQL files.
Every data team should ask themselves this uncomfortable question:
If the person who built this model disappears tomorrow, can someone else understand it?
If the answer is "no", you have a documentation problem. And it's not a matter of if someone will leave, it's when. People switch jobs, go on vacation, or simply forget what they built 3 months ago.
- Teams with good docs onboard new members 5x faster
- Fewer Slack messages asking about model logic
- dbt auto-generates docs from your actual code
The magic of dbt documentation: Unlike a Google Doc or Confluence page that someone writes once and never updates, dbt docs are generated from your actual project. Column names, data types, test results: all pulled directly from the code. The docs can never drift from reality because they are reality.
schema.yml is like the label on a food container in your fridge. It tells you what's inside, when it was made, and what ingredients were used. Without the label, you're opening mystery containers and hoping for the best.
You've already seen YAML files for tests and sources. The same files are where you add descriptions: the human-readable explanations that turn cryptic column names into understandable documentation.
At the top level, you describe what the model does and why it exists:
```yaml
version: 2

models:
  - name: fct_customer_ltv
    description: >
      Customer lifetime value (LTV) fact table.
      One row per customer, showing their total spend,
      order count, and first/last order dates.
      Used by the Marketing team for segmentation
      and the Finance team for revenue forecasting.
```
Go deeper and describe every column so anyone can understand the data without reading SQL:
```yaml
    columns:
      - name: customer_id
        description: "Unique identifier for each customer. Sourced from raw_shop.customers.id"
        tests:
          - unique
          - not_null
      - name: lifetime_value
        description: >
          Total amount (in USD) the customer has spent across all orders.
          Calculated as SUM(order_amount). NULL if the customer has
          never placed an order.
      - name: first_order_date
        description: "Date of the customer's very first order. NULL if never ordered."
      - name: total_orders
        description: "Count of distinct orders placed by this customer. 0 if never ordered."
        tests:
          - not_null
```
Good descriptions answer three questions:
1. What is this? (a customer ID, a dollar amount, a date)
2. Where does it come from? (sourced from X table, calculated as Y)
3. What are the edge cases? (NULL if never ordered, 0 for new customers)
Here's a complete, annotated YAML file showing models, columns, descriptions, and tests all working together:
```yaml
version: 2

models:
  - name: stg_customers
    description: "Cleaned customer data. One row per customer. Renames raw columns to standard naming."
    columns:
      - name: customer_id
        description: "Primary key. Maps to raw_shop.customers.id"
        tests: [unique, not_null]
      - name: email
        description: "Customer email address. Renamed from email_addr."
      - name: signup_date
        description: "Date the customer registered. Cast from timestamp to date."

  - name: stg_orders
    description: "Cleaned order data. One row per order. Filters out cancelled orders."
    columns:
      - name: order_id
        description: "Primary key for orders"
        tests: [unique, not_null]
      - name: customer_id
        description: "Foreign key to stg_customers"
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```
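By the way, if you want these descriptions to show up inside the warehouse itself (as table and column comments in the Snowflake or BigQuery UI), dbt's persist_docs config can write them there. A minimal sketch for dbt_project.yml; `my_project` is a placeholder for your project name, and adapter support varies:

```yaml
# dbt_project.yml
models:
  my_project:           # placeholder: your project name
    +persist_docs:
      relation: true    # persist model descriptions as table comments
      columns: true     # persist column descriptions as column comments
```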
Think of schema.yml as a nutrition label:
- Model name = Product name ("Organic Tomato Soup")
- Model description = What it is ("A hearty soup made from...")
- Column descriptions = Ingredients list (each ingredient explained)
- Tests = Quality certifications (organic, non-GMO, etc.)
Doc blocks are like templates. Instead of writing the same description for customer_id in 20 different YAML files, you write it once and reference it everywhere. It's like having a dictionary: you define a word once, and everyone looks it up when they need it.
When you have columns that appear in many models (like customer_id, created_at, or amount), writing the same description over and over is tedious and error-prone. Doc blocks solve this.
Create a .md file anywhere in your models/ directory:
```
{% docs customer_id %}
Unique identifier for a customer. This is the primary key
in `stg_customers` and appears as a foreign key in most
downstream models.

**Source:** `raw_shop.customers.id`
**Type:** INTEGER
**Example:** 10042
{% enddocs %}

{% docs amount_usd %}
A monetary amount in US Dollars (USD). Always stored as
DECIMAL(10,2). Negative values indicate refunds.

**Example:** 49.99, -12.50
{% enddocs %}

{% docs created_at_date %}
The date a record was created, cast from the original
timestamp to DATE. Timezone is UTC.
{% enddocs %}
```
Now use `{{ doc('block_name') }}` in your descriptions:
```yaml
version: 2

models:
  - name: fct_customer_ltv
    description: "Customer lifetime value fact table"
    columns:
      - name: customer_id
        description: "{{ doc('customer_id') }}"   # pulled from the markdown file!
      - name: lifetime_value
        description: "{{ doc('amount_usd') }}"
      - name: signup_date
        description: "{{ doc('created_at_date') }}"
```
**Without doc blocks:** the same description is copy-pasted in 20 files. Update one? You have to find and update all 20.

```
description: "Unique customer ID..."
description: "Unique customer ID..."
description: "Unique customer ID..."
...× 20 files
```

**With doc blocks:** define once in a .md file. Reference everywhere. Update once, and the change propagates automatically.

```
description: "{{ doc('customer_id') }}"
description: "{{ doc('customer_id') }}"
description: "{{ doc('customer_id') }}"
...all point to one source
```
Doc blocks support full Markdown: bold, italic, links, lists, even tables. Your documentation can be as rich as a wiki page, but it lives right next to your code and is always in sync.
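For instance, a doc block can hold a small reference table. Here's a hypothetical order_status block (the name and values are illustrative, not from this project):

```
{% docs order_status %}
The current state of an order.

| Value      | Meaning                   |
|------------|---------------------------|
| `pending`  | Placed but not yet paid   |
| `shipped`  | Handed off to the carrier |
| `returned` | Sent back by the customer |
{% enddocs %}
```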
It's like pressing a magic button that turns all your recipe cards, ingredient labels, and cooking notes into a beautiful cookbook website. Two commands, and you get a fully searchable, interactive documentation site with no extra work needed!
```
$ dbt docs generate
Running with dbt=1.7.0
Found 12 models, 6 sources, 24 tests, 3 doc blocks
Building catalog...
Catalog written to /target/catalog.json
Manifest written to /target/manifest.json

$ dbt docs serve
Serving docs at http://localhost:8080
Press Ctrl+C to exit.
```
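If something else is already listening on port 8080, you can serve on a different one with the --port flag:

```
$ dbt docs serve --port 8001
Serving docs at http://localhost:8001
```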
Open your browser and you'll see a full documentation website โ automatically generated from your project!
Under the hood, three files power the site:
- `manifest.json`: your entire project: models, sources, tests, macros, dependencies. Everything dbt knows.
- `catalog.json`: column names, data types, row counts, pulled directly from your warehouse.
- `index.html`: a single-page app that combines both JSON files into a beautiful, searchable UI.
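Since these are plain JSON files, you can inspect them directly. A quick sketch with jq (assuming it's installed; the output shown is illustrative) that lists every model recorded in the manifest:

```
$ jq -r '.nodes | keys[] | select(startswith("model."))' target/manifest.json
model.my_project.stg_customers
model.my_project.stg_orders
model.my_project.fct_customer_ltv
```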
| Feature | What It Shows | Why It's Useful |
|---|---|---|
| Model list | Every model with its description | Find any model instantly |
| Column details | Name, type, description, tests | Understand data without reading SQL |
| Source info | Raw tables, freshness, schemas | Know where data originates |
| Compiled SQL | The actual SQL that runs | Debug without running dbt compile |
| Lineage graph | Visual DAG of all dependencies | See how data flows end-to-end |
| Search | Full-text search across everything | Find anything in seconds |
Think of it this way: You know how Google Maps doesn't just show you streets, it shows restaurants, gas stations, traffic, and reviews? dbt docs is like Google Maps for your data warehouse. It shows you every table, every column, every relationship, and every test, all in one interactive place.
The lineage graph is like a family tree for your data. Just like you can trace your ancestry back through parents, grandparents, and great-grandparents, you can trace any number on a dashboard back to its original source.
"Why does this revenue number look wrong?" โ Follow the lineage graph backwards and find exactly where the problem is!
It visualizes how data flows from raw sources, through transformations, to final tables that power dashboards:
- **Sources** (left edge of the graph): raw data from external systems. This is where data enters your warehouse. You don't control these tables; they're loaded by ETL tools.
- **Transformations** (middle): staging, intermediate, and other models that clean, join, and reshape the data. This is where dbt does its work.
- **Marts** (right edge): fact tables and mart models that power dashboards and reports. This is what business users actually see.
With hundreds of models, the full graph can be overwhelming. Use filters to focus on what matters:
```
# Show a specific model and everything upstream ("what feeds into this?")
+fct_customer_ltv

# Show a model and everything downstream ("what depends on this?")
fct_customer_ltv+

# Show 2 levels upstream of a model (parents and grandparents)
2+fct_customer_ltv

# Show a model plus everything upstream AND downstream
+fct_customer_ltv+

# Show all models in a specific directory
path:models/marts
```
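These filters use the same node-selection syntax as dbt's CLI, so you can preview what a filter matches with dbt ls before opening the graph. The model names below are from this lesson's examples, and the output format is simplified:

```
$ dbt ls --select +fct_customer_ltv
my_project.stg_customers
my_project.stg_orders
my_project.fct_customer_ltv
```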
The lineage graph is your most powerful debugging tool. When a dashboard number looks wrong, click on the final model in the graph, then trace backwards through each parent. You'll find the bug much faster than reading SQL files one by one. It's like following a river upstream to find where the pollution started!
Regular lineage tells you that your cake came from the kitchen. Column-level lineage tells you that the eggs came from Farm A, the flour came from Mill B, and the sugar came from Plantation C. It tracks individual ingredients, not just the dish!
Standard lineage shows connections between models (tables). Column-level lineage goes deeper: it shows which specific source columns feed into which specific final columns.
| Scenario | Without Column Lineage | With Column Lineage |
|---|---|---|
| "Revenue looks wrong" | Check every model manually | Click on revenue → see it comes from orders.amount → check that column |
| "Can I rename this column?" | Grep through 200 SQL files | See every downstream model that uses it instantly |
| "What feeds into this metric?" | Read SQL, follow the chain | One click → full column ancestry |
**Availability:** Column-level lineage is available in dbt Cloud (Explorer). For dbt Core (open source), you can use third-party tools like SQLLineage, dbt-osmosis, or Elementary to get similar functionality.
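As a taste of the open-source route, here's a sketch using the SQLLineage CLI on a compiled model. The file path is illustrative, and you should check the project's docs for the exact flags in your version:

```
$ pip install sqllineage
$ sqllineage -f target/compiled/my_project/models/marts/fct_customer_ltv.sql -l column
```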
dbt docs serve is great for local development, but how do you share docs with your whole team? Here are the most popular options:
1. **dbt Cloud** (easiest): Built-in hosting. Docs auto-update on every run. Zero setup. Best for: teams already using dbt Cloud.
2. **GitHub Pages** (free): Push the generated files to a GitHub Pages branch. Free and version-controlled. Best for: open-source projects or small teams.
3. **S3 / GCS static hosting** (flexible): Upload to an S3 or GCS bucket with static website hosting enabled. Best for: enterprise teams with cloud infrastructure.
4. **Wiki embed** (integrated): Embed in Confluence, Notion, or your company wiki, linking to the hosted docs site. Best for: teams with existing wiki culture.
```
# 1. Generate the docs
$ dbt docs generate

# 2. Copy the generated files
$ cp target/manifest.json target/catalog.json target/index.html docs/

# 3. Commit and push
$ git add docs/
$ git commit -m "Update dbt docs"
$ git push

# 4. Enable GitHub Pages in repo settings (Source: /docs folder)
#    Your docs are now live at https://yourorg.github.io/your-repo/
```
Automate this! Add dbt docs generate to your CI/CD pipeline so docs are updated on every merge to main. Your documentation will never be stale because it's regenerated from the actual code every time.
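Here's an illustrative GitHub Actions sketch of that idea. The workflow name, adapter, and publish step are assumptions for your setup, and warehouse credentials (via profiles.yml or env vars) are elided:

```yaml
# .github/workflows/docs.yml (illustrative sketch)
name: Update dbt docs
on:
  push:
    branches: [main]

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-postgres  # swap in your adapter
      - run: dbt docs generate                  # requires warehouse credentials
      - run: cp target/manifest.json target/catalog.json target/index.html docs/
      # ...then commit docs/ back or publish via your Pages deploy action
```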
Let's see how well you absorbed this lesson.
Can you explain to a colleague why dbt documentation is better than a manually maintained wiki? Can you describe what the lineage graph shows and how to filter it? If yes, you've mastered this lesson!
Quick recap: `{{ doc('name') }}` references a doc block; `+model`, `model+`, and `+model+` focus the lineage view.