LESSON 8 · INTERMEDIATE

📖 Documentation & Lineage: Your Data's Wikipedia

Auto-generated docs that never go stale. Plus the magical lineage graph.

🤔 Why Document?

Imagine joining a new job and finding 500 SQL files with no explanation. No comments. No descriptions. No clue what anything does. You'd feel like you walked into a library where every book cover is blank.

Documentation is like leaving a trail of breadcrumbs so the next person (or future you in 6 months) doesn't get lost in the forest of SQL files.

The "Hit by a Bus" Test

Every data team should ask themselves this uncomfortable question:

If the person who built this model disappears tomorrow, can someone else understand it?

If the answer is "no", you have a documentation problem. And it's not a matter of if someone will leave, it's when. People switch jobs, go on vacation, or simply forget what they built 3 months ago.

The Numbers Don't Lie

5x
Faster Onboarding

Teams with good docs onboard new members 5x faster

60%
Less "What Does This Do?"

Fewer Slack messages asking about model logic

0
Stale Docs

dbt auto-generates docs from your actual code

The magic of dbt documentation: Unlike a Google Doc or Confluence page that someone writes once and never updates, dbt docs are generated from your actual project. Column names, data types, test results: all pulled directly from the code. The docs can never drift from reality because they are reality.

📋 schema.yml: The Heart of Documentation

schema.yml is like the label on a food container in your fridge. It tells you what's inside, when it was made, and what ingredients were used. Without the label, you're opening mystery containers and hoping for the best.

You've already seen YAML files for tests and sources. The same files are where you add descriptions: the human-readable explanations that turn cryptic column names into understandable documentation.

Model Descriptions

At the top level, you describe what the model does and why it exists:

models/marts/_schema.yml
version: 2

models:
  - name: fct_customer_ltv
    description: >
      Customer lifetime value (LTV) fact table.
      One row per customer, showing their total spend,
      order count, and first/last order dates.
      Used by the Marketing team for segmentation
      and the Finance team for revenue forecasting.

Column Descriptions

Go deeper and describe every column, so anyone can understand the data without reading SQL:

models/marts/_schema.yml (continued)
    columns:
      - name: customer_id
        description: "Unique identifier for each customer. Sourced from raw_shop.customers.id"
        tests:
          - unique
          - not_null

      - name: lifetime_value
        description: >
          Total amount (in USD) the customer has spent across all orders.
          Calculated as SUM(order_amount). NULL if the customer has
          never placed an order.

      - name: first_order_date
        description: "Date of the customer's very first order. NULL if never ordered."

      - name: total_orders
        description: "Count of distinct orders placed by this customer. 0 if never ordered."
        tests:
          - not_null

Good descriptions answer three questions:

1. What is this? (a customer ID, a dollar amount, a date)

2. Where does it come from? (sourced from X table, calculated as Y)

3. What are the edge cases? (NULL if never ordered, 0 for new customers)
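
Put together, a description that answers all three questions might look like the fragment below (refund_total is a hypothetical column, used here purely for illustration):

```yaml
      # Hypothetical column; note how each sentence of the description
      # answers one of the three questions above (what / where from / edge cases).
      - name: refund_total
        description: >
          Total amount (in USD) refunded to the customer.
          Calculated as SUM(refunds.amount) across all refund rows.
          0 if the customer has never requested a refund.
```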

The Full Picture

Here's a complete, annotated YAML file showing models, columns, descriptions, and tests all working together:

models/staging/_schema.yml (full example)
version: 2

models:
  - name: stg_customers
    description: "Cleaned customer data. One row per customer. Renames raw columns to standard naming."
    columns:
      - name: customer_id
        description: "Primary key. Maps to raw_shop.customers.id"
        tests: [unique, not_null]
      - name: email
        description: "Customer email address. Renamed from email_addr."
      - name: signup_date
        description: "Date the customer registered. Cast from timestamp to date."

  - name: stg_orders
    description: "Cleaned order data. One row per order. Filters out cancelled orders."
    columns:
      - name: order_id
        description: "Primary key for orders"
        tests: [unique, not_null]
      - name: customer_id
        description: "Foreign key to stg_customers"
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Think of schema.yml as a nutrition label:

๐Ÿท๏ธ Model name = Product name ("Organic Tomato Soup")

๐Ÿ“ Model description = What it is ("A hearty soup made from...")

๐Ÿ“Š Column descriptions = Ingredients list (each ingredient explained)

โœ… Tests = Quality certifications (organic, non-GMO, etc.)

📦 Doc Blocks: Reusable Documentation

Doc blocks are like templates. Instead of writing the same description for customer_id in 20 different YAML files, you write it once and reference it everywhere. It's like having a dictionary: you define a word once, and everyone looks it up when they need it.

When you have columns that appear in many models (like customer_id, created_at, or amount), writing the same description over and over is tedious and error-prone. Doc blocks solve this.

Step 1: Create a Markdown File

Create a .md file anywhere in your models/ directory:

models/docs/common_columns.md
{% docs customer_id %}

Unique identifier for a customer. This is the primary key
in `stg_customers` and appears as a foreign key in most
downstream models.

**Source:** `raw_shop.customers.id`
**Type:** INTEGER
**Example:** 10042

{% enddocs %}

{% docs amount_usd %}

A monetary amount in US Dollars (USD). Always stored as
DECIMAL(10,2). Negative values indicate refunds.

**Example:** 49.99, -12.50

{% enddocs %}

{% docs created_at_date %}

The date a record was created, cast from the original
timestamp to DATE. Timezone is UTC.

{% enddocs %}

Step 2: Reference Doc Blocks in YAML

Now use {{ doc('block_name') }} in your descriptions:

models/marts/_schema.yml
version: 2

models:
  - name: fct_customer_ltv
    description: "Customer lifetime value fact table"
    columns:
      - name: customer_id
        description: "{{ doc('customer_id') }}"
        #  ↑ Pulls from the markdown file!

      - name: lifetime_value
        description: "{{ doc('amount_usd') }}"

      - name: signup_date
        description: "{{ doc('created_at_date') }}"

โŒ Without Doc Blocks

Same description copy-pasted in 20 files. Update one? You have to find and update all 20.

description: "Unique customer ID..."
description: "Unique customer ID..."
description: "Unique customer ID..."
× 20 files

✅ With Doc Blocks

Define once in a .md file. Reference everywhere. Update once, changes propagate automatically.

description: "{{ doc('customer_id') }}"
description: "{{ doc('customer_id') }}"
description: "{{ doc('customer_id') }}"
✓ All point to one source

Doc blocks support full Markdown: bold, italic, links, lists, even tables. Your documentation can be as rich as a wiki page, but it lives right next to your code and is always in sync.
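
For instance, a doc block for a hypothetical order_status column could embed a Markdown table of allowed values (the column name and statuses below are illustrative, not from this project):

```markdown
{% docs order_status %}

Current status of an order. One of:

| Value       | Meaning                          |
|-------------|----------------------------------|
| `placed`    | Order received, not yet shipped  |
| `shipped`   | Handed off to the carrier        |
| `returned`  | Sent back by the customer        |

{% enddocs %}
```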

🚀 dbt docs generate & serve

It's like pressing a magic button that turns all your recipe cards, ingredient labels, and cooking notes into a beautiful cookbook website. One command, and you get a fully searchable, interactive documentation site. No extra work needed!

Step 1: Generate the Docs

Terminal
$ dbt docs generate

Running with dbt=1.7.0
Found 12 models, 6 sources, 24 tests, 3 doc blocks
Building catalog...
Catalog written to target/catalog.json
Manifest written to target/manifest.json

Step 2: Serve the Docs

Terminal
$ dbt docs serve

Serving docs at http://localhost:8080
Press Ctrl+C to exit.

Open your browser and you'll see a full documentation website, automatically generated from your project!

What Gets Generated?

📄
manifest.json

Your entire project: models, sources, tests, macros, dependencies (everything dbt knows)

📊
catalog.json

Column names, data types, row counts, all pulled directly from your warehouse

🌐
index.html

A single-page app that combines both JSON files into a beautiful, searchable UI
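
Because manifest.json is plain JSON, you can also script against it. Here's a minimal sketch of walking a node's dependencies; the payload below is a heavily trimmed, hypothetical manifest (real files carry far more metadata per node):

```python
import json

# Heavily trimmed, hypothetical manifest payload for illustration.
manifest_json = """
{
  "nodes": {
    "model.shop.stg_orders": {
      "depends_on": {"nodes": ["source.shop.raw_shop.orders"]}
    },
    "model.shop.fct_customer_ltv": {
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
"""

def model_parents(manifest, node_id):
    """Direct upstream dependencies of one node in the manifest."""
    return manifest["nodes"][node_id]["depends_on"]["nodes"]

manifest = json.loads(manifest_json)
print(model_parents(manifest, "model.shop.fct_customer_ltv"))
# -> ['model.shop.stg_orders']
```

This is the same dependency information the lineage graph draws, just in machine-readable form.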

What You'll See in the Docs Site

  • Model list: every model with its description (find any model instantly)
  • Column details: name, type, description, tests (understand data without reading SQL)
  • Source info: raw tables, freshness, schemas (know where data originates)
  • Compiled SQL: the actual SQL that runs (debug without running dbt compile)
  • Lineage graph: visual DAG of all dependencies (see how data flows end-to-end)
  • Search: full-text search across everything (find anything in seconds)

Think of it this way: you know how Google Maps doesn't just show you streets, but also restaurants, gas stations, traffic, and reviews? dbt docs is like Google Maps for your data warehouse. It shows you every table, every column, every relationship, and every test, all in one interactive place.

🌳 The Lineage Graph: The Crown Jewel

The lineage graph is like a family tree for your data. Just like you can trace your ancestry back through parents, grandparents, and great-grandparents, you can trace any number on a dashboard back to its original source.

"Why does this revenue number look wrong?" → Follow the lineage graph backwards and find exactly where the problem is!

What Does the Lineage Graph Show?

It visualizes how data flows from raw sources, through transformations, to final tables that power dashboards:

SOURCES (📦 raw customers, raw orders, raw payments) → STAGING (stg_customers, stg_orders, stg_payments) → INTERMEDIATE (int_customer_orders, int_order_payments) → MARTS (fct_customer_ltv). Raw data enters on the left; final tables sit on the right.

How to Read the Lineage Graph

โฌ…๏ธ Left Side = Sources

Raw data from external systems. This is where data enters your warehouse. You don't control these tables; they're loaded by ETL tools.

🔄 Middle = Transformations

Staging, intermediate, and other models that clean, join, and reshape the data. This is where dbt does its work.

โžก๏ธ Right Side = Final Models

Marts and fact tables that power dashboards and reports. This is what business users actually see.

Filtering the Graph

With hundreds of models, the full graph can be overwhelming. Use filters to focus on what matters:

Lineage Graph Filters (in the docs UI search bar)
# Show only a specific model and everything upstream
+fct_customer_ltv       ← "What feeds into this model?"

# Show a model and everything downstream
fct_customer_ltv+       ← "What depends on this model?"

# Show 2 levels upstream of a model
2+fct_customer_ltv      ← "Show me parents and grandparents"

# Show a model plus everything upstream AND downstream
+fct_customer_ltv+      ← "Show upstream AND downstream"

# Show all models in a specific directory
path:models/marts       ← "Show me all mart models"

The lineage graph is your most powerful debugging tool. When a dashboard number looks wrong, click on the final model in the graph, then trace backwards through each parent. You'll find the bug much faster than reading SQL files one by one. It's like following a river upstream to find where the pollution started!
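
The +model/model+ selector semantics are easy to picture as a graph walk. Here's a toy sketch using a hypothetical dependency map, not dbt's real internals:

```python
# Hypothetical mini-project: each model mapped to its direct parents.
parents = {
    "stg_customers": ["raw_customers"],
    "stg_orders": ["raw_orders"],
    "int_customer_orders": ["stg_customers", "stg_orders"],
    "fct_customer_ltv": ["int_customer_orders"],
}

def upstream(node, graph):
    """Every ancestor of `node` -- roughly what the '+node' selector adds."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def downstream(node, graph):
    """Every descendant of `node` -- roughly what the 'node+' selector adds."""
    children = {}
    for child, ps in graph.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    return upstream(node, children)  # same walk with the edges reversed

print(sorted(upstream("fct_customer_ltv", parents)))
# -> ['int_customer_orders', 'raw_customers', 'raw_orders', 'stg_customers', 'stg_orders']
print(sorted(downstream("stg_orders", parents)))
# -> ['fct_customer_ltv', 'int_customer_orders']
```

Debugging "backwards through each parent" is exactly the upstream walk: start at the broken model and visit ancestors until you find the bad input.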

🔬 Column-Level Lineage

Regular lineage tells you that your cake came from the kitchen. Column-level lineage tells you that the eggs came from Farm A, the flour came from Mill B, and the sugar came from Plantation C. It tracks individual ingredients, not just the dish!

Standard lineage shows connections between models (tables). Column-level lineage goes deeper: it shows which specific source columns feed into which specific final columns.

Why Does This Matter?

  • "Revenue looks wrong": without column lineage, check every model manually 😰; with it, click on revenue → see it comes from orders.amount → check that column ✅
  • "Can I rename this column?": without, grep through 200 SQL files 😱; with, see every downstream model that uses it instantly ✅
  • "What feeds into this metric?": without, read SQL and follow the chain 😤; with, one click → full column ancestry ✅

How It Works

raw_orders.order_id      → COUNT(order_id) → fct_customer_ltv.total_orders
raw_orders.amount        → SUM(amount)     → fct_customer_ltv.lifetime_value
raw_orders.customer_id   → GROUP BY        → fct_customer_ltv.customer_id

🔑 Availability

Column-level lineage is available in dbt Cloud (Explorer). For dbt Core (open source), you can use third-party tools like SQLLineage, dbt-osmosis, or Elementary to get similar functionality.

๐Ÿ  Hosting Your Docs

dbt docs serve is great for local development, but how do you share docs with your whole team? Here are the most popular options:

โ˜๏ธ dbt Cloud

Built-in hosting. Docs auto-update on every run. Zero setup.

Best for: Teams already using dbt Cloud

Easiest ✨

📄 GitHub Pages

Push the generated files to a GitHub Pages branch. Free and version-controlled.

Best for: Open-source projects or small teams

Free 🆓

🪣 S3 / GCS Static

Upload to an S3 bucket or GCS bucket with static website hosting enabled.

Best for: Enterprise teams with cloud infrastructure

Flexible 🔧

📚 Internal Wiki

Embed in Confluence, Notion, or your company wiki. Link to the hosted docs site.

Best for: Teams with existing wiki culture

Integrated 🔗

Quick GitHub Pages Setup

Terminal: deploy docs to GitHub Pages
# 1. Generate the docs
$ dbt docs generate

# 2. Copy the generated files
$ cp target/manifest.json target/catalog.json target/index.html docs/

# 3. Commit and push
$ git add docs/
$ git commit -m "Update dbt docs"
$ git push

# 4. Enable GitHub Pages in repo settings → Source: /docs folder
# Your docs are now live at https://yourorg.github.io/your-repo/ 🎉

Automate this! Add dbt docs generate to your CI/CD pipeline so docs are updated on every merge to main. Your documentation will never be stale because it's regenerated from the actual code every time.
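
As a sketch, a GitHub Actions workflow for this could look like the following. The workflow name, adapter, profile location, and secret names are all placeholders to adapt to your own setup:

```yaml
# Hypothetical workflow -- adapter, CI profile, and secret names
# are placeholders, not part of this lesson's project.
name: publish-dbt-docs
on:
  push:
    branches: [main]

permissions:
  contents: write   # lets the job push the updated docs/ folder

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-postgres    # swap in your adapter
      - run: dbt docs generate --profiles-dir ci  # assumes a checked-in CI profile
        env:
          DBT_ENV_SECRET_PASSWORD: ${{ secrets.WAREHOUSE_PASSWORD }}
      - run: |
          cp target/manifest.json target/catalog.json target/index.html docs/
          git config user.name "docs-bot"
          git config user.email "docs-bot@users.noreply.github.com"
          git add docs/
          git commit -m "Update dbt docs" || echo "No doc changes"
          git push
```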

🧠 Quick Quiz: Test Your Understanding!

Let's see how well you absorbed this lesson. Click the answer you think is correct:

Question 1: What is the purpose of schema.yml descriptions?

A) They make dbt run faster by caching query results
B) They provide human-readable explanations of models and columns that appear in auto-generated docs
C) They define the SQL logic for each model
D) They are only used for testing, not documentation

Question 2: What do doc blocks ({% docs %}) allow you to do?

A) Write SQL inside Markdown files
B) Generate PDF reports from your models
C) Write reusable documentation once and reference it in multiple YAML files
D) Block certain users from viewing documentation

Question 3: In the lineage graph, what does the left side represent?

A) Final dashboard tables (marts)
B) Intermediate transformations
C) Raw source data from external systems
D) Test results and data quality checks

Can you explain to a colleague why dbt documentation is better than a manually maintained wiki? Can you describe what the lineage graph shows and how to filter it? If yes, you've mastered this lesson!

  • schema.yml descriptions: add human-readable explanations to models and columns
  • Doc blocks: write reusable descriptions in .md files, reference with {{ doc('name') }}
  • dbt docs generate: creates manifest.json + catalog.json from your project
  • dbt docs serve: launches a local documentation website
  • Lineage graph: visual DAG showing data flow from sources to final models
  • Graph filters: use +model, model+, +model+ to focus the lineage view
  • Column-level lineage: track individual columns through transformations (dbt Cloud)
  • Hosting options: GitHub Pages, S3, GCS, or internal wiki integration
  • manifest.json vs catalog.json: project metadata vs warehouse metadata
  • CI/CD automation: auto-regenerate docs on every merge