Enterprise patterns, performance tuning, and career advice for analytics engineers
Hooks are like the prep work before cooking and the cleanup after. Before you cook dinner, you preheat the oven and wash your hands. After dinner, you wash the dishes and wipe the counter. You don't think about it; it just happens every time.
dbt hooks work the same way. They're SQL statements that run automatically before or after your model builds. You set them up once, and they fire every single time, with no manual work needed!
A pre-hook runs before your model's SQL executes. A post-hook runs after. You define them in the model's config block or in dbt_project.yml.
-- pre-hook: runs BEFORE the model builds
-- post-hook: runs AFTER the model builds
{{
    config(
        materialized='table',
        pre_hook=[
            "INSERT INTO audit_log (event, ts) VALUES ('fct_orders_start', CURRENT_TIMESTAMP)"
        ],
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE reporting_role",
            "INSERT INTO audit_log (event, ts) VALUES ('fct_orders_done', CURRENT_TIMESTAMP)"
        ]
    )
}}

SELECT
    order_id,
    customer_id,
    amount,
    order_date
FROM {{ ref('stg_orders') }}
Common hook use cases:
- Grants: after a model builds, grant SELECT to the reporting team so dashboards can read the data immediately.
- Audit logging: log when each model starts and finishes building. Great for debugging slow pipelines.
- Indexes: after a table is built, create indexes on frequently queried columns to speed up dashboards.
- Housekeeping: drop temporary tables, vacuum/analyze tables, or run any other cleanup SQL after the build (see the sketch below).
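The index and housekeeping cases might look like this as a minimal sketch, assuming a Postgres-style warehouse (the index name idx_orders_customer is made up; post-hooks run in the order listed):

{{
    config(
        materialized='table',
        post_hook=[
            "CREATE INDEX IF NOT EXISTS idx_orders_customer ON {{ this }} (customer_id)",
            "ANALYZE {{ this }}"
        ]
    )
}}

SELECT order_id, customer_id, amount, order_date
FROM {{ ref('stg_orders') }}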
models:
  my_project:
    marts:
      +post-hook:
        - "GRANT SELECT ON {{ this }} TO ROLE analyst_role"
        - "ANALYZE {{ this }}"

# Every model in the marts/ folder will automatically:
# 1. Build the table/view
# 2. Grant SELECT to analyst_role
# 3. Run ANALYZE for query optimization
Use project-wide hooks in dbt_project.yml for things that apply to many models (like granting permissions). Use model-level hooks for one-off tasks specific to a single model. Don't go overboard, though: every hook adds execution time!
Exposures are like a guest list for your restaurant. They tell dbt: "Hey, these dashboards and reports depend on this data, so be careful when changing it!"
Without exposures, you might change a model and accidentally break 5 dashboards without knowing. With exposures, dbt shows you exactly what's downstream, so you can warn people before making changes.
Exposures are declarations of downstream consumers: the dashboards, reports, ML models, and applications that use your dbt models. They don't change how dbt runs; they add visibility to your lineage graph.
version: 2

exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi-tool.company.com/dashboards/42
    description: >
      The CEO's weekly revenue dashboard.
      Breaking this will result in a very bad Monday morning.
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
    owner:
      name: Sarah Chen
      email: sarah@company.com

  - name: churn_prediction_model
    type: ml
    maturity: medium
    description: "ML model that predicts customer churn"
    depends_on:
      - ref('dim_customers')
      - ref('fct_orders')
    owner:
      name: Data Science Team
      email: ds-team@company.com
Run dbt docs generate && dbt docs serve and you'll see your exposures as orange nodes at the far right of the lineage graph. They connect back to the models they depend on, giving you a complete picture: raw data -> staging -> marts -> dashboards.
Exposure Types
- dashboard: BI dashboards (Looker, Tableau, Metabase)
- notebook: Jupyter notebooks or analysis scripts
- analysis: ad-hoc SQL queries or reports
- ml: machine learning models that consume your data
- application: backend services or APIs reading from your warehouse
Exposures are one of the most underused features in dbt. Add them for every important dashboard. When someone opens a PR that changes fct_orders, they'll immediately see: "Warning: this model feeds the CEO's revenue dashboard." That's powerful!
dbt Mesh is like a chain of restaurants. Each restaurant (project) has its own kitchen, its own menu, and its own chef. But they can share recipes and ingredients across locations. The pizza dough recipe from Location A can be used by Location B, without Location B needing to know how the dough is made.
When your company gets big enough that one dbt project becomes a mess (hundreds of models, dozens of teams stepping on each other), you split it into multiple projects that talk to each other. That's dbt Mesh.
Signs you need dbt Mesh:
- Your single project has grown so large that dbt run takes forever and nobody knows what half the models do.
- The marketing team, finance team, and product team all work in the same project and keep breaking each other's models.
- Your data spans completely different business domains (e-commerce, logistics, HR) that shouldn't be tightly coupled.
In dbt Mesh, projects can reference models from other projects using a two-argument ref():
-- In the "marketing" project, reference a model from the "core" project
SELECT
c.customer_id,
c.customer_name,
m.campaign_name,
m.spend
FROM
{{ ref('core', 'dim_customers') }} c -- โ Cross-project ref!
JOIN
{{ ref('stg_campaigns') }} m -- โ Same-project ref (normal)
ON c.customer_id = m.customer_id
Public models can be referenced by other projects. These are your "API": the stable, well-documented models you share across teams. Think of them as the menu items your restaurant chain offers everywhere.
Protected models (the default) can only be used within the same project. These are your internal staging models, intermediate calculations, and work-in-progress. Like the secret sauce recipe that stays in one kitchen.
-- Setting access='public' lets other projects ref() this model
{{
    config(
        materialized='table',
        access='public'
    )
}}

SELECT
    customer_id,
    customer_name,
    email,
    signup_date
FROM {{ ref('stg_customers') }}  -- stg_customers stays private
Start with one project. Only split into dbt Mesh when you genuinely feel the pain of a monolith. Premature splitting creates more problems than it solves, like opening 10 restaurant locations before your first one is profitable!
The Semantic Layer is like a universal menu that works across all restaurants in the chain. No matter which location you visit, "revenue" means the same thing: it's always calculated the same way. No more arguments about whether revenue includes tax or not!
Without a Semantic Layer, different dashboards might calculate "revenue" differently. The finance dashboard says $1M, the marketing dashboard says $1.2M, and the CEO is confused. The Semantic Layer defines metrics once, and every tool uses the same definition.
๐ธ The "Which Revenue Is Right?" Problem
Company X has 3 dashboards showing revenue. Dashboard A says $1M (includes refunds). Dashboard B says $1.2M (excludes refunds). Dashboard C says $950K (only counts completed orders). The CEO asks: "What's our actual revenue?" Nobody knows, because each analyst wrote their own SQL.
dbt uses MetricFlow to power the Semantic Layer. You define metrics in YAML, and any tool (Looker, Tableau, even a Python script) can query them consistently.
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: order_date
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum
        expr: amount
      - name: order_count
        agg: count
        expr: order_id

metrics:
  - name: revenue
    description: "Total revenue, the ONE true definition"
    type: simple
    label: Revenue
    type_params:
      measure: order_total

  - name: order_count
    description: "Number of orders"
    type: simple
    label: Order Count
    type_params:
      measure: order_count

  - name: average_order_value
    description: "Average revenue per order"
    type: derived
    label: Avg Order Value
    type_params:
      expr: revenue / order_count
      metrics:
        - name: revenue
        - name: order_count
# Query metrics directly from the command line
$ dbt sl query --metrics revenue --group-by order_date__month
order_date__month | revenue
-----------------+---------
2024-01 | 125,430
2024-02 | 142,890
2024-03 | 168,200
# Same metric, different granularity, same answer every time!
$ dbt sl query --metrics revenue --group-by order_date__year
The Semantic Layer is still evolving (it's a dbt Cloud feature). If you're on dbt Core, you can still define metrics in YAML for documentation purposes. The key takeaway: define your metrics once, in code, and version-control them; never let two dashboards disagree on what "revenue" means.
Performance tuning is like making your car go faster. You can upgrade the engine (better SQL), use a shorter route (partitioning), carry less stuff (selecting only needed columns), and drive during off-peak hours (scheduling). Each trick saves time and money, and in the cloud, time literally is money.
Instead of rebuilding an entire table every run, only process new or changed rows:
-- incremental_strategy can also be 'delete+insert' or 'append';
-- on_schema_change='sync_all_columns' handles new source columns automatically
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns'
    )
}}

SELECT
    order_id,
    customer_id,
    amount,
    order_date,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
Tell your warehouse how to organize data physically for faster queries:
{{
    config(
        materialized='table',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "month"
        },
        cluster_by=["customer_id", "status"]
    )
}}
-- BigQuery will store data in monthly partitions
-- and cluster within each partition by customer_id + status
-- Queries filtering on these columns will be MUCH faster!
SELECT * FROM {{ ref('stg_orders') }}
# Tag your models in config:
# {{ config(tags=['daily', 'finance']) }}
# Run only daily models
$ dbt run --select tag:daily
# Run only finance models and their upstream dependencies
$ dbt run --select +tag:finance
# Run a specific model and everything downstream
$ dbt run --select stg_orders+
# Exclude slow models during development
$ dbt run --exclude tag:heavy
| Technique | What It Does | Speed Gain | Cost Savings |
|---|---|---|---|
| Incremental models | Only process new/changed rows | 10x-100x faster | High (scans less data) |
| Partitioning | Organize data by date/key | 5x-50x faster queries | High (prunes partitions) |
| Clustering | Sort data within partitions | 2x-10x faster queries | Medium |
| Tag-based selection | Run only needed models | Variable | High (fewer models = less compute) |
| Parallel execution | Run independent models simultaneously | 2x-4x faster builds | Same cost, less wall-clock time |
| Ephemeral models | Inline CTEs instead of creating tables | Eliminates table creation overhead | Medium (no storage cost) |
# Run up to 8 models in parallel (default is 1)
$ dbt run --threads 8
# Set in profiles.yml for permanent config:
# my_project:
#   target: dev
#   outputs:
#     dev:
#       threads: 4    # dev uses 4 threads
#     prod:
#       threads: 16   # prod uses 16 threads
The biggest cost saver? Stop using SELECT *. Only select the columns you actually need. In columnar warehouses like BigQuery, Snowflake, and Redshift, every extra column means more data scanned: on BigQuery's on-demand pricing that directly inflates the bytes you're billed for, and on Snowflake and Redshift it burns extra warehouse time. Selecting 5 columns instead of 50 similar-sized columns scans roughly 90% less data.
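A minimal before-and-after sketch (the column list is illustrative):

-- Before: scans every column in stg_orders, even ones nobody uses
-- SELECT * FROM {{ ref('stg_orders') }}

-- After: scan only what downstream models and dashboards actually need
SELECT
    order_id,
    customer_id,
    amount,
    order_date
FROM {{ ref('stg_orders') }}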
Enterprise patterns are like the rules for running a big hospital instead of a small clinic. A small clinic can be informal โ one doctor, one nurse, everyone knows everything. But a hospital needs strict protocols: who can access patient records, how to handle emergencies, how to train new staff. Same with data at scale.
Every serious dbt project has at least 3 environments:
my_project:
  target: dev              # default target
  outputs:
    dev:
      type: snowflake
      schema: dbt_yourname # Each dev gets their own schema
      threads: 4
    staging:
      type: snowflake
      schema: staging
      threads: 8
    prod:
      type: snowflake
      schema: analytics    # The "real" schema dashboards read from
      threads: 16
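To point a run at one of these environments, pass --target on the command line (dev is the default because it's the profile's target). A common workflow, as a sketch:

# Day-to-day development builds into your personal dev schema
$ dbt run

# Build in staging before promoting a change
$ dbt run --target staging

# Production runs, usually triggered by a scheduler or CI rather than a laptop
$ dbt run --target prod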
Data contracts guarantee the shape of your model's output. If someone accidentally changes a column name or type, dbt will refuse to build it.
models:
  - name: dim_customers
    config:
      contract:
        enforced: true   # Enforce the contract!
    columns:
      - name: customer_id
        data_type: integer
        constraints:
          - type: not_null
          - type: primary_key
      - name: customer_name
        data_type: varchar
      - name: email
        data_type: varchar
      - name: signup_date
        data_type: date
When you need to change a model's structure but can't break existing consumers:
-- Consumers can choose which version to use:
SELECT * FROM {{ ref('dim_customers', v=1) }} -- Old version (deprecated)
SELECT * FROM {{ ref('dim_customers', v=2) }} -- New version (current)
SELECT * FROM {{ ref('dim_customers') }} -- Latest version (default)
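Those versions are declared in the model's YAML. A minimal sketch, assuming dim_customers has two versions and an illustrative deprecation date:

models:
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 1
        deprecation_date: 2025-06-30   # illustrative date for retiring v1
      - v: 2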
Use post-hooks to GRANT permissions. Analysts get SELECT, engineers get SELECT + INSERT, admins get everything. Never give everyone full access.
Use model groups, access controls, and data contracts to enforce who can change what. Tag PII columns and apply masking policies automatically.
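For the model-groups piece, a minimal sketch (the finance group, its owner, and fct_payments are made-up names):

groups:
  - name: finance
    owner:
      name: Finance Data Team
      email: finance-data@company.com

models:
  - name: fct_payments
    group: finance
    access: private   # only models in the finance group can ref() this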
Enterprise patterns aren't just for big companies. Even a 3-person data team benefits from separate dev/prod environments and data contracts. Start simple and add governance as you grow, but always have at least dev + prod from day one.
Anti-patterns are like bad habits. Biting your nails, skipping breakfast, leaving the fridge door open: they seem harmless at first, but they add up over time. In dbt, anti-patterns lead to slow builds, wrong data, and angry stakeholders. Let's learn what not to do!
Anti-Pattern #1: SELECT * in Production Models
Why it's bad: You're scanning every column, even ones nobody uses. In columnar warehouses, this can cost 10x more than selecting only the columns you need. Plus, if the source adds a column called _internal_debug_flag, it silently appears in your production table.
Fix: Always explicitly list your columns. SELECT customer_id, name, email FROM ...
Anti-Pattern #2: Not Testing Primary Keys
Why it's bad: Without unique + not_null on your primary key, duplicate rows silently multiply through every downstream model. Your revenue doubles, your user count triples, and nobody notices until the board meeting.
Fix: Add unique + not_null tests to every single primary key. No exceptions.
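The fix is a couple of lines of YAML per model. A sketch for a fct_orders model keyed on order_id:

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null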
Anti-Pattern #3: Hardcoding Environment Values
Why it's bad: Writing FROM production_db.public.orders directly in your SQL means the model only works in production. It breaks in dev and staging.
Fix: Always use {{ source() }} and {{ ref() }}. Let dbt handle the database/schema resolution based on your target environment.
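For example (the shop source name is hypothetical), declare the raw table once in a sources YAML:

sources:
  - name: shop
    database: production_db
    schema: public
    tables:
      - name: orders

Then reference it from the model, so the hardcoded location lives in one declared place instead of being scattered through your SQL:

SELECT
    order_id,
    customer_id,
    amount
FROM {{ source('shop', 'orders') }}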
Anti-Pattern #4: One Giant Model Instead of Layers
Why it's bad: A 500-line SQL file that does staging, business logic, and aggregation all in one model is impossible to test, debug, or reuse. When something breaks, good luck finding the bug.
Fix: Follow the staging -> intermediate -> marts pattern. Each layer does one thing well. Small, testable, reusable models.
Anti-Pattern #5: Not Using Incremental for Large Tables
Why it's bad: Rebuilding a 10-billion-row table from scratch every day is like demolishing your house and rebuilding it every time you want to change a lightbulb. It's slow, expensive, and unnecessary.
Fix: Use materialized='incremental' for any table over ~1 million rows that has a reliable timestamp column.
The Golden Rule
If you wouldn't be comfortable explaining your dbt project to a new team member in 30 minutes, it's probably too complex. Simplify, document, and test.
An analytics engineer is like a translator between the data world and the business world. Data engineers build the pipes that carry water. Business analysts drink the water. Analytics engineers make sure the water is clean, properly filtered, and delivered to the right glass. It's one of the hottest roles in tech right now!
Analytics engineers own the transformation layer: everything between raw data landing in the warehouse and clean data appearing in dashboards. They write dbt models, define metrics, build tests, and make sure the data is trustworthy.
| Level | USA (USD) | Europe (EUR) | India (INR) |
|---|---|---|---|
| Junior (0-2 years) | $80K-$110K | €45K-€65K | ₹8L-₹15L |
| Mid-level (2-5 years) | $110K-$150K | €65K-€90K | ₹15L-₹30L |
| Senior (5+ years) | $150K-$200K+ | €90K-€130K | ₹30L-₹50L+ |
| Staff / Lead | $180K-$250K+ | €110K-€160K | ₹45L-₹75L+ |
Portfolio Checklist
1. Build a complete dbt project using a public dataset (Jaffle Shop, NYC taxi data, or Kaggle datasets)
2. Push it to GitHub with a clear README explaining your architecture
3. Include tests, documentation, and a generated docs site
4. Write a blog post walking through your design decisions
5. Contribute to open-source dbt packages (even fixing typos in docs counts!)
The dbt Analytics Engineering Certification is the official credential from dbt Labs. It covers models, tests, documentation, Jinja, and deployment, and it's highly respected in the industry. Study this course and you'll be well-prepared!
Pair your dbt skills with a cloud cert: Snowflake SnowPro Core, Google Cloud Professional Data Engineer, or AWS Data Analytics Specialty.
Pro tip: Contribute to open-source dbt packages! Even small contributions (bug fixes, documentation improvements, new macros) show employers that you understand the ecosystem deeply. Check out dbt-utils, dbt-expectations, and dbt-audit-helper on GitHub.
You've just completed a 12-lesson journey from "What is dbt?" to enterprise-grade best practices. That's like going from learning to boil water to running a professional kitchen. Let's look back at everything you've learned!
Clone the Jaffle Shop repo and extend it. Add incremental models, custom tests, macros, and documentation. Push it to your GitHub.
Over 70,000 analytics engineers share tips, answer questions, and post job opportunities. It's the best community in data. Join at community.getdbt.com.
Take the dbt Analytics Engineering Certification exam. This course covers everything you need. Study the docs, practice with real projects, and you'll pass!
Read the official dbt docs, follow the dbt blog, and watch talks from Coalesce (dbt's annual conference).
Four challenging questions to test your mastery. These are harder than the previous quizzes โ you've earned it!
The questions cover: when a post-hook like GRANT SELECT ON {{ this }} TO ROLE analyst actually executes, what access='public' on a model means, and how replacing SELECT * with explicit column lists and using incremental models cuts cost and build time.
You've completed the entire dbt course: all 12 lessons, from zero to advanced. You now know more about dbt than most data professionals. That's incredible!
Can you explain hooks, exposures, and the Semantic Layer to a colleague? Do you know the top 5 anti-patterns to avoid? Can you describe the analytics engineer role and the skills needed? If yes, you're ready for the real world!