Enterprise patterns, performance tuning, and career advice for analytics engineers
Hooks are like the prep work before cooking and the cleanup after. Before you cook dinner, you preheat the oven and wash your hands. After dinner, you wash the dishes and wipe the counter. You don't think about it; it just happens every time.
dbt hooks work the same way. They're SQL statements that run automatically before or after your model builds. You set them up once, and they fire every single time, with no manual work needed!
A pre-hook runs before your model's SQL executes. A post-hook runs after. You define them in the model's config block or in dbt_project.yml.
-- pre-hook: runs BEFORE the model builds
-- post-hook: runs AFTER the model builds
{{
    config(
        materialized='table',
        pre_hook=[
            "INSERT INTO audit_log (event, ts) VALUES ('fct_orders_start', CURRENT_TIMESTAMP)"
        ],
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE reporting_role",
            "INSERT INTO audit_log (event, ts) VALUES ('fct_orders_done', CURRENT_TIMESTAMP)"
        ]
    )
}}

SELECT
    order_id,
    customer_id,
    amount,
    order_date
FROM {{ ref('stg_orders') }}
Common hook use cases:
- Grants: after a model builds, grant SELECT to the reporting team so dashboards can read the data immediately.
- Audit logging: log when each model starts and finishes building. Great for debugging slow pipelines.
- Indexes: after a table is built, create indexes on frequently queried columns to speed up dashboards.
- Housekeeping: drop temporary tables, vacuum/analyze tables, or run any other cleanup SQL after the build (see the sketch below).
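The index and housekeeping cases might look like this as a minimal sketch, assuming a Postgres-style warehouse (the index name idx_orders_customer is made up; post-hooks run in the order listed):

{{
    config(
        materialized='table',
        post_hook=[
            "CREATE INDEX IF NOT EXISTS idx_orders_customer ON {{ this }} (customer_id)",
            "ANALYZE {{ this }}"
        ]
    )
}}

SELECT order_id, customer_id, amount, order_date
FROM {{ ref('stg_orders') }}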
models:
  my_project:
    marts:
      +post-hook:
        - "GRANT SELECT ON {{ this }} TO ROLE analyst_role"
        - "ANALYZE {{ this }}"

# Every model in the marts/ folder will automatically:
# 1. Build the table/view
# 2. Grant SELECT to analyst_role
# 3. Run ANALYZE for query optimization
Use project-wide hooks in dbt_project.yml for things that apply to many models (like granting permissions). Use model-level hooks for one-off tasks specific to a single model. Don't go overboard, though: every hook adds execution time!
Exposures are like a guest list for your restaurant. They tell dbt: "Hey, these dashboards and reports depend on this data, so be careful when changing it!"
Without exposures, you might change a model and accidentally break 5 dashboards without knowing. With exposures, dbt shows you exactly what's downstream, so you can warn people before making changes.
Exposures are declarations of downstream consumers: the dashboards, reports, ML models, and applications that use your dbt models. They don't change how dbt runs; they add visibility to your lineage graph.
version: 2

exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi-tool.company.com/dashboards/42
    description: >
      The CEO's weekly revenue dashboard.
      Breaking this will result in a very bad Monday morning.
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
    owner:
      name: Sarah Chen
      email: sarah@company.com

  - name: churn_prediction_model
    type: ml
    maturity: medium
    description: "ML model that predicts customer churn"
    depends_on:
      - ref('dim_customers')
      - ref('fct_orders')
    owner:
      name: Data Science Team
      email: ds-team@company.com
Run dbt docs generate && dbt docs serve and you'll see your exposures as orange nodes at the far right of the lineage graph. They connect back to the models they depend on, giving you a complete picture: raw data -> staging -> marts -> dashboards.
Exposure Types
- dashboard: BI dashboards (Looker, Tableau, Metabase)
- notebook: Jupyter notebooks or analysis scripts
- analysis: ad-hoc SQL queries or reports
- ml: machine learning models that consume your data
- application: backend services or APIs reading from your warehouse
Exposures are one of the most underused features in dbt. Add them for every important dashboard. When someone opens a PR that changes fct_orders, they'll immediately see: "Warning: this model feeds the CEO's revenue dashboard." That's powerful!
dbt Mesh is like a chain of restaurants. Each restaurant (project) has its own kitchen, its own menu, and its own chef. But they can share recipes and ingredients across locations. The pizza dough recipe from Location A can be used by Location B, without Location B needing to know how the dough is made.
When your company gets big enough that one dbt project becomes a mess (hundreds of models, dozens of teams stepping on each other), you split it into multiple projects that talk to each other. That's dbt Mesh.
Signs you need dbt Mesh:
- Your single project has grown so large that dbt run takes forever and nobody knows what half the models do.
- The marketing team, finance team, and product team all work in the same project and keep breaking each other's models.
- Your data spans completely different business domains (e-commerce, logistics, HR) that shouldn't be tightly coupled.
In dbt Mesh, projects can reference models from other projects using a two-argument ref():
-- In the "marketing" project, reference a model from the "core" project
SELECT
c.customer_id,
c.customer_name,
m.campaign_name,
m.spend
FROM
{{ ref('core', 'dim_customers') }} c -- โ Cross-project ref!
JOIN
{{ ref('stg_campaigns') }} m -- โ Same-project ref (normal)
ON c.customer_id = m.customer_id
Public models can be referenced by other projects. These are your "API": the stable, well-documented models you share across teams. Think of them as the menu items your restaurant chain offers everywhere.
Protected models (the default) can only be used within the same project. These are your internal staging models, intermediate calculations, and work-in-progress. Like the secret sauce recipe that stays in one kitchen.
-- Setting access='public' lets other projects ref() this model
{{
    config(
        materialized='table',
        access='public'
    )
}}

SELECT
    customer_id,
    customer_name,
    email,
    signup_date
FROM {{ ref('stg_customers') }}  -- stg_customers stays private
Start with one project. Only split into dbt Mesh when you genuinely feel the pain of a monolith. Premature splitting creates more problems than it solves, like opening 10 restaurant locations before your first one is profitable!
The Semantic Layer is like a universal menu that works across all restaurants in the chain. No matter which location you visit, "revenue" means the same thing: it's always calculated the same way. No more arguments about whether revenue includes tax or not!
Without a Semantic Layer, different dashboards might calculate "revenue" differently. The finance dashboard says $1M, the marketing dashboard says $1.2M, and the CEO is confused. The Semantic Layer defines metrics once, and every tool uses the same definition.
๐ธ The "Which Revenue Is Right?" Problem
Company X has 3 dashboards showing revenue. Dashboard A says $1M (includes refunds). Dashboard B says $1.2M (excludes refunds). Dashboard C says $950K (only counts completed orders). The CEO asks: "What's our actual revenue?" Nobody knows, because each analyst wrote their own SQL.
dbt uses MetricFlow to power the Semantic Layer. You define metrics in YAML, and any tool (Looker, Tableau, even a Python script) can query them consistently.
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: order_date
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum
        expr: amount
      - name: order_count
        agg: count
        expr: order_id

metrics:
  - name: revenue
    description: "Total revenue, the ONE true definition"
    type: simple
    label: Revenue
    type_params:
      measure: order_total

  - name: order_count
    description: "Number of orders"
    type: simple
    label: Order Count
    type_params:
      measure: order_count

  - name: average_order_value
    description: "Average revenue per order"
    type: derived
    label: Avg Order Value
    type_params:
      expr: revenue / order_count
      metrics:
        - name: revenue
        - name: order_count
# Query metrics directly from the command line
$ dbt sl query --metrics revenue --group-by order_date__month
order_date__month | revenue
-----------------+---------
2024-01 | 125,430
2024-02 | 142,890
2024-03 | 168,200
# Same metric, different granularity, same answer every time!
$ dbt sl query --metrics revenue --group-by order_date__year
The Semantic Layer is still evolving (it's a dbt Cloud feature). If you're on dbt Core, you can still define metrics in YAML for documentation purposes. The key takeaway: define your metrics once, in code, and version-control them; never let two dashboards disagree on what "revenue" means.
Performance tuning is like making your car go faster. You can upgrade the engine (better SQL), use a shorter route (partitioning), carry less stuff (selecting only needed columns), and drive during off-peak hours (scheduling). Each trick saves time and money, and in the cloud, time literally is money.
Instead of rebuilding an entire table every run, only process new or changed rows:
-- incremental_strategy can also be 'delete+insert' or 'append';
-- on_schema_change='sync_all_columns' handles new source columns automatically
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns'
    )
}}

SELECT
    order_id,
    customer_id,
    amount,
    order_date,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
Tell your warehouse how to organize data physically for faster queries:
{{
    config(
        materialized='table',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "month"
        },
        cluster_by=["customer_id", "status"]
    )
}}
-- BigQuery will store data in monthly partitions
-- and cluster within each partition by customer_id + status
-- Queries filtering on these columns will be MUCH faster!
SELECT * FROM {{ ref('stg_orders') }}
# Tag your models in config:
# {{ config(tags=['daily', 'finance']) }}
# Run only daily models
$ dbt run --select tag:daily
# Run only finance models and their upstream dependencies
$ dbt run --select +tag:finance
# Run a specific model and everything downstream
$ dbt run --select stg_orders+
# Exclude slow models during development
$ dbt run --exclude tag:heavy
| Technique | What It Does | Speed Gain | Cost Savings |
|---|---|---|---|
| Incremental models | Only process new/changed rows | 10x-100x faster | High (scans less data) |
| Partitioning | Organize data by date/key | 5x-50x faster queries | High (prunes partitions) |
| Clustering | Sort data within partitions | 2x-10x faster queries | Medium |
| Tag-based selection | Run only needed models | Variable | High (fewer models = less compute) |
| Parallel execution | Run independent models simultaneously | 2x-4x faster builds | Same cost, less wall-clock time |
| Ephemeral models | Inline CTEs instead of creating tables | Eliminates table creation overhead | Medium (no storage cost) |
# Run up to 8 models in parallel (default is 1)
$ dbt run --threads 8
# Set in profiles.yml for permanent config:
# my_project:
#   target: dev
#   outputs:
#     dev:
#       threads: 4    # dev uses 4 threads
#     prod:
#       threads: 16   # prod uses 16 threads
The biggest cost saver? Stop using SELECT *. Only select the columns you actually need. In columnar warehouses like BigQuery, Snowflake, and Redshift, every extra column means more data scanned: on BigQuery's on-demand pricing that directly inflates the bytes you're billed for, and on Snowflake and Redshift it burns extra warehouse time. Selecting 5 columns instead of 50 similar-sized columns scans roughly 90% less data.
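A minimal before-and-after sketch (the column list is illustrative):

-- Before: scans every column in stg_orders, even ones nobody uses
-- SELECT * FROM {{ ref('stg_orders') }}

-- After: scan only what downstream models and dashboards actually need
SELECT
    order_id,
    customer_id,
    amount,
    order_date
FROM {{ ref('stg_orders') }}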
Enterprise patterns are like the rules for running a big hospital instead of a small clinic. A small clinic can be informal โ one doctor, one nurse, everyone knows everything. But a hospital needs strict protocols: who can access patient records, how to handle emergencies, how to train new staff. Same with data at scale.
Every serious dbt project has at least 3 environments:
my_project:
  target: dev              # default target
  outputs:
    dev:
      type: snowflake
      schema: dbt_yourname # Each dev gets their own schema
      threads: 4
    staging:
      type: snowflake
      schema: staging
      threads: 8
    prod:
      type: snowflake
      schema: analytics    # The "real" schema dashboards read from
      threads: 16
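To point a run at one of these environments, pass --target on the command line (dev is the default because it's the profile's target). A common workflow, as a sketch:

# Day-to-day development builds into your personal dev schema
$ dbt run

# Build in staging before promoting a change
$ dbt run --target staging

# Production runs, usually triggered by a scheduler or CI rather than a laptop
$ dbt run --target prod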
Data contracts guarantee the shape of your model's output. If someone accidentally changes a column name or type, dbt will refuse to build it.
models:
  - name: dim_customers
    config:
      contract:
        enforced: true   # Enforce the contract!
    columns:
      - name: customer_id
        data_type: integer
        constraints:
          - type: not_null
          - type: primary_key
      - name: customer_name
        data_type: varchar
      - name: email
        data_type: varchar
      - name: signup_date
        data_type: date
When you need to change a model's structure but can't break existing consumers:
-- Consumers can choose which version to use:
SELECT * FROM {{ ref('dim_customers', v=1) }} -- Old version (deprecated)
SELECT * FROM {{ ref('dim_customers', v=2) }} -- New version (current)
SELECT * FROM {{ ref('dim_customers') }} -- Latest version (default)
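Those versions are declared in the model's YAML. A minimal sketch, assuming dim_customers has two versions and an illustrative deprecation date:

models:
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 1
        deprecation_date: 2025-06-30   # illustrative date for retiring v1
      - v: 2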
Use post-hooks to GRANT permissions. Analysts get SELECT, engineers get SELECT + INSERT, admins get everything. Never give everyone full access.
Use model groups, access controls, and data contracts to enforce who can change what. Tag PII columns and apply masking policies automatically.
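For the model-groups piece, a minimal sketch (the finance group, its owner, and fct_payments are made-up names):

groups:
  - name: finance
    owner:
      name: Finance Data Team
      email: finance-data@company.com

models:
  - name: fct_payments
    group: finance
    access: private   # only models in the finance group can ref() this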
Enterprise patterns aren't just for big companies. Even a 3-person data team benefits from separate dev/prod environments and data contracts. Start simple and add governance as you grow, but always have at least dev + prod from day one.
Anti-patterns are like bad habits. Biting your nails, skipping breakfast, leaving the fridge door open: they seem harmless at first, but they add up over time. In dbt, anti-patterns lead to slow builds, wrong data, and angry stakeholders. Let's learn what not to do!
Anti-Pattern #1: SELECT * in Production Models
Why it's bad: You're scanning every column, even ones nobody uses. In columnar warehouses, this can cost 10x more than selecting only the columns you need. Plus, if the source adds a column called _internal_debug_flag, it silently appears in your production table.
Fix: Always explicitly list your columns. SELECT customer_id, name, email FROM ...
Anti-Pattern #2: Not Testing Primary Keys
Why it's bad: Without unique + not_null on your primary key, duplicate rows silently multiply through every downstream model. Your revenue doubles, your user count triples, and nobody notices until the board meeting.
Fix: Add unique + not_null tests to every single primary key. No exceptions.
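The fix is a couple of lines of YAML per model. A sketch for a fct_orders model keyed on order_id:

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null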
Anti-Pattern #3: Hardcoding Environment Values
Why it's bad: Writing FROM production_db.public.orders directly in your SQL means the model only works in production. It breaks in dev and staging.
Fix: Always use {{ source() }} and {{ ref() }}. Let dbt handle the database/schema resolution based on your target environment.
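For example (the shop source name is hypothetical), declare the raw table once in a sources YAML:

sources:
  - name: shop
    database: production_db
    schema: public
    tables:
      - name: orders

Then reference it from the model, so the hardcoded location lives in one declared place instead of being scattered through your SQL:

SELECT
    order_id,
    customer_id,
    amount
FROM {{ source('shop', 'orders') }}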
Anti-Pattern #4: One Giant Model Instead of Layers
Why it's bad: A 500-line SQL file that does staging, business logic, and aggregation all in one model is impossible to test, debug, or reuse. When something breaks, good luck finding the bug.
Fix: Follow the staging -> intermediate -> marts pattern. Each layer does one thing well. Small, testable, reusable models.
Anti-Pattern #5: Not Using Incremental for Large Tables
Why it's bad: Rebuilding a 10-billion-row table from scratch every day is like demolishing your house and rebuilding it every time you want to change a lightbulb. It's slow, expensive, and unnecessary.
Fix: Use materialized='incremental' for any table over ~1 million rows that has a reliable timestamp column.
The Golden Rule
If you wouldn't be comfortable explaining your dbt project to a new team member in 30 minutes, it's probably too complex. Simplify, document, and test.
An analytics engineer is like a translator between the data world and the business world. Data engineers build the pipes that carry water. Business analysts drink the water. Analytics engineers make sure the water is clean, properly filtered, and delivered to the right glass. It's one of the hottest roles in tech right now!
Analytics engineers own the transformation layer: everything between raw data landing in the warehouse and clean data appearing in dashboards. They write dbt models, define metrics, build tests, and make sure the data is trustworthy.
| Level | USA (USD) | Europe (EUR) | India (INR) |
|---|---|---|---|
| Junior (0-2 years) | $80K-$110K | €45K-€65K | ₹8L-₹15L |
| Mid-level (2-5 years) | $110K-$150K | €65K-€90K | ₹15L-₹30L |
| Senior (5+ years) | $150K-$200K+ | €90K-€130K | ₹30L-₹50L+ |
| Staff / Lead | $180K-$250K+ | €110K-€160K | ₹45L-₹75L+ |
Portfolio Checklist
1. Build a complete dbt project using a public dataset (Jaffle Shop, NYC taxi data, or Kaggle datasets)
2. Push it to GitHub with a clear README explaining your architecture
3. Include tests, documentation, and a generated docs site
4. Write a blog post walking through your design decisions
5. Contribute to open-source dbt packages (even fixing typos in docs counts!)
The dbt Analytics Engineering Certification is the official credential from dbt Labs. It covers models, tests, documentation, Jinja, and deployment, and it's highly respected in the industry. Study this course and you'll be well-prepared!
Pair your dbt skills with a cloud cert: Snowflake SnowPro Core, Google Cloud Professional Data Engineer, or AWS Data Analytics Specialty.
Pro tip: Contribute to open-source dbt packages! Even small contributions (bug fixes, documentation improvements, new macros) show employers that you understand the ecosystem deeply. Check out dbt-utils, dbt-expectations, and dbt-audit-helper on GitHub.
You've just completed a 12-lesson journey from "What is dbt?" to enterprise-grade best practices. That's like going from learning to boil water to running a professional kitchen. Let's look back at everything you've learned!
Clone the Jaffle Shop repo and extend it. Add incremental models, custom tests, macros, and documentation. Push it to your GitHub.
Over 70,000 analytics engineers share tips, answer questions, and post job opportunities. It's the best community in data. Join at community.getdbt.com.
Take the dbt Analytics Engineering Certification exam. This course covers everything you need. Study the docs, practice with real projects, and you'll pass!
Read the official dbt docs, follow the dbt blog, and watch talks from Coalesce (dbt's annual conference).
Four challenging questions to test your mastery. These are harder than the previous quizzes โ you've earned it!
The questions cover: when a post-hook like GRANT SELECT ON {{ this }} TO ROLE analyst actually executes, what access='public' on a model means, and how replacing SELECT * with explicit column lists and using incremental models cuts cost and build time.
You've completed the entire dbt course: all 12 lessons, from zero to advanced. You now know more about dbt than most data professionals. That's incredible!
Can you explain hooks, exposures, and the Semantic Layer to a colleague? Do you know the top 5 anti-patterns to avoid? Can you describe the analytics engineer role and the skills needed? If yes, you're ready for the real world!