Why reinvent the wheel? Use community packages and load reference data with seeds.
Packages are like pre-made spice mixes. Instead of measuring out 12 individual spices every time you cook, you grab a jar of "Italian Seasoning" that someone already mixed perfectly. dbt packages give you pre-built macros and tests that save you hours of work.
Imagine you need to create a unique ID by combining three columns. You could spend 30 minutes writing and testing a macro from scratch. Or you could install dbt_utils and use generate_surrogate_key in 10 seconds. That's the power of packages!
dbt packages are community-built, version-controlled collections of macros, tests, and models that anyone can install and use. Think of them like apps on your phone: someone else built them, you just install and enjoy.
Before writing any macro or custom test from scratch, search hub.getdbt.com first. Chances are, someone has already built exactly what you need, battle-tested and used by thousands of teams!
dbt deps is like going to the grocery store: it downloads all the ingredients (packages) you listed on your shopping list (packages.yml). You write down what you need, run the command, and dbt fetches everything for you. No manual downloading, no hunting around the internet!
Create a file called packages.yml in the root of your dbt project (same level as dbt_project.yml):
packages:
  # The Swiss Army Knife: macros for everything
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  # Advanced data quality tests (Great Expectations-style)
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  # Compare model versions during refactoring
  - package: dbt-labs/audit_helper
    version: [">=0.9.0", "<1.0.0"]
  # Generate YAML and SQL boilerplate automatically
  - package: dbt-labs/codegen
    version: [">=0.12.0", "<1.0.0"]
Version pinning explained: The syntax [">=1.1.0", "<2.0.0"] means "any version from 1.1.0 up to (but not including) 2.0.0." This protects you from breaking changes. It's like saying "I want Italian Seasoning version 1, not version 2 where they changed the recipe!"
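Exact pins are also supported. If a range ever pulls in a release that breaks something, you can lock a package to one specific version while you investigate:
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1  # exact pin: dbt deps will never resolve to a newer release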
$ dbt deps
Installing dbt-labs/dbt_utils...
Installed from version 1.1.1
Installing calogica/dbt_expectations...
Installed from version 0.10.3
Installing dbt-labs/audit_helper...
Installed from version 0.9.0
Installing dbt-labs/codegen...
Installed from version 0.12.1
Successfully installed 4 packages.
Packages are downloaded into the dbt_packages/ folder (previously dbt_modules/). This folder is auto-generated; never edit files inside it.
Add dbt_packages/ to your .gitignore! These files are downloaded, not written by you. It's like not checking your grocery bags into version control: you just keep the shopping list (packages.yml) and re-download when needed.
dbt_utils is like a kitchen multi-tool: it has a knife, bottle opener, screwdriver, and more. You'll use it in almost every project. It's the single most popular dbt package, and you'll find it in nearly every production dbt project. If dbt is the kitchen, dbt_utils is the drawer full of essential gadgets.
generate_surrogate_key: Creating Unique IDs
Sometimes your data doesn't have a single unique column. You need to combine multiple columns into one unique ID. This macro does it for you:
SELECT
    {{ dbt_utils.generate_surrogate_key([
        'order_id',
        'product_id'
    ]) }} AS order_item_id,
    order_id,
    product_id,
    quantity,
    unit_price
FROM {{ source('raw_shop', 'order_items') }}
-- Result: order_item_id = 'a1b2c3d4e5f6...' (a hash of order_id + product_id)
-- Every combination of order + product gets a unique, consistent ID!
Why not just concatenate? You could write order_id || '-' || product_id, but what if order_id is NULL? The whole thing becomes NULL. generate_surrogate_key handles NULLs, data types, and edge cases automatically. It's like using a professional blender instead of mashing things with a fork.
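For the curious: the macro compiles to a hash of the null-safe, string-cast, concatenated inputs. Roughly something like this (a sketch; the exact cast type and null placeholder vary by adapter and dbt_utils version):
-- Approximate compiled SQL:
md5(
    coalesce(cast(order_id as varchar), '_dbt_utils_surrogate_key_null_')
    || '-' ||
    coalesce(cast(product_id as varchar), '_dbt_utils_surrogate_key_null_')
) as order_item_id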
pivot / unpivot: Reshaping Data
Turn rows into columns (pivot) or columns into rows (unpivot):
-- Turn this: customer_id | month | revenue
-- 1 | January | 500
-- 1 | February | 300
--
-- Into this: customer_id | January | February
-- 1 | 500 | 300
SELECT
    customer_id,
    {{ dbt_utils.pivot(
        'month',
        dbt_utils.get_column_values(
            table=ref('monthly_revenue'),
            column='month'
        ),
        agg='sum',
        then_value='revenue'
    ) }}
FROM {{ ref('monthly_revenue') }}
GROUP BY customer_id
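Under the hood, pivot expands into one conditional aggregate per distinct value. The compiled SQL looks roughly like this (a sketch; alias quoting and the relation name depend on your adapter and schema):
SELECT
    customer_id,
    sum(case when month = 'January' then revenue else 0 end) as "January",
    sum(case when month = 'February' then revenue else 0 end) as "February"
FROM analytics.monthly_revenue
GROUP BY customer_id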
date_spine: Generating Date Ranges
Need a table with every single date between two points? This is your friend:
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2030-12-31' as date)"
) }}
-- Generates a table with one row per day:
-- date_day
-- 2020-01-01
-- 2020-01-02
-- 2020-01-03
-- ... (4,017 rows)
-- 2030-12-30  (date_spine includes the start_date but excludes the end_date)
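A classic use for a date spine: left join your facts onto it so days with zero activity still show up as rows. A sketch, where date_spine_days and daily_orders are hypothetical models:
SELECT
    spine.date_day,
    COALESCE(o.order_count, 0) AS order_count  -- zero-fill days with no orders
FROM {{ ref('date_spine_days') }} AS spine
LEFT JOIN {{ ref('daily_orders') }} AS o
    ON spine.date_day = o.order_date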
get_column_values: Dynamic Column Lists
-- Instead of hardcoding ['active', 'inactive', 'suspended']...
-- Let dbt figure it out from the actual data!
{% set statuses = dbt_utils.get_column_values(
    table=ref('stg_customers'),
    column='status'
) %}
-- statuses = ['active', 'inactive', 'suspended', 'pending']
-- Now use this list in a pivot, CASE WHEN, or anywhere else!
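For example, you can loop over the returned list to build one count column per status. A sketch (it assumes the status values are valid SQL identifiers, like 'active' or 'pending'):
{% set statuses = dbt_utils.get_column_values(
    table=ref('stg_customers'),
    column='status'
) %}

SELECT
    country_code,
    {% for s in statuses %}
    COUNT(CASE WHEN status = '{{ s }}' THEN 1 END) AS {{ s }}_count
    {%- if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ ref('stg_customers') }}
GROUP BY country_code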
star: Select All Columns Except Some
SELECT
    {{ dbt_utils.star(
        from=ref('stg_customers'),
        except=["email", "phone_number", "ssn"]
    ) }}
FROM {{ ref('stg_customers') }}
-- Selects ALL columns EXCEPT email, phone_number, and ssn
-- Perfect for creating PII-free views!
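At compile time, star simply expands into an explicit column list. Assuming stg_customers has these remaining columns (hypothetical names), the compiled SQL is just:
-- Approximate compiled SQL (identifier quoting varies by adapter):
SELECT
    "customer_id",
    "first_name",
    "last_name",
    "country_code",
    "status"
FROM analytics.stg_customers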
The dbt_utils package has 50+ macros. Don't try to memorize them all; just know they exist. When you hit a problem, check the dbt_utils README first. Odds are, there's a macro for it!
If built-in tests are like a basic health check (temperature, blood pressure), then dbt_expectations is like a full medical exam with an MRI, blood work, and X-rays. It brings dozens of specialized tests inspired by the Python library Great Expectations, giving you surgical precision over data quality.
models:
  - name: stg_orders
    columns:
      - name: amount
        tests:
          # Every order amount must be between $0 and $100,000
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
          # Average order value should be between $10 and $500
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 10
              max_value: 500
      - name: email
        tests:
          # Every email must match the pattern *@*.*
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[^@]+@[^@]+\\.[^@]+$"
    tests:
      # Table should have between 1,000 and 10,000,000 rows
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
          max_value: 10000000
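These tests accept the standard dbt test config block, so you can, for example, downgrade a noisy statistical check to a warning instead of a hard failure (a sketch):
columns:
  - name: amount
    tests:
      - dbt_expectations.expect_column_mean_to_be_between:
          min_value: 10
          max_value: 500
          config:
            severity: warn  # report the drift, but don't fail the build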
| Scenario | Built-in Test | dbt_expectations |
|---|---|---|
| Column has no NULLs | not_null ✅ | Overkill |
| Column is unique | unique ✅ | Overkill |
| Values within a range | Custom SQL test | expect_column_values_to_be_between ✅ |
| Row count sanity check | Custom SQL test | expect_table_row_count_to_be_between ✅ |
| Regex pattern matching | Custom SQL test | expect_column_values_to_match_regex ✅ |
| Statistical distribution | Not possible | expect_column_mean_to_be_between ✅ |
🧠 Rule of Thumb
Start with the 4 built-in tests (unique, not_null, accepted_values, relationships). When you need something they can't do (range checks, pattern matching, row count validation, statistical tests), reach for dbt_expectations.
The dbt ecosystem is rich with community packages. Here are the ones you'll encounter most often:
audit_helper
Compare two versions of a model side-by-side. Perfect for refactoring: did your changes break anything?
Use when: Rewriting a model and need to verify the output hasn't changed
codegen
Auto-generate schema.yml files and base model SQL from your sources. Saves hours of boilerplate typing.
Use when: Setting up a new project or adding new sources
dbt_date
Date utility macros: fiscal years, business days, date dimensions, and more.
Use when: Building date dimension tables or working with fiscal calendars
dbt_external_tables
Manage external tables (S3, GCS, Azure Blob) as dbt sources, with schema evolution support.
Use when: Your raw data lives in cloud object storage
elementary
Automated data monitoring and anomaly detection. Tracks metrics over time and alerts on drift.
Use when: You want continuous data observability without manual checks
dbt_project_evaluator
Enforce that every model has descriptions, tests, and documentation. Keeps your project tidy.
Use when: You want to enforce documentation standards across the team
-- Paste this into a scratch analysis file and compile it:
{{ codegen.generate_model_yaml(
    model_names=['stg_customers', 'stg_orders']
) }}
-- Output: a complete schema.yml with all columns listed!
-- Copy-paste it into your YAML file and add descriptions.
-- Saves 15+ minutes per model.
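You can also invoke the same macro straight from the command line, without creating any file:
$ dbt run-operation generate_model_yaml --args '{"model_names": ["stg_customers", "stg_orders"]}'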
You don't need all these packages on day one. Start with dbt_utils (everyone needs it) and dbt_expectations (for quality). Add others as your project grows. It's like building a toolbox: you don't buy every tool at the hardware store, just the ones you need for the current job!
Seeds are like the salt, pepper, and olive oil in your pantry: small, rarely changing reference data that you always need handy. Country codes, status mappings, department names, currency conversion rates... these don't change every day, and they're too small to warrant a full ETL pipeline. Just put them in a CSV file, and dbt loads them into your warehouse!
Seeds are CSV files that live in your dbt project's seeds/ folder. When you run dbt seed, dbt reads these CSV files and creates actual tables in your data warehouse. That's it: no ETL pipeline, no Fivetran connector, no API calls. Just a CSV file and one command.
Step 1: Create your CSV files in the seeds/ folder:
seeds/country_codes.csv:
country_code,country_name,region
US,United States,North America
GB,United Kingdom,Europe
DE,Germany,Europe
JP,Japan,Asia
IN,India,Asia
BR,Brazil,South America
AU,Australia,Oceania
NG,Nigeria,Africa
seeds/status_mappings.csv:
status_code,status_label,is_active
1,Active,true
2,Inactive,false
3,Suspended,false
4,Pending Review,false
5,Archived,false
Step 2: Run dbt seed to load them into your warehouse:
$ dbt seed
Running seed country_codes............. OK [INSERT 8 rows in 0.42s]
Running seed status_mappings.......... OK [INSERT 5 rows in 0.31s]
Completed successfully. 2 seeds loaded.
# Load a specific seed only
$ dbt seed --select country_codes
# Full refresh (drop and recreate the table)
$ dbt seed --full-refresh
Once loaded, seeds are referenced with ref(), just like any other model:
SELECT
    c.customer_id,
    c.first_name,
    c.country_code,
    cc.country_name,
    cc.region
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('country_codes') }} cc
    -- country_codes is a SEED, but you reference it with ref() just like a model!
    ON c.country_code = cc.country_code
Seeds appear in your DAG! When you use ref('country_codes'), dbt adds the seed to your dependency graph. It knows to load the seed before building any model that depends on it. Everything stays in the right order.
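Because seeds participate in the DAG, node selection works on them too. For example, this one command loads the seed and then rebuilds everything downstream of it:
$ dbt build --select country_codes+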
seeds:
  my_project:
    +schema: seeds          # Load into a 'seeds' schema
    +quote_columns: false   # Don't quote column names
    country_codes:
      +column_types:
        country_code: varchar(2)    # Override auto-detected types
        country_name: varchar(100)
Seeds are like the spice rack in your kitchen: small jars of things that rarely change. Sources are like the fridge: big, constantly restocked with fresh ingredients. You wouldn't put a whole turkey on the spice rack, and you wouldn't keep salt in the fridge. Each has its place!
| Criteria | 🌱 Seeds | 💡 Sources |
|---|---|---|
| Size | Small (< 1,000 rows) | Any size (millions of rows) |
| Change frequency | Rarely (monthly/quarterly) | Frequently (hourly/daily) |
| Where it lives | CSV in your dbt repo (version-controlled) | Already in your warehouse (loaded by ETL) |
| Who maintains it | Analytics engineer (you!) | Data engineer / ETL pipeline |
| How to load | dbt seed | Fivetran, Airbyte, custom ETL |
| Examples | Country codes, status mappings, department names | Orders, customers, transactions, events |
| How to reference | ref('seed_name') | source('name', 'table') |
⚠️ The 1,000-Row Rule
If your CSV has more than ~1,000 rows, it probably shouldn't be a seed. Large seeds slow down dbt seed and bloat your git repository. Use a proper ETL tool to load large datasets as sources instead.
Seeds are version-controlled, and that's their superpower! When you update a country code mapping, the change shows up in a git pull request. Your team can review it, approve it, and you have a full history of every change. Try doing that with a raw database table!
Once you've written macros that you use across multiple projects, it's time to package them up and share them, either within your company or with the whole dbt community. It's like writing a cookbook: you've perfected your recipes, now publish them so others can benefit!
You should build a package when:
✅ You have macros used in 3+ projects
✅ Multiple teams copy-paste the same code
✅ Your macros are generic enough to work in any project
✅ You want to enforce company-wide standards
You don't have to publish to dbt Hub. You can install packages directly from a Git repository:
packages:
  # Public packages from dbt Hub
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  # Private company package from GitHub
  - git: "https://github.com/my-company/our-dbt-utils.git"
    revision: v1.2.0  # Pin to a specific tag or branch
  # Or use a local path (for development)
  - local: ../my-local-package
Publishing to dbt Hub: If your package is useful to the community, you can publish it on hub.getdbt.com. It's free, and your package becomes installable by anyone with a simple package: your-name/your-package entry. Many popular packages started as internal tools that teams decided to open-source!
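Under the hood, a dbt package is just a dbt project with a dbt_project.yml and a macros/ folder. A minimal sketch of what a shareable macro package needs (all names here are hypothetical):
# dbt_project.yml at the root of the package repo
name: our_dbt_utils        # the namespace installing projects use to call your macros
version: "1.2.0"
config-version: 2
macro-paths: ["macros"]    # macros in macros/ become available to installing projects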
Let's see if you can stand on giants' shoulders. Click the answer you think is correct:
Question 1: Which command downloads the packages listed in packages.yml?
dbt install
dbt packages install
dbt deps
dbt get
Question 2: How do you reference the country_codes seed inside a model?
{{ source('seeds', 'country_codes') }}
{{ seed('country_codes') }}
{{ ref('country_codes') }}
FROM seeds.country_codes
Can you explain to a friend what dbt packages are and why you'd use them? Can you name 3 macros from dbt_utils? Do you know when to use a seed vs a source? If yes, you've nailed this lesson!
- Packages are community macro libraries declared in packages.yml and installed with dbt deps
- Seeds are CSV files in seeds/ that dbt loads into your warehouse with dbt seed
- Seeds are referenced with ref('seed_name'), just like models