dbt tests are your data's immune system. Let's build one.
Imagine you're a chef serving 1,000 customers tonight. Would you serve food without tasting it first? Absolutely not! One bad dish and your restaurant gets a 1-star review.
Data tests are like taste-testing every dish before it goes out to the dining room. Without tests, bad data reaches dashboards and people make wrong decisions, like ordering 10,000 umbrellas because a broken query said it was raining everywhere.
Think bad data can't hurt? Think again:
The $10 Million NULL
A company's customer_id column had NULL values. Their revenue query joined orders to customers with a NULL-safe comparison, so every NULL key matched every other NULL key, inflating revenue 3x. The board approved a $10M expansion based on fake numbers. A simple not_null test would have caught it in seconds.
The Dashboard That Lied
A marketing team ran a campaign based on a dashboard showing 50,000 new users. Turns out, a broken ETL pipeline was inserting duplicate rows. The real number was 12,000. They'd already spent $200K on ads targeting "growth" that didn't exist.
The goal of testing isn't perfection; it's catching problems before your stakeholders do. A test that runs automatically every morning is worth more than a human checking dashboards manually. Your boss should never be the one to find the bug!
dbt comes with 4 built-in tests right out of the box, no installation needed. Think of them as the 4 basic health checks every doctor does: temperature, blood pressure, heart rate, and breathing. They catch the most common problems.
You define these tests in a YAML file (usually schema.yml) next to your models. Let's meet each one:
unique: No Duplicates Allowed
No two customers should have the same ID, just like no two people should have the same passport number. If two people show up at the airport with passport #12345, something is very wrong!
-- dbt runs something like this behind the scenes:
SELECT customer_id
FROM your_model
GROUP BY customer_id
HAVING COUNT(*) > 1
-- If this returns ANY rows → TEST FAILS (duplicates found!)
-- If this returns 0 rows → TEST PASSES ✓
not_null: Every Row Must Have a Value
Every order must have a customer. An order with no customer is like a letter with no address: it's going nowhere, and nobody knows who sent it. NULL values in critical columns cause joins to break and calculations to silently return wrong answers.
-- dbt runs something like this behind the scenes:
SELECT *
FROM your_model
WHERE customer_id IS NULL
-- If this returns ANY rows → TEST FAILS (nulls found!)
-- If this returns 0 rows → TEST PASSES ✓
accepted_values: Only These Values Are Allowed
Status can only be 'active', 'inactive', or 'suspended'. Anything else, like 'actve' (a typo!) or 'deleted' (not a valid status), is a data quality issue. It's like a traffic light: only red, yellow, or green. If it turns purple, something is broken.
-- dbt runs something like this behind the scenes:
SELECT status
FROM your_model
WHERE status NOT IN ('active', 'inactive', 'suspended')
-- If this returns ANY rows → TEST FAILS (unexpected values!)
-- If this returns 0 rows → TEST PASSES ✓
relationships: Foreign Keys Must Exist
Every order's customer_id must exist in the customers table. It's like checking that every student in a classroom is actually enrolled in the school. If you find a kid who isn't in the system, something went wrong during registration!
-- dbt runs something like this behind the scenes:
SELECT o.customer_id
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
-- If this returns ANY rows → TEST FAILS (orphan records!)
-- If this returns 0 rows → TEST PASSES ✓
Here's how you define all 4 tests for your models in a single YAML file:
version: 2

models:
  - name: stg_customers
    description: "Cleaned customer data from raw source"
    columns:
      - name: customer_id
        description: "Primary key: unique identifier for each customer"
        tests:
          - unique        # No duplicate customer IDs
          - not_null      # Every row must have an ID
      - name: email
        tests:
          - not_null      # Every customer must have an email
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'suspended']

  - name: stg_orders
    description: "Cleaned order data"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:              # Every order's customer must exist!
              to: ref('stg_customers')
              field: customer_id
# Run ALL tests in your project
$ dbt test
Running tests...
PASS unique_stg_customers_customer_id        # ✓ No duplicates
PASS not_null_stg_customers_customer_id      # ✓ No nulls
PASS not_null_stg_customers_email            # ✓ All have emails
PASS accepted_values_stg_customers_status    # ✓ Valid statuses only
PASS unique_stg_orders_order_id              # ✓ No duplicates
PASS not_null_stg_orders_order_id            # ✓ No nulls
PASS not_null_stg_orders_customer_id         # ✓ No missing customer IDs
PASS relationships_stg_orders_customer_id    # ✓ All customers exist
Completed successfully. 8 passed, 0 failed.
# Run tests for a specific model only
$ dbt test --select stg_customers
# Run tests for everything downstream of a model
$ dbt test --select stg_customers+
How dbt Tests Work
Every test is just a SQL query. If the query returns zero rows, the test passes. If it returns any rows, the test fails. That's it! The rows returned are the "bad" records that violated the rule.
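To make this concrete, here is a minimal Python sketch of the same mechanic using an in-memory SQLite table as a stand-in for a dbt model (this is not dbt's actual code, just the rule it follows): run the test query, and pass only when it returns zero rows.

```python
import sqlite3

# Hypothetical in-memory table standing in for a dbt model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_model (customer_id INTEGER)")
conn.executemany(
    "INSERT INTO your_model VALUES (?)",
    [(1,), (2,), (2,), (None,)],  # id 2 is duplicated, one id is NULL
)

def run_test(conn, sql):
    """dbt's core rule: a test passes when its query returns zero rows."""
    bad_rows = conn.execute(sql).fetchall()
    return ("PASS" if not bad_rows else "FAIL", bad_rows)

# The 'unique' check: any id appearing more than once is a bad row.
unique_sql = """
    SELECT customer_id FROM your_model
    GROUP BY customer_id HAVING COUNT(*) > 1
"""
# The 'not_null' check: any NULL id is a bad row.
not_null_sql = "SELECT * FROM your_model WHERE customer_id IS NULL"

print(run_test(conn, unique_sql))    # FAIL: id 2 appears twice
print(run_test(conn, not_null_sql))  # FAIL: one row has a NULL id
```

The returned rows are exactly the "bad" records, which is why a failing test immediately tells you what to investigate.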
Sometimes the 4 built-in tests aren't enough. What if you need to check that all order amounts are positive? Or that no orders have dates in the future? The built-in tests can't do that.
Custom tests let you write any check you want, like a chef inventing a new taste test for a special dish. If you can write a SQL query, you can write a custom test!
Custom tests (also called singular tests) live in the tests/ folder at the root of your dbt project. Each test is a single .sql file.
The Rule
Write a query that returns the bad rows. If the query returns any rows, the test FAILS. If it returns zero rows, the test PASSES.
-- This test FAILS if any order has a negative or zero amount.
-- Think about it: can you buy something for -$50? Nope!
SELECT
    order_id,
    customer_id,
    amount
FROM {{ ref('stg_orders') }}
WHERE amount <= 0

-- If this returns rows like:
--   order_id=9912, customer_id=45, amount=-12.50
--   → TEST FAILS ✗ (we found bad data!)
--
-- If this returns 0 rows → TEST PASSES ✓
-- Orders can't be placed in the future!
-- If we see order_date > today, something is wrong with our ETL.
SELECT
    order_id,
    order_date
FROM {{ ref('stg_orders') }}
WHERE order_date > CURRENT_DATE

-- Found an order dated 2030-01-01? That's a time traveler.
-- Or more likely, a bug. TEST FAILS ✗
-- If daily revenue suddenly jumps to $10M when it's usually $50K,
-- something is probably duplicated. This is a "sanity check" test.
SELECT
    order_date,
    SUM(amount) AS daily_revenue
FROM {{ ref('stg_orders') }}
GROUP BY order_date
HAVING SUM(amount) > 1000000   -- Flag days with > $1M revenue

-- If your business normally does $50K/day, a $1M day is suspicious.
-- Investigate before it reaches the dashboard!
Custom tests are incredibly powerful. Any business rule you can express in SQL can become an automated test. "No customer should have more than 100 orders in a single day," "All email addresses should contain an @ symbol," "Total debits should equal total credits": if you can query it, you can test it!
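As an illustration, here is the "@ symbol" rule above expressed as a return-the-bad-rows query, sketched in Python against a hypothetical stg_customers table in SQLite. In a real project this would simply be a .sql file in tests/ selecting from {{ ref('stg_customers') }}.

```python
import sqlite3

# Hypothetical staging table with one deliberately malformed email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO stg_customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, "bad-email"), (3, "c@example.com")],
)

# The singular-test rule: SELECT the bad rows; zero rows means PASS.
bad_rows = conn.execute(
    "SELECT customer_id, email FROM stg_customers WHERE email NOT LIKE '%@%'"
).fetchall()

print("PASS" if not bad_rows else f"FAIL: {bad_rows}")
```

The query names the violation precisely: any row it returns is an email with no @ symbol, ready to be investigated.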
Freshness is like checking the expiration date on milk. You open the fridge, look at the carton, and check: "Was this delivered today? Yesterday? Last week?" If your source data hasn't been updated in 6 hours, something might be wrong upstream; maybe your ETL pipeline crashed at 3 AM and nobody noticed.
Source freshness checks answer one simple question: "When was this data last updated?" If the answer is "too long ago," dbt raises a warning or error.
version: 2

sources:
  - name: raw_shop
    database: raw_production
    schema: public
    loaded_at_field: _etl_loaded_at               # The timestamp column dbt checks
    freshness:                                    # Default for ALL tables in this source
      warn_after: {count: 12, period: hour}       # ⚠️ Warn if > 12 hours old
      error_after: {count: 24, period: hour}      # Error if > 24 hours old
    tables:
      - name: customers
        # Inherits the default freshness above
      - name: orders
        freshness:                                # Override for this specific table
          warn_after: {count: 6, period: hour}    # Orders should be fresher!
          error_after: {count: 12, period: hour}
      - name: payments
        loaded_at_field: updated_at               # Different timestamp column
$ dbt source freshness
Running source freshness checks...
PASS  source raw_shop.customers   # Last loaded 2 hours ago ✓ Fresh!
WARN  source raw_shop.orders      # Last loaded 8 hours ago ⚠️ Getting stale...
ERROR source raw_shop.payments    # Last loaded 30 hours ago ✗ STALE!
Done. PASS=1 WARN=1 ERROR=1
PASS: Data is fresh. The last update is within the warn_after threshold. Everything is healthy!
WARN: Data is getting stale. Past warn_after but not yet error_after. Investigate soon.
ERROR: Data is too old. Past error_after. Something is broken upstream. Fix it now!
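The threshold logic is simple enough to sketch. Here is a hedged Python approximation of the decision, assuming thresholds expressed in hours (dbt's real implementation compares the maximum of loaded_at_field to the current time, with richer period handling):

```python
from datetime import datetime, timedelta, timezone

def freshness_status(loaded_at, now, warn_after_hours, error_after_hours):
    """Approximate dbt's freshness check for hour-based thresholds."""
    age = now - loaded_at
    if age > timedelta(hours=error_after_hours):
        return "ERROR"   # Past error_after: something is broken upstream
    if age > timedelta(hours=warn_after_hours):
        return "WARN"    # Past warn_after: getting stale, investigate soon
    return "PASS"        # Within warn_after: data is fresh

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
# The three sources from the example output above:
print(freshness_status(now - timedelta(hours=2),  now, 12, 24))  # PASS
print(freshness_status(now - timedelta(hours=8),  now, 6, 12))   # WARN (orders override)
print(freshness_status(now - timedelta(hours=30), now, 12, 24))  # ERROR
```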
Schedule dbt source freshness to run every morning before your team starts work. If a pipeline broke overnight, you'll know before anyone opens a dashboard. Pair it with Slack or email alerts for maximum protection!
The 4 built-in tests are like a basic health check. But what if you need a full medical exam? That's where dbt_expectations comes in. It's like hiring a professional food inspector instead of just tasting the food yourself. It has dozens of specialized tests for every situation.
Add it to your packages.yml file in the root of your dbt project:
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
# Install the package
$ dbt deps
Installing calogica/dbt_expectations...
Successfully installed 1 package.
Now you can use powerful tests in your schema.yml:
models:
  - name: stg_orders
    columns:
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0          # No negative amounts
              max_value: 100000     # No absurdly high amounts
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 10         # Average order should be at least $10
              max_value: 500        # ...and no more than $500
    tests:                          # Model-level test: applies to the whole table
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000           # We should have at least 1,000 orders
          max_value: 10000000       # ...but not 10 million (probably dupes)
| Test Name | What It Checks | Example Use Case |
|---|---|---|
| `expect_column_values_to_be_between` | Values fall within a range | Order amounts between $0 and $100K |
| `expect_column_mean_to_be_between` | Average is within expected range | Avg order value between $10 and $500 |
| `expect_table_row_count_to_be_between` | Table has expected number of rows | Catch massive duplications or data loss |
| `expect_column_values_to_match_regex` | Values match a pattern | Email addresses contain @ |
| `expect_column_values_to_be_of_type` | Column has correct data type | Dates are actually dates, not strings |
| `expect_row_values_to_have_recent_data` | Table has recent rows | At least one order in the last 24 hours |
Start with the 4 built-in tests. Once you're comfortable, add dbt_expectations for the tricky stuff. You don't need to use every test; pick the ones that match your business rules. Quality over quantity!
Testing isn't something you do once and forget. It's a continuous cycle, like brushing your teeth. You do it every day, not just when you have a dentist appointment. Here's the workflow every dbt developer follows:
Step-by-Step Breakdown
1. Write Model: Create your SQL model in the models/ folder
2. Add Tests: Define schema tests in schema.yml and/or write custom tests in tests/
3. Run dbt test: Execute all tests and review results
4. Fix Failures: If tests fail, investigate the root cause. Is it bad source data? A bug in your SQL? Fix it and re-run
5. Deploy: Once all tests pass, merge your code and deploy to production
# Build your model
$ dbt run --select my_new_model
# Test it immediately
$ dbt test --select my_new_model
# Or do both at once with "build" (run + test)
$ dbt build --select my_new_model
Running model my_new_model... OK
Running tests...
PASS unique_my_new_model_id
PASS not_null_my_new_model_id
FAIL accepted_values_my_new_model_status  # ✗ Uh oh!
# Fix the issue, then re-run
$ dbt build --select my_new_model
Pro Tip: Tests in CI/CD
The best teams run dbt test automatically in their CI/CD pipeline (like GitHub Actions). Every time someone opens a pull request, tests run automatically. If any test fails, the PR is blocked from merging. This means bad data literally cannot reach production!
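As a sketch of what that can look like, here is a minimal GitHub Actions workflow. Treat it as illustrative only: the adapter (dbt-postgres), profile location, and file name are placeholder assumptions, not part of this lesson's project.

```yaml
# .github/workflows/dbt-ci.yml (illustrative sketch; adapter and
# profile details are placeholders for your own setup)
name: dbt CI
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-postgres   # pick your warehouse adapter
      - run: dbt deps                            # install packages
      - run: dbt build                           # run + test; any failure blocks the merge
        env:
          DBT_PROFILES_DIR: .                    # assumes a profiles.yml in the repo
```

Because `dbt build` exits nonzero when any test fails, the pull request check turns red and the merge is blocked automatically.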
You don't need to test everything, but you should test the important things. It's like locking your front door and car: you don't need to lock every drawer in your house, but the main entry points? Absolutely.
Add unique + not_null to every primary key column. This is the #1 most important test. If your primary key has duplicates or nulls, every downstream model is wrong.
Use relationships to verify that foreign keys actually point to existing records. Orphan records (orders with no customer) cause silent data loss in joins.
Write custom tests for domain-specific rules: "amounts must be positive," "dates can't be in the future," "each customer has at most one active subscription."
Always configure freshness on your sources. Stale data is invisible: your dashboards look fine, but the numbers are from yesterday (or last week!).
Use store_failures: true in your test config. When a test fails, dbt saves the failing rows to a table so you can investigate exactly what went wrong.
Use expect_table_row_count_to_be_between to catch data loss (table suddenly has 0 rows) or duplication (table has 10x the expected rows).
When a test fails, you want to know which rows failed. Add store_failures to save them:
# In dbt_project.yml:
tests:
  my_project:
    +store_failures: true   # Save failing rows to a table
    +severity: warn         # Warn instead of error (optional)
-- After a test fails, dbt creates a table with the bad rows:
SELECT *
FROM your_schema_dbt_test__audit.not_null_stg_orders_customer_id

-- Result:
-- order_id | customer_id
-- 9912     | NULL   ← This order has no customer!
-- 10045    | NULL   ← This one too!
A good rule of thumb: every model should have at least 2 tests (unique + not_null on the primary key). Critical models (revenue, user counts, anything on a dashboard) should have 5-10 tests. The more important the data, the more tests it deserves.
Before moving on, check yourself: Can you explain to a friend why data testing matters? Can you name the 4 built-in tests and give an example of each? Can you write a custom test that checks for negative values? If yes, you've nailed this lesson!

Key takeaways: custom tests are SQL files in tests/ that return bad rows (0 rows = pass), dbt build runs models + tests together, and dbt source freshness tells you when your sources were last updated.