From your laptop to production. Automate everything.
Imagine you're a chef testing a new recipe. You don't serve it to customers right away; you test it in the back kitchen first. You taste it, adjust the seasoning, maybe burn a few batches. Only when it's perfect do you put it on the menu for the restaurant floor.
Dev is your test kitchen: break things, experiment, nobody cares. Prod is the restaurant floor: real customers, real dashboards, real decisions. If you serve a burnt dish in prod, people make wrong business decisions based on bad data!
In dbt, you keep dev and prod completely separate by using different schemas (think of them as different rooms in your data warehouse):
dev_analytics schema
Your personal sandbox. Run experiments, break things, iterate fast. Only you can see this data.
prod_analytics schema
The real deal. Dashboards, reports, and business decisions are based on this. Treat it like a hospital operating room: sterile and controlled.
The profiles.yml file tells dbt where to run your models. You define separate targets for dev and prod:
```yaml
my_project:
  target: dev  # Default target when you run dbt locally
  outputs:
    dev:
      type: postgres
      host: localhost
      user: analytics_dev
      password: "{{ env_var('DBT_DEV_PASSWORD') }}"
      database: warehouse
      schema: dev_analytics  # ← your test kitchen
      threads: 4
    prod:
      type: postgres
      host: prod-warehouse.company.com
      user: analytics_prod
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      database: warehouse
      schema: prod_analytics  # ← the restaurant floor
      threads: 8
```
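With both targets defined, you can verify each connection and choose where a run lands from the command line. A quick sketch (dbt picks the dev target by default because of the target: key above):

```bash
# Verify credentials and connectivity for each target
dbt debug                # uses the default target (dev)
dbt debug --target prod  # checks the prod connection

# Build into your personal sandbox
dbt run

# Build into production (only from CI/CD, never from a laptop!)
dbt run --target prod
```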
Using target.name in Your Models
Sometimes you want your models to behave differently in dev vs prod. For example, in dev you might only process the last 7 days of data (to save time), but in prod you process everything:
```sql
SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM {{ ref('stg_orders') }}

{% if target.name == 'dev' %}
-- In dev: only process the last 7 days (fast iteration!)
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
{% endif %}

-- In prod: the WHERE clause above is skipped → processes ALL data
```
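You can see exactly what SQL each target will produce without touching the warehouse. dbt compile renders the Jinja and writes the result under target/compiled/ (the model name and models/marts/ path below are just an assumed project layout):

```bash
# Render the Jinja for the dev target, then inspect the compiled SQL
dbt compile --select fct_orders --target dev
cat target/compiled/my_project/models/marts/fct_orders.sql
```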
The Golden Rule
Never run dbt run --target prod from your laptop. Production deployments should always go through an automated pipeline (CI/CD). This prevents "oops, I accidentally dropped the production table" moments.
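One lightweight way to enforce this is a small wrapper script on developers' laptops. This is just a sketch of the idea; it relies on the CI environment variable that most CI systems (including GitHub Actions) set automatically:

```bash
#!/usr/bin/env bash
# dbt-safe: refuse to target prod unless running inside a CI environment
if [[ "$*" == *"--target prod"* && -z "${CI:-}" ]]; then
  echo "Refusing to run against prod outside CI/CD." >&2
  exit 1
fi
exec dbt "$@"
```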
There are many ways to get your dbt models from your laptop to production. It's like choosing how to deliver a pizza: you can hand-deliver it, use a delivery app, hire a courier, or set up a drone. Each has trade-offs in cost, complexity, and reliability.
| Option | Complexity | Cost | Best For |
|---|---|---|---|
| dbt Cloud | Low | $$$ (paid) | Teams wanting a managed, no-ops solution |
| GitHub Actions | Medium | Free tier included | Teams already on GitHub who want flexibility |
| Airflow + dbt | High | Free (open source) | Complex pipelines where dbt is one step of many |
| Kubernetes CronJobs | High | $$ (infra costs) | Container-based teams with existing K8s clusters |
If you're just starting out, dbt Cloud is the easiest path. If you want full control and don't mind YAML files, GitHub Actions is the most popular free option. If dbt is part of a larger data pipeline (extract → load → transform → serve), Airflow is the industry standard orchestrator.
dbt Cloud is like a fully automated kitchen. You set the timer, pick the recipes, and it cooks, tests, and serves the food on schedule. You don't need to stand there stirring the pot β it does everything for you. You just check the results in the morning with your coffee.
In dbt Cloud, you create separate environments for each stage of your pipeline:
Development: Your personal IDE workspace. Run models interactively, test queries, iterate fast.
Schema: dev_<your_name>
Staging: A mirror of production for final testing. CI jobs run here to validate PRs before merging.
Schema: staging_analytics
Production: The real deal. Scheduled jobs run here on a cron schedule. Dashboards read from this.
Schema: prod_analytics
A dbt Cloud job is a scheduled set of commands. Here's what a typical production job looks like:
```bash
# Step 1: Load reference data (country codes, status mappings, etc.)
$ dbt seed

# Step 2: Build all models (staging → intermediate → marts)
$ dbt run

# Step 3: Run ALL tests (schema tests, custom tests, freshness)
$ dbt test

# Or combine steps 2 & 3 with "build" (runs + tests in DAG order)
$ dbt build
```
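dbt build interleaves runs and tests in DAG order, so a failing test stops anything downstream of it from building. In a scheduled job you can tighten this further with standard CLI flags:

```bash
# Build + test in DAG order; abort the whole run at the first failure
$ dbt build --fail-fast

# Scope a job to one mart plus everything upstream of it
$ dbt build --select +fct_orders
```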
Job Configuration Checklist
Schedule: Cron expression (e.g., 0 6 * * * = every day at 6 AM; see the cheat sheet below)
Commands: dbt seed, dbt run, dbt test (or just dbt build)
Notifications: Email or Slack alerts on failure
Timeout: Kill the job if it runs longer than expected (e.g., 60 minutes)
Retries: Automatically retry on transient failures (e.g., warehouse timeout)
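Cron expressions read as five space-separated fields. A few common schedules for reference:

```bash
# Cron fields: minute  hour  day-of-month  month  day-of-week
#   0 6 * * *     → 06:00 every day
#   0 */4 * * *   → every 4 hours, on the hour
#   30 5 * * 1-5  → 05:30 on weekdays (Mon-Fri)
```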
A typical morning at a data team using dbt Cloud: At 6 AM, the production job runs automatically. By 7 AM, all models are built and tested. If anything fails, the on-call engineer gets a Slack notification. By 8 AM when the business team opens their dashboards, the data is fresh, tested, and ready. Nobody had to wake up early or press any buttons!
CI/CD is like having a robot assistant that checks your recipe every time you change it. You write a new recipe (code), hand it to the robot (open a pull request), and the robot automatically taste-tests everything (runs dbt tests). If it passes all taste tests, the robot adds it to the menu (deploys to production). If something tastes wrong, the robot blocks it and tells you what's broken.
CI = Continuous Integration (test every change). CD = Continuous Deployment (auto-deploy when tests pass).
Create this file in your repository. GitHub Actions will automatically run it on every pull request:
```yaml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-postgres

      - name: Install packages
        run: dbt deps

      # state:modified+ needs production's manifest.json to compare against;
      # fetch it into ./prod-artifacts/ in an earlier step (one way to do
      # that is shown in the Slim CI notes below)
      - name: Run modified models
        run: dbt run --select state:modified+ --defer --state ./prod-artifacts/
        env:
          DBT_PROFILES_DIR: .
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}

      - name: Test modified models
        run: dbt test --select state:modified+ --state ./prod-artifacts/
        env:
          DBT_PROFILES_DIR: .
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
```
Why state:modified?
Imagine your project has 200 models. You changed just 1 model. Without Slim CI, you'd rebuild and test all 200 models, taking 45 minutes. With state:modified+, dbt only runs the model you changed plus everything downstream of it. That's maybe 5 models and 3 minutes. It's like only re-inspecting the rooms you renovated, not the entire building!
```bash
# Only run models that changed since the last production run
$ dbt run --select state:modified+ --defer --state ./prod-artifacts/

# state:modified → models YOU changed in this PR
# +              → plus everything DOWNSTREAM of those models
# --defer        → for unchanged models, use prod tables instead of rebuilding
# --state        → path to production's manifest.json (the "before" snapshot)
```
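Where does ./prod-artifacts/ come from? A common pattern, sketched here with a hypothetical S3 bucket name, is to publish target/manifest.json after every successful production run and download it at the start of each CI run:

```bash
# After a successful production run: publish the manifest
aws s3 cp target/manifest.json s3://acme-dbt-artifacts/prod/manifest.json

# At the start of a CI run: fetch production's manifest for comparison
mkdir -p prod-artifacts
aws s3 cp s3://acme-dbt-artifacts/prod/manifest.json prod-artifacts/manifest.json
```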
```yaml
name: dbt Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Job-level env so every dbt step can read the credentials
    env:
      DBT_PROFILES_DIR: .
      DBT_PROD_PASSWORD: ${{ secrets.DBT_PROD_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-postgres
      - run: dbt deps
      - run: dbt seed --target prod
      - run: dbt run --target prod
      - run: dbt test --target prod
```
The CI workflow runs on pull requests (test before merge). The deploy workflow runs on push to main (deploy after merge). Together, they form a safety net: no untested code ever reaches production. It's the gold standard for data teams!
Airflow is the head chef who coordinates the entire kitchen. dbt is the sous chef who handles all the data prep. Together they run a smooth operation. Airflow says "First, extract data from the API. Then, load it into the warehouse. Then, let dbt transform it. Finally, send a Slack notification." dbt handles step 3 beautifully, but Airflow orchestrates the whole show.
Use Airflow when dbt is one step in a larger pipeline:
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
}

with DAG(
    dag_id='dbt_production_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',  # every day at 6 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    dbt_seed = BashOperator(
        task_id='dbt_seed',
        bash_command='cd /opt/dbt/my_project && dbt seed --target prod',
    )

    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='cd /opt/dbt/my_project && dbt run --target prod',
    )

    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='cd /opt/dbt/my_project && dbt test --target prod',
    )

    notify_success = BashOperator(
        task_id='notify_success',
        bash_command='echo "dbt pipeline completed successfully!"',
    )

    # Define the execution order
    dbt_seed >> dbt_run >> dbt_test >> notify_success
```
The astronomer-cosmos package automatically converts your dbt DAG into individual Airflow tasks, one task per model. This gives you granular visibility and retry capabilities:
```python
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig

dbt_dag = DbtDag(
    project_config=ProjectConfig("/opt/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
    ),
    schedule_interval="0 6 * * *",
    start_date=datetime(2024, 1, 1),
    dag_id="dbt_cosmos_pipeline",
)

# Cosmos auto-generates one Airflow task per dbt model!
# stg_customers → stg_orders → fct_orders → dim_customers
# Each model is a separate, retryable task in Airflow.
```
Why Cosmos is a game-changer: Without Cosmos, a single dbt run is one big Airflow task. If model #47 out of 200 fails, you have to re-run all 200. With Cosmos, each model is its own task: if model #47 fails, you only retry that one model. It's like having individual light switches instead of one master switch for the whole building!
Imagine writing your house key's code on a sticky note and taping it to your front door. That's what hardcoding passwords in your code is like. Anyone who sees the code gets your password. Instead, you keep the key in a safe (environment variables) and only take it out when you need to unlock the door.
Using env_var() in profiles.yml

```yaml
my_project:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      database: "{{ env_var('SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
      schema: prod_analytics
      threads: 8
```
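Locally, you supply those values through your shell before invoking dbt (the placeholder values here are obviously hypothetical); CI systems inject them from their secret stores instead:

```bash
# Set credentials in the current shell session (never commit these!)
export SNOWFLAKE_ACCOUNT='acme-xy12345'
export SNOWFLAKE_USER='analytics_prod'
export SNOWFLAKE_PASSWORD='********'
export SNOWFLAKE_ROLE='TRANSFORMER'
export SNOWFLAKE_DATABASE='WAREHOUSE'
export SNOWFLAKE_WAREHOUSE='TRANSFORMING'

# Verify the connection picks up the env vars (debug only tests connectivity)
dbt debug
```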
GitHub Actions secrets: Go to your repo → Settings → Secrets and variables → Actions. Add secrets like DBT_PROD_PASSWORD. GitHub encrypts them and injects them into your workflow at runtime.
dbt Cloud environment variables: In dbt Cloud → Environment → Environment Variables. Set values per environment (dev vs prod). dbt Cloud handles injection automatically.
Secret managers (e.g., HashiCorp Vault, AWS Secrets Manager): Enterprise-grade secret management. Your CI/CD pipeline fetches secrets at runtime from a centralized vault. The most secure option (sketch below).
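As a sketch of the vault approach (the secret path here is hypothetical), a CI step might fetch the password just before running dbt:

```bash
# Fetch the prod password from Vault and expose it only to this process
export SNOWFLAKE_PASSWORD="$(vault kv get -field=password secret/dbt/prod)"
dbt run --target prod
```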
Security Best Practices
Add profiles.yml to your .gitignore (it contains connection details).
Use env_var() for every credential in profiles.yml.

Deploying your models is only half the battle. You also need to watch them like a security camera. Is the pipeline running on time? Are tests passing? Did the data arrive? Monitoring is like having a smoke detector in your kitchen: you don't need it every day, but when something burns, you want to know immediately.
Track how long each model takes to build. If fct_orders usually takes 2 minutes but suddenly takes 30, something changed; maybe a bad join is creating a cartesian product.
Get notified instantly when a test fails. A failing not_null test on customer_id at 6 AM means your 8 AM dashboard will show wrong numbers.
Monitor dbt source freshness results (see the example below). If your orders table hasn't been updated in 12 hours, the ETL pipeline upstream probably crashed.
Track row counts over time. A sudden 10x spike means duplicates; a sudden drop to zero means data loss. Both are emergencies.
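Several of these signals come straight from dbt itself. Source freshness checks, for example, can run as their own scheduled job and fail loudly when upstream data goes stale (thresholds are defined per source in your sources .yml files):

```bash
# Compare each source's latest loaded_at timestamp against its SLA;
# exits non-zero on error, so a scheduler or CI step fails automatically
dbt source freshness
```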
```yaml
# In dbt Cloud → Account Settings → Notifications
# Email notifications:
#   ✅ On job failure  → data-team@company.com
#   ✅ On test failure → data-quality@company.com
# Slack notifications (via webhook):
#   ✅ On job failure  → #data-alerts channel
#   ✅ On job success  → #data-pipeline-log channel

# For GitHub Actions, add a Slack step:
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    channel-id: 'C01234ABCDE'
    slack-message: '🚨 dbt pipeline failed! Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}'
  env:
    SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
```
Set up a "morning check" dashboard that shows: (1) last successful run time, (2) number of test failures, (3) source freshness status, and (4) model execution times. Your on-call engineer checks this first thing every morning. If everything is green, grab your coffee. If something is red, start debugging before the business team notices!
Imagine you have two identical kitchens: Kitchen A (Blue) and Kitchen B (Green). While Kitchen A serves customers, Kitchen B quietly prepares the new menu. Once Kitchen B is ready and taste-tested, you swap: customers now eat from Kitchen B, and Kitchen A becomes the prep area. Customers never notice the change. Zero downtime, zero risk!
```bash
# Step 1: Build everything into a staging schema
# (assumes a generate_schema_name macro that appends the schema_suffix var)
$ dbt run --target prod --vars '{"schema_suffix": "_staging"}'

# Step 2: Run ALL tests against the staging schema
$ dbt test --target prod --vars '{"schema_suffix": "_staging"}'

# Step 3: If all tests pass, swap the schemas in the warehouse (atomic):
#   ALTER SCHEMA prod_analytics RENAME TO prod_analytics_old;
#   ALTER SCHEMA prod_analytics_staging RENAME TO prod_analytics;

# Step 4: Drop the old schema (optional; keeping it gives you a backup)
#   DROP SCHEMA prod_analytics_old CASCADE;

# Result: zero downtime! Dashboards never see stale or partial data.
```
Why Blue-Green Matters
Without Blue-Green: During a dbt run, some tables are updated and some aren't. If a dashboard queries at that exact moment, it sees a mix of old and new data, like reading a book where half the pages are from the old edition and half from the new one.
With Blue-Green: The swap is atomic; in one instant, everything changes at once. Dashboards always see a consistent snapshot. It's the gold standard for production deployments.
Blue-green deployments are an advanced pattern. Most teams don't need this on day one. Start with simple scheduled deployments. Once your project grows to 100+ models and your dashboards are business-critical, consider blue-green for zero-downtime deployments.
Let's see if you're ready to ship it. Check yourself against these questions:
Why should you never run dbt run --target prod from your laptop?
What does state:modified+ do in Slim CI?
Can you explain the difference between dev and prod environments?
Can you describe what CI/CD does and why it matters?
Do you know where to store secrets and why you should never hardcode them?
If yes, you're ready to ship it!
Production job order: dbt seed → dbt run → dbt test (or just dbt build)
Use state:modified+ for Slim CI