LESSON 11 Β· ADVANCED

πŸš€ Deployment & CI/CD β€” Ship It!

From your laptop to production. Automate everything.

🏠 Dev vs Prod β€” Why It Matters

Imagine you're a chef testing a new recipe. You don't serve it to customers right away β€” you test it in the back kitchen first. You taste it, adjust the seasoning, maybe burn a few batches. Only when it's perfect do you put it on the menu for the restaurant floor.

Dev is your test kitchen β€” break things, experiment, nobody cares. Prod is the restaurant floor β€” real customers, real dashboards, real decisions. If you serve a burnt dish in prod, people make wrong business decisions based on bad data!

Separate Schemas: The Two Kitchens

In dbt, you keep dev and prod completely separate by using different schemas (think of them as different rooms in your data warehouse):

πŸ§ͺ Development

dev_analytics schema

Your personal sandbox. Run experiments, break things, iterate fast. Only you can see this data.

🏭 Production

prod_analytics schema

The real deal. Dashboards, reports, and business decisions are based on this. Treat it like a hospital operating room β€” sterile and controlled.

profiles.yml β€” Defining Your Targets

The profiles.yml file tells dbt where to run your models. You define separate targets for dev and prod:

~/.dbt/profiles.yml
my_project:
  target: dev                    # Default target when you run dbt locally
  outputs:

    dev:
      type: postgres
      host: localhost
      user: analytics_dev
      password: "{{ env_var('DBT_DEV_PASSWORD') }}"
      database: warehouse
      schema: dev_analytics          # ← Your test kitchen
      threads: 4

    prod:
      type: postgres
      host: prod-warehouse.company.com
      user: analytics_prod
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      database: warehouse
      schema: prod_analytics         # ← The restaurant floor
      threads: 8

Using target.name in Your Models

Sometimes you want your models to behave differently in dev vs prod. For example, in dev you might only process the last 7 days of data (to save time), but in prod you process everything:

models/marts/fct_orders.sql
SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM
    {{ ref('stg_orders') }}

{% if target.name == 'dev' %}
    -- In dev: only process last 7 days (fast iteration!)
    WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
{% endif %}

-- In prod: this WHERE clause is skipped β†’ processes ALL data

🧠 The Golden Rule

Never run dbt run --target prod from your laptop. Production deployments should always go through an automated pipeline (CI/CD). This prevents "oops, I accidentally dropped the production table" moments.
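Some teams enforce this rule with a small wrapper around their dbt invocations. Here is a hypothetical sketch in Python (not a dbt feature): it refuses the prod target unless the run is happening inside CI, detected via the `CI` environment variable that GitHub Actions and most CI systems set automatically.

```python
import os

def guard_prod_target(argv, environ=None):
    """Hypothetical pre-run guard (not built into dbt): refuse
    '--target prod' unless running inside CI. GitHub Actions and most
    CI systems set CI=true automatically."""
    environ = environ if environ is not None else os.environ
    wants_prod = False
    if "--target" in argv:
        idx = argv.index("--target")
        wants_prod = idx + 1 < len(argv) and argv[idx + 1] == "prod"
    if wants_prod and environ.get("CI") != "true":
        raise SystemExit("Refusing to run against prod outside CI/CD!")

guard_prod_target(["dbt", "run", "--target", "dev"])                       # fine on a laptop
guard_prod_target(["dbt", "run", "--target", "prod"], environ={"CI": "true"})  # fine in CI
```

Running the same prod command from a laptop (where `CI` is unset) would abort before dbt ever connects to the warehouse.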

πŸ—ΊοΈ Deployment Options

There are many ways to get your dbt models from your laptop to production. It's like choosing how to deliver a pizza β€” you can hand-deliver it, use a delivery app, hire a courier, or set up a drone. Each has trade-offs in cost, complexity, and reliability.

Option              | Complexity | Cost               | Best For
dbt Cloud           | ⭐ Low      | $$$ (paid)         | Teams wanting a managed, no-ops solution
GitHub Actions      | ⭐⭐ Medium  | Free (open source) | Teams already on GitHub who want flexibility
Airflow + dbt       | ⭐⭐⭐ High   | Free (open source) | Complex pipelines where dbt is one step of many
Kubernetes CronJobs | ⭐⭐⭐ High   | $$ (infra costs)   | Container-based teams with existing K8s clusters
The typical flow, regardless of tooling:

💻 Write Code (local) → 🔀 Push to Git (PR) → 🧪 CI Tests Run (auto) → ✅ Merge to Main (approved) → 🚀 Deploy to Prod (auto)
If you're just starting out, dbt Cloud is the easiest path. If you want full control and don't mind YAML files, GitHub Actions is the most popular free option. If dbt is part of a larger data pipeline (extract β†’ load β†’ transform β†’ serve), Airflow is the industry standard orchestrator.

☁️ dbt Cloud Deployment

dbt Cloud is like a fully automated kitchen. You set the timer, pick the recipes, and it cooks, tests, and serves the food on schedule. You don't need to stand there stirring the pot β€” it does everything for you. You just check the results in the morning with your coffee.

Setting Up Environments

In dbt Cloud, you create separate environments for each stage of your pipeline:

πŸ§ͺ Development

Your personal IDE workspace. Run models interactively, test queries, iterate fast.

Schema: dev_<your_name>

πŸ”¬ Staging

A mirror of production for final testing. CI jobs run here to validate PRs before merging.

Schema: staging_analytics

🏭 Production

The real deal. Scheduled jobs run here on a cron schedule. Dashboards read from this.

Schema: prod_analytics

Creating a Production Job

A dbt Cloud job is a scheduled set of commands. Here's what a typical production job looks like:

Typical Production Job β€” Commands
# Step 1: Load reference data (country codes, status mappings, etc.)
$ dbt seed

# Step 2: Build all models (staging β†’ intermediate β†’ marts)
$ dbt run

# Step 3: Run ALL tests (schema tests, custom tests, freshness)
$ dbt test

# Or combine steps 2 & 3 with "build" (runs + tests in DAG order)
$ dbt build

Job Configuration Checklist

Schedule: Cron expression (e.g., 0 6 * * * = every day at 6 AM)

Commands: dbt seed, dbt run, dbt test (or just dbt build)

Notifications: Email or Slack alerts on failure

Timeout: Kill the job if it runs longer than expected (e.g., 60 minutes)

Retries: Automatically retry on transient failures (e.g., warehouse timeout)
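To make the cron line in the checklist concrete, here's a minimal Python sketch of how a 5-field cron expression is matched against a timestamp. It supports only `*` and plain integers, which is enough for `0 6 * * *`; real cron also handles ranges, steps, and lists, and numbers days-of-week from Sunday.

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check whether a 5-field cron expression matches a given time.
    Supports only '*' and plain integers -- enough for '0 6 * * *'.
    (Simplified: real cron numbers days-of-week with 0 = Sunday.)"""
    minute, hour, dom, month, dow = expr.split()
    fields = [
        (minute, dt.minute),
        (hour, dt.hour),
        (dom, dt.day),
        (month, dt.month),
        (dow, dt.weekday()),
    ]
    return all(f == "*" or int(f) == value for f, value in fields)

print(cron_matches("0 6 * * *", datetime(2024, 1, 15, 6, 0)))  # True  -- 6:00 AM
print(cron_matches("0 6 * * *", datetime(2024, 1, 15, 7, 0)))  # False -- 7:00 AM
```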

A typical morning at a data team using dbt Cloud: At 6 AM, the production job runs automatically. By 7 AM, all models are built and tested. If anything fails, the on-call engineer gets a Slack notification. By 8 AM when the business team opens their dashboards, the data is fresh, tested, and ready. Nobody had to wake up early or press any buttons!

βš™οΈ GitHub Actions CI/CD

CI/CD is like having a robot assistant that checks your recipe every time you change it. You write a new recipe (code), hand it to the robot (open a pull request), and the robot automatically taste-tests everything (runs dbt tests). If it passes all taste tests, the robot adds it to the menu (deploys to production). If something tastes wrong, the robot blocks it and tells you what's broken.

CI = Continuous Integration (test every change). CD = Continuous Deployment (auto-deploy when tests pass).

Full CI Workflow Example

Create this file in your repository. GitHub Actions will automatically run it on every pull request:

.github/workflows/dbt_ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-postgres

      - name: Install packages
        run: dbt deps

      - name: Run modified models
        # Slim CI: state comparison needs the prod manifest on disk
        # (assumes ./prod-artifacts/ holds manifest.json from the last prod run)
        run: dbt run --select state:modified+ --state ./prod-artifacts/
        env:
          DBT_PROFILES_DIR: .
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}

      - name: Test modified models
        run: dbt test --select state:modified+ --state ./prod-artifacts/
        env:
          DBT_PROFILES_DIR: .
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}

Slim CI with state:modified

Why "state:modified"? Imagine your project has 200 models. You changed just 1 model. Without Slim CI, you'd rebuild and test all 200 models β€” taking 45 minutes. With state:modified+, dbt only runs the model you changed plus everything downstream of it. That's maybe 5 models and 3 minutes. It's like only re-inspecting the rooms you renovated, not the entire building!

Slim CI commands explained
# Only run models that changed since the last production run
$ dbt run --select state:modified+ --defer --state ./prod-artifacts/

# state:modified  β†’ models YOU changed in this PR
# +               β†’ plus everything DOWNSTREAM of those models
# --defer         β†’ for unchanged models, use prod tables instead of rebuilding
# --state         β†’ path to production's manifest.json (the "before" snapshot)
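Under the hood, state selection boils down to diffing two manifests and walking the DAG. Here's an illustrative Python sketch of how `state:modified+` picks models. This is not dbt's actual implementation, and the real manifest is far richer, but the idea is the same: diff checksums, then include every dependent.

```python
def modified_plus(prev_checksums, curr_checksums, children):
    """Illustrative sketch of 'state:modified+' selection (NOT dbt's
    real code): diff model checksums between the prod manifest and the
    PR's manifest, then include everything downstream.
    'children' maps model name -> list of direct dependents."""
    modified = {
        name for name, checksum in curr_checksums.items()
        if prev_checksums.get(name) != checksum
    }
    selected, stack = set(), list(modified)
    while stack:
        node = stack.pop()
        if node not in selected:
            selected.add(node)
            stack.extend(children.get(node, ()))
    return selected

# Only stg_orders changed, so it plus its downstream models are selected:
prev = {"stg_orders": "aaa", "stg_customers": "bbb", "fct_orders": "ccc"}
curr = {"stg_orders": "zzz", "stg_customers": "bbb", "fct_orders": "ccc"}
deps = {"stg_orders": ["fct_orders"], "stg_customers": ["fct_orders"]}
print(sorted(modified_plus(prev, curr, deps)))  # ['fct_orders', 'stg_orders']
```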

Deploy on Merge to Main

.github/workflows/dbt_deploy.yml
name: dbt Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Job-level env so EVERY dbt step gets credentials, not just the last one
      DBT_PROFILES_DIR: .
      DBT_PROD_PASSWORD: ${{ secrets.DBT_PROD_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - run: pip install dbt-postgres
      - run: dbt deps
      - run: dbt seed --target prod
      - run: dbt run --target prod
      - run: dbt test --target prod

The CI workflow runs on pull requests (test before merge). The deploy workflow runs on push to main (deploy after merge). Together, they form a safety net: no untested code ever reaches production. It's the gold standard for data teams!

🌬️ Airflow + dbt Integration

Airflow is the head chef who coordinates the entire kitchen. dbt is the sous chef who handles all the data prep. Together they run a smooth operation. Airflow says "First, extract data from the API. Then, load it into the warehouse. Then, let dbt transform it. Finally, send a Slack notification." dbt handles step 3 beautifully, but Airflow orchestrates the whole show.

When to Use Airflow + dbt

Use Airflow when dbt is one step in a larger pipeline:

📥 Extract from API (Python) → 📤 Load to Warehouse (Fivetran) → 🔧 Transform with dbt (dbt run) → 🧪 Test Quality (dbt test) → 📢 Notify Team (Slack)

Full Airflow DAG Example

dags/dbt_pipeline.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
}

with DAG(
    dag_id='dbt_production_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',     # Every day at 6 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    dbt_seed = BashOperator(
        task_id='dbt_seed',
        bash_command='cd /opt/dbt/my_project && dbt seed --target prod',
    )

    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='cd /opt/dbt/my_project && dbt run --target prod',
    )

    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='cd /opt/dbt/my_project && dbt test --target prod',
    )

    notify_success = BashOperator(
        task_id='notify_success',
        bash_command='echo "dbt pipeline completed successfully!"',
    )

    # Define the execution order
    dbt_seed >> dbt_run >> dbt_test >> notify_success

The Cosmos Package β€” dbt + Airflow Made Easy

The astronomer-cosmos package automatically converts your dbt DAG into individual Airflow tasks β€” one task per model. This gives you granular visibility and retry capabilities:

dags/dbt_cosmos_pipeline.py
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig

dbt_dag = DbtDag(
    project_config=ProjectConfig("/opt/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
        # ProfileConfig also needs to know where profiles.yml lives
        profiles_yml_filepath="/opt/dbt/my_project/profiles.yml",
    ),
    schedule_interval="0 6 * * *",
    start_date=datetime(2024, 1, 1),
    dag_id="dbt_cosmos_pipeline",
)

# Cosmos auto-generates one Airflow task per dbt model!
# stg_customers β†’ stg_orders β†’ fct_orders β†’ dim_customers
# Each model is a separate, retryable task in Airflow.

Why Cosmos is a game-changer: Without Cosmos, a single dbt run is one big Airflow task. If model #47 out of 200 fails, you have to re-run all 200. With Cosmos, each model is its own task β€” if model #47 fails, you only retry that one model. It's like having individual light switches instead of one master switch for the whole building!

πŸ” Environment Variables & Secrets

Imagine writing your door code on a sticky note and taping it to your front door. That's what hardcoding passwords in your code is like: anyone who sees the code gets your password. Instead, you keep the key in a safe (environment variables) and only take it out when you need to unlock the door.

Using env_var() in profiles.yml

profiles.yml β€” The SAFE way
my_project:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      database: "{{ env_var('SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
      schema: prod_analytics
      threads: 8
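For intuition, here's a rough Python sketch of what `env_var()` does at compile time: return the variable's value, fall back to an optional default (dbt's `env_var` accepts one as a second argument), or fail loudly if neither exists. The variable names below are made up for the demo.

```python
import os

def env_var(name, default=None):
    """Rough sketch of how dbt's env_var() behaves: return the value,
    fall back to an optional default, otherwise fail the compile loudly
    instead of silently connecting with a blank credential."""
    value = os.environ.get(name)
    if value is not None:
        return value
    if default is not None:
        return default
    raise RuntimeError(f"Env var required but not provided: '{name}'")

os.environ["MY_DEMO_VAR"] = "s3cr3t"             # simulate an exported variable
print(env_var("MY_DEMO_VAR"))                    # s3cr3t
print(env_var("MY_MISSING_VAR", "dev_default"))  # dev_default
```

The "fail loudly" behavior is the point: a missing secret stops the run immediately rather than producing a confusing authentication error later.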

Where to Store Secrets

πŸ”‘ GitHub Secrets

Go to your repo β†’ Settings β†’ Secrets β†’ Actions. Add variables like DBT_PROD_PASSWORD. GitHub encrypts them and injects them into your workflow at runtime.

☁️ dbt Cloud Variables

In dbt Cloud β†’ Environment β†’ Environment Variables. Set values per environment (dev vs prod). dbt Cloud handles injection automatically.

πŸ—„οΈ Vault / AWS Secrets Manager

Enterprise-grade secret management. Your CI/CD pipeline fetches secrets at runtime from a centralized vault. Most secure option.

🚨 Security Best Practices

  • Never commit passwords, API keys, or tokens to Git β€” ever
  • Add profiles.yml to your .gitignore (it contains connection details)
  • Use env_var() for every credential in profiles.yml
  • Rotate passwords regularly (at least every 90 days)
  • Use read-only service accounts for production dbt runs
  • Audit who has access to production credentials quarterly

πŸ“Š Monitoring & Alerting

Deploying your models is only half the battle. You also need to watch over them. Is the pipeline running on time? Are tests passing? Did the data arrive? Monitoring is like having a smoke detector in your kitchen β€” you don't need it every day, but when something burns, you want to know immediately.

What to Monitor

⏱️ Execution Time

Track how long each model takes to build. If fct_orders usually takes 2 minutes but suddenly takes 30, something changed β€” maybe a bad join is creating a cartesian product.

πŸ§ͺ Test Failures

Get notified instantly when a test fails. A failing not_null test on customer_id at 6 AM means your 8 AM dashboard will show wrong numbers.

πŸ“… Data Freshness

Monitor dbt source freshness results. If your orders table hasn't been updated in 12 hours, the ETL pipeline upstream probably crashed.

πŸ“ˆ Row Count Trends

Track row counts over time. A sudden 10x spike means duplicates; a sudden drop to zero means data loss. Both are emergencies.
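As an illustration of the row-count idea, here's a toy Python check. The 3x spike threshold and alert messages are made up for the example, not a standard; real monitoring tools use more robust statistics.

```python
def row_count_alert(history, today, spike_factor=3.0):
    """Toy daily row-count check: flag a drop to zero or a sudden
    spike relative to the trailing average. The 3x threshold is
    illustrative, not a standard."""
    if not history:
        return None                  # no baseline yet, nothing to compare
    avg = sum(history) / len(history)
    if today == 0:
        return "EMERGENCY: row count dropped to zero (data loss?)"
    if today > avg * spike_factor:
        return f"WARNING: {today} rows vs ~{avg:.0f}/day average (duplicates?)"
    return None

print(row_count_alert([1000, 1050, 980], 1020))   # None -- a normal day
print(row_count_alert([1000, 1050, 980], 10200))  # WARNING -- sudden 10x spike
print(row_count_alert([1000, 1050, 980], 0))      # EMERGENCY -- data loss
```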

Notification Channels

dbt Cloud notification setup
# In dbt Cloud β†’ Account Settings β†’ Notifications

# Email notifications:
# βœ… On job failure β†’ data-team@company.com
# βœ… On test failure β†’ data-quality@company.com

# Slack notifications (via webhook):
# βœ… On job failure β†’ #data-alerts channel
# βœ… On job success β†’ #data-pipeline-log channel

# For GitHub Actions, add a Slack step:
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    channel-id: 'C01234ABCDE'
    slack-message: '🚨 dbt pipeline failed! Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}'
  env:
    SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

Set up a "morning check" dashboard that shows: (1) last successful run time, (2) number of test failures, (3) source freshness status, and (4) model execution times. Your on-call engineer checks this first thing every morning. If everything is green, grab your coffee. If something is red, start debugging before the business team notices!
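A morning check can be as simple as a script that folds those signals into one green/red summary. This Python sketch uses made-up status fields, not a real dbt Cloud API; in practice you'd populate them from your job runs and freshness results.

```python
def morning_check(status):
    """Toy summary for a 'morning check' dashboard. The status keys
    are illustrative, not a real dbt Cloud API."""
    problems = []
    if status["hours_since_success"] > 24:
        problems.append("last successful run is stale")
    if status["test_failures"] > 0:
        problems.append(f"{status['test_failures']} failing test(s)")
    if not status["sources_fresh"]:
        problems.append("source freshness check failed")
    if not problems:
        return "all green, grab your coffee"
    return "start debugging: " + "; ".join(problems)

print(morning_check({"hours_since_success": 2,
                     "test_failures": 0,
                     "sources_fresh": True}))  # all green, grab your coffee
```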

πŸ”΅πŸŸ’ Blue-Green Deployments

Imagine you have two identical kitchens β€” Kitchen A (Blue) and Kitchen B (Green). While Kitchen A serves customers, Kitchen B quietly prepares the new menu. Once Kitchen B is ready and taste-tested, you swap β€” customers now eat from Kitchen B, and Kitchen A becomes the prep area. Customers never notice the change. Zero downtime, zero risk!

How It Works

🔵 Blue (live, serving) → 🟢 Green (build, staging) → 🧪 Test Green (validate) → 🔄 Swap Schemas (atomic) → 🟢 Green (live!, serving)
Blue-Green deployment script
# Step 1: Build everything in a staging schema
# (assumes a generate_schema_name macro that appends the schema_suffix var)
$ dbt run --target prod --vars '{"schema_suffix": "_staging"}'

# Step 2: Run ALL tests against the staging schema
$ dbt test --target prod --vars '{"schema_suffix": "_staging"}'

# Step 3: If all tests pass, swap the schemas (atomic operation)
ALTER SCHEMA prod_analytics RENAME TO prod_analytics_old;
ALTER SCHEMA prod_analytics_staging RENAME TO prod_analytics;

# Step 4: Drop the old schema (optional, keep as backup)
-- DROP SCHEMA prod_analytics_old CASCADE;

# Result: Zero downtime! Dashboards never saw stale or partial data.

Why Blue-Green Matters

Without Blue-Green: During a dbt run, some tables are updated and some aren't. If a dashboard queries at that exact moment, it sees a mix of old and new data β€” like reading a book where half the pages are from the old edition and half from the new one.

With Blue-Green: The swap is atomic β€” one instant, everything changes at once. Dashboards always see a consistent snapshot. It's the gold standard for production deployments.

Blue-green deployments are an advanced pattern. Most teams don't need this on day one. Start with simple scheduled deployments. Once your project grows to 100+ models and your dashboards are business-critical, consider blue-green for zero-downtime deployments.

🧠 Quick Quiz β€” Test Your Understanding!

Let's see if you're ready to ship it. Click the answer you think is correct:

Question 1: Why should you NEVER run dbt run --target prod from your laptop?

A) It's slower than running from the cloud
B) Production deployments should go through automated CI/CD to prevent accidental changes
C) dbt doesn't support the --target flag locally
D) Your laptop doesn't have enough RAM

Question 2: What does state:modified+ do in Slim CI?

A) Runs all models in the project
B) Only runs models that have been deleted
C) Runs only the models you changed plus everything downstream
D) Runs models in reverse order

Question 3: In a blue-green deployment, what happens during the "swap"?

A) All tables are dropped and rebuilt from scratch
B) The database is taken offline for maintenance
C) The staging schema is atomically renamed to production β€” zero downtime
D) Users are redirected to a backup database

Can you explain the difference between dev and prod environments? Can you describe what CI/CD does and why it matters? Do you know where to store secrets and why you should never hardcode them? If yes, you're ready to ship it!

  • Dev vs Prod β€” Separate schemas keep experiments away from real dashboards
  • profiles.yml targets β€” Define dev and prod connections with different schemas
  • target.name β€” Use Jinja to make models behave differently in dev vs prod
  • CI/CD β€” Automated testing on PRs (CI) and automated deployment on merge (CD)
  • env_var() β€” Never hardcode credentials; use environment variables
  • Monitoring β€” Track execution time, test failures, and data freshness
  • Production commands β€” dbt seed β†’ dbt run β†’ dbt test (or dbt build)
  • dbt Cloud β€” Managed deployment with scheduled jobs and built-in IDE
  • GitHub Actions β€” Free CI/CD with state:modified for Slim CI
  • Airflow + dbt β€” Orchestrating dbt as part of a larger pipeline
  • Cosmos package β€” One Airflow task per dbt model for granular retries
  • Blue-green deployments β€” Zero-downtime schema swaps for critical pipelines
  • Kubernetes CronJobs β€” Container-based dbt scheduling