LESSON 10 ยท INTERMEDIATE

๐Ÿ“ฆ Packages & Seeds โ€” Standing on Giants' Shoulders

Why reinvent the wheel? Use community packages and load reference data with seeds.

๐Ÿ“ฆ What are dbt Packages?

Packages are like pre-made spice mixes. Instead of measuring out 12 individual spices every time you cook, you grab a jar of "Italian Seasoning" that someone already mixed perfectly. dbt packages give you pre-built macros and tests that save you hours of work.

Imagine you need to create a unique ID by combining three columns. You could spend 30 minutes writing and testing a macro from scratch. Or you could install dbt_utils and use generate_surrogate_key in 10 seconds. That's the power of packages!

How Packages Work

dbt packages are community-built, version-controlled collections of macros, tests, and models that anyone can install and use. Think of them like apps on your phone โ€” someone else built them, you just install and enjoy.

๐Ÿ‘ฉโ€๐Ÿ’ป
Developer
writes package
Step 1
โ†’
๐ŸŒ
Publishes to
dbt Hub
Step 2
โ†’
๐Ÿ“‹
You add to
packages.yml
Step 3
โ†’
โฌ‡๏ธ
dbt deps
installs it
Step 4
โ†’
๐Ÿš€
Use macros
& tests!
Step 5
4,000+
packages available on dbt Hub
90%
of dbt projects use at least one package
10x
faster development with community packages

Before writing any macro or custom test from scratch, search hub.getdbt.com first. Chances are, someone has already built exactly what you need โ€” battle-tested and used by thousands of teams!

โฌ‡๏ธ Installing Packages

dbt deps is like going to the grocery store โ€” it downloads all the ingredients (packages) you listed in your shopping list (packages.yml). You write down what you need, run the command, and dbt fetches everything for you. No manual downloading, no hunting around the internet!

Step 1: Create packages.yml

Create a file called packages.yml in the root of your dbt project (same level as dbt_project.yml):

packages.yml
packages:
  # The Swiss Army Knife โ€” macros for everything
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]

  # Advanced data quality tests (Great Expectations-style)
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

  # Compare model versions during refactoring
  - package: dbt-labs/audit_helper
    version: [">=0.9.0", "<1.0.0"]

  # Generate YAML and SQL boilerplate automatically
  - package: dbt-labs/codegen
    version: [">=0.12.0", "<1.0.0"]

Version pinning explained: The syntax [">=1.1.0", "<2.0.0"] means "any version from 1.1.0 up to (but not including) 2.0.0." This protects you from breaking changes. It's like saying "I want Italian Seasoning version 1, not version 2 where they changed the recipe!"

Step 2: Run dbt deps

Terminal
$ dbt deps

Installing dbt-labs/dbt_utils...
  Installed from version 1.1.1
Installing calogica/dbt_expectations...
  Installed from version 0.10.3
Installing dbt-labs/audit_helper...
  Installed from version 0.9.0
Installing dbt-labs/codegen...
  Installed from version 0.12.1

Successfully installed 4 packages.

Where Do Packages Go?

Packages are downloaded into the dbt_packages/ folder (previously dbt_modules/). This folder is auto-generated โ€” never edit files inside it.

๐Ÿ“ my_dbt_project/ ๐Ÿ“„ dbt_project.yml ๐Ÿ“„ packages.yml โ† your shopping list ๐Ÿ“ dbt_packages/ โ† auto-generated by dbt deps ๐Ÿ“ dbt_utils/ ๐Ÿ“ dbt_expectations/ ๐Ÿ“ audit_helper/ ๐Ÿ“ codegen/ ๐Ÿ“ models/ ๐Ÿ“ seeds/

Add dbt_packages/ to your .gitignore! These files are downloaded, not written by you. It's like not checking your grocery bags into version control โ€” you just keep the shopping list (packages.yml) and re-download when needed.

๐Ÿ”ง dbt_utils โ€” The Swiss Army Knife

dbt_utils is like a kitchen multi-tool โ€” it has a knife, bottle opener, screwdriver, and more. You'll use it in almost every project. It's the single most popular dbt package, installed by virtually every team on the planet. If dbt is the kitchen, dbt_utils is the drawer full of essential gadgets.

1. generate_surrogate_key โ€” Creating Unique IDs

Sometimes your data doesn't have a single unique column. You need to combine multiple columns into one unique ID. This macro does it for you:

models/staging/stg_order_items.sql
SELECT
    {{ dbt_utils.generate_surrogate_key([
        'order_id',
        'product_id'
    ]) }} AS order_item_id,
    order_id,
    product_id,
    quantity,
    unit_price
FROM
    {{ source('raw_shop', 'order_items') }}

-- Result: order_item_id = 'a1b2c3d4e5f6...' (a hash of order_id + product_id)
-- Every combination of order + product gets a unique, consistent ID!

Why not just concatenate? You could write order_id || '-' || product_id, but what if order_id is NULL? The whole thing becomes NULL. generate_surrogate_key handles NULLs, data types, and edge cases automatically. It's like using a professional blender instead of mashing things with a fork.

2. pivot / unpivot โ€” Reshaping Data

Turn rows into columns (pivot) or columns into rows (unpivot):

Pivot: rows โ†’ columns
-- Turn this:  customer_id | month    | revenue
--             1           | January  | 500
--             1           | February | 300
--
-- Into this:  customer_id | January | February
--             1           | 500     | 300

SELECT
    customer_id,
    {{ dbt_utils.pivot(
        'month',
        dbt_utils.get_column_values(
            table=ref('monthly_revenue'),
            column='month'
        ),
        agg='sum',
        then_value='revenue'
    ) }}
FROM {{ ref('monthly_revenue') }}
GROUP BY customer_id

3. date_spine โ€” Generating Date Ranges

Need a table with every single date between two points? This is your friend:

models/utilities/dim_dates.sql
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2030-12-31' as date)"
) }}

-- Generates a table with one row per day:
-- date_day
-- 2020-01-01
-- 2020-01-02
-- 2020-01-03
-- ... (4,017 rows)
-- 2030-12-31

4. get_column_values โ€” Dynamic Column Lists

Get all unique statuses dynamically
-- Instead of hardcoding ['active', 'inactive', 'suspended']...
-- Let dbt figure it out from the actual data!

{% set statuses = dbt_utils.get_column_values(
    table=ref('stg_customers'),
    column='status'
) %}

-- statuses = ['active', 'inactive', 'suspended', 'pending']
-- Now use this list in a pivot, CASE WHEN, or anywhere else!

5. star โ€” Select All Columns Except Some

Select everything except sensitive columns
SELECT
    {{ dbt_utils.star(
        from=ref('stg_customers'),
        except=["email", "phone_number", "ssn"]
    ) }}
FROM
    {{ ref('stg_customers') }}

-- Selects ALL columns EXCEPT email, phone_number, and ssn
-- Perfect for creating PII-free views!

The dbt_utils package has 50+ macros. Don't try to memorize them all โ€” just know they exist. When you hit a problem, check the dbt_utils README first. Odds are, there's a macro for it!

๐Ÿ” dbt_expectations โ€” The Quality Inspector

If built-in tests are like a basic health check (temperature, blood pressure), then dbt_expectations is like a full medical exam with an MRI, blood work, and X-rays. It brings dozens of specialized tests inspired by the Python library Great Expectations โ€” giving you surgical precision over data quality.

Popular dbt_expectations Tests

schema.yml โ€” Using dbt_expectations
models:
  - name: stg_orders
    columns:
      - name: amount
        tests:
          # Every order amount must be between $0 and $100,000
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000

          # Average order value should be between $10 and $500
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 10
              max_value: 500

      - name: email
        tests:
          # Every email must match the pattern *@*.*
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[^@]+@[^@]+\\.[^@]+$"

    tests:
      # Table should have between 1,000 and 10,000,000 rows
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
          max_value: 10000000

When to Use dbt_expectations vs Built-in Tests

Scenario Built-in Test dbt_expectations
Column has no NULLsnot_null โœ…Overkill
Column is uniqueunique โœ…Overkill
Values within a rangeCustom SQL testexpect_column_values_to_be_between โœ…
Row count sanity checkCustom SQL testexpect_table_row_count_to_be_between โœ…
Regex pattern matchingCustom SQL testexpect_column_values_to_match_regex โœ…
Statistical distributionNot possibleexpect_column_mean_to_be_between โœ…

๐Ÿง  Rule of Thumb

Start with the 4 built-in tests (unique, not_null, accepted_values, relationships). When you need something they can't do โ€” range checks, pattern matching, row count validation, statistical tests โ€” reach for dbt_expectations.

๐ŸŒŸ Other Popular Packages

The dbt ecosystem is rich with community packages. Here are the ones you'll encounter most often:

๐Ÿ” audit_helper

Compare two versions of a model side-by-side. Perfect for refactoring โ€” did your changes break anything?

Use when: Rewriting a model and need to verify the output hasn't changed

โšก codegen

Auto-generate schema.yml files and base model SQL from your sources. Saves hours of boilerplate typing.

Use when: Setting up a new project or adding new sources

๐Ÿ“… dbt_date

Date utility macros: fiscal years, business days, date dimensions, and more.

Use when: Building date dimension tables or working with fiscal calendars

๐Ÿ—„๏ธ dbt_external_tables

Manage external tables (S3, GCS, Azure Blob) as dbt sources with schema evolution support.

Use when: Your raw data lives in cloud object storage

๐Ÿ“Š re_data

Automated data monitoring and anomaly detection. Tracks metrics over time and alerts on drift.

Use when: You want continuous data observability without manual checks

๐Ÿท๏ธ dbt_meta_testing

Enforce that every model has descriptions, tests, and documentation. Keeps your project tidy.

Use when: You want to enforce documentation standards across the team

Using codegen to auto-generate schema.yml
-- Run this in a dbt compile or statement:

{{ codegen.generate_model_yaml(
    model_names=['stg_customers', 'stg_orders']
) }}

-- Output: a complete schema.yml with all columns listed!
-- Copy-paste it into your YAML file and add descriptions.
-- Saves 15+ minutes per model.

You don't need all these packages on day one. Start with dbt_utils (everyone needs it) and dbt_expectations (for quality). Add others as your project grows. It's like building a toolbox โ€” you don't buy every tool at the hardware store, just the ones you need for the current job!

๐ŸŒฑ Seeds โ€” Your Pantry Staples

Seeds are like the salt, pepper, and olive oil in your pantry โ€” small, rarely changing reference data that you always need handy. Country codes, status mappings, department names, currency conversion rates... these don't change every day, and they're too small to warrant a full ETL pipeline. Just put them in a CSV file, and dbt loads them into your warehouse!

What are Seeds?

Seeds are CSV files that live in your dbt project's seeds/ folder. When you run dbt seed, dbt reads these CSV files and creates actual tables in your data warehouse. That's it โ€” no ETL pipeline, no Fivetran connector, no API calls. Just a CSV file and one command.

Creating a Seed

Step 1: Create a CSV file in the seeds/ folder:

seeds/country_codes.csv
country_code,country_name,region
US,United States,North America
GB,United Kingdom,Europe
DE,Germany,Europe
JP,Japan,Asia
IN,India,Asia
BR,Brazil,South America
AU,Australia,Oceania
NG,Nigeria,Africa
seeds/status_mappings.csv
status_code,status_label,is_active
1,Active,true
2,Inactive,false
3,Suspended,false
4,Pending Review,false
5,Archived,false

Step 2: Run dbt seed to load them into your warehouse:

Terminal
$ dbt seed

Running seed country_codes............. OK [INSERT 8 rows in 0.42s]
Running seed status_mappings.......... OK [INSERT 5 rows in 0.31s]

Completed successfully. 2 seeds loaded.

# Load a specific seed only
$ dbt seed --select country_codes

# Full refresh (drop and recreate the table)
$ dbt seed --full-refresh

Using Seeds in Your Models

Once loaded, seeds are referenced with ref() โ€” just like any other model:

models/staging/stg_customers_enriched.sql
SELECT
    c.customer_id,
    c.first_name,
    c.country_code,
    cc.country_name,
    cc.region
FROM
    {{ ref('stg_customers') }} c
LEFT JOIN
    {{ ref('country_codes') }} cc
    --  โ†‘ This is a SEED, but you reference it with ref() just like a model!
    ON c.country_code = cc.country_code

Seeds appear in your DAG! When you use ref('country_codes'), dbt adds the seed to your dependency graph. It knows to load the seed before building any model that depends on it. Everything stays in the right order.

Configuring Seeds in dbt_project.yml

dbt_project.yml
seeds:
  my_project:
    +schema: seeds              # Load into a 'seeds' schema
    +quote_columns: false       # Don't quote column names

    country_codes:
      +column_types:
        country_code: varchar(2)  # Override auto-detected types
        country_name: varchar(100)

โš–๏ธ Seeds vs Sources โ€” When to Use Which?

Seeds are like the spice rack in your kitchen โ€” small jars of things that rarely change. Sources are like the fridge โ€” big, constantly restocked with fresh ingredients. You wouldn't put a whole turkey on the spice rack, and you wouldn't keep salt in the fridge. Each has its place!

Criteria ๐ŸŒฑ Seeds ๐Ÿ“ก Sources
Size Small (< 1,000 rows) Any size (millions of rows)
Change frequency Rarely (monthly/quarterly) Frequently (hourly/daily)
Where it lives CSV in your dbt repo (version-controlled) Already in your warehouse (loaded by ETL)
Who maintains it Analytics engineer (you!) Data engineer / ETL pipeline
How to load dbt seed Fivetran, Airbyte, custom ETL
Examples Country codes, status mappings, department names Orders, customers, transactions, events
How to reference ref('seed_name') source('name', 'table')

โœ… Good Seed Examples

  • Country codes (250 rows)
  • US state abbreviations (50 rows)
  • Department name mappings (20 rows)
  • Status code lookups (10 rows)
  • Currency exchange rates (monthly snapshot)
  • Employee-to-manager mappings (small org)

โŒ Bad Seed Examples

  • Customer list (100K+ rows)
  • Transaction history (millions of rows)
  • Real-time event data
  • Anything that changes daily
  • Data with PII (emails, SSNs)
  • Large product catalogs (10K+ items)

โš ๏ธ The 1,000-Row Rule

If your CSV has more than ~1,000 rows, it probably shouldn't be a seed. Large seeds slow down dbt seed and bloat your git repository. Use a proper ETL tool to load large datasets as sources instead.

Seeds are version-controlled โ€” that's their superpower! When you update a country code mapping, the change shows up in a git pull request. Your team can review it, approve it, and you have a full history of every change. Try doing that with a raw database table!

๐Ÿ—๏ธ Building Your Own Package

Once you've written macros that you use across multiple projects, it's time to package them up and share them โ€” either within your company or with the whole dbt community. It's like writing a cookbook: you've perfected your recipes, now publish them so others can benefit!

When to Build a Package

You should build a package when:

โœ… You have macros used in 3+ projects

โœ… Multiple teams copy-paste the same code

โœ… Your macros are generic enough to work in any project

โœ… You want to enforce company-wide standards

Package Structure

๐Ÿ“ my_company_dbt_utils/ ๐Ÿ“„ dbt_project.yml โ† package metadata ๐Ÿ“ macros/ ๐Ÿ“„ generate_schema_name.sql ๐Ÿ“„ cents_to_dollars.sql ๐Ÿ“„ clean_email.sql ๐Ÿ“ tests/ ๐Ÿ“„ test_is_valid_email.sql ๐Ÿ“„ README.md

Using a Git Package (Private)

You don't have to publish to dbt Hub. You can install packages directly from a Git repository:

packages.yml โ€” Installing from Git
packages:
  # Public packages from dbt Hub
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]

  # Private company package from GitHub
  - git: "https://github.com/my-company/our-dbt-utils.git"
    revision: v1.2.0    # Pin to a specific tag or branch

  # Or use a local path (for development)
  - local: ../my-local-package

Publishing to dbt Hub: If your package is useful to the community, you can publish it on hub.getdbt.com. It's free, and your package becomes installable by anyone with a simple package: your-name/your-package entry. Many popular packages started as internal tools that teams decided to open-source!

๐Ÿง  Quick Quiz โ€” Test Your Understanding!

Let's see if you can stand on giants' shoulders. Click the answer you think is correct:

Question 1: What command installs packages listed in packages.yml?

A) dbt install
B) dbt packages install
C) dbt deps
D) dbt get

Question 2: Which is a good use case for a seed?

A) Loading 5 million customer records
B) A 50-row country code lookup table
C) Real-time transaction data that changes every second
D) A 500,000-row product catalog

Question 3: How do you reference a seed in a dbt model?

A) {{ source('seeds', 'country_codes') }}
B) {{ seed('country_codes') }}
C) {{ ref('country_codes') }}
D) FROM seeds.country_codes

Can you explain to a friend what dbt packages are and why you'd use them? Can you name 3 macros from dbt_utils? Do you know when to use a seed vs a source? If yes, you've nailed this lesson!

  • Packages โ€” Community-built collections of macros and tests; installed via packages.yml
  • dbt deps โ€” Command to download/install all packages
  • dbt_utils โ€” The most popular package: surrogate keys, pivots, date spines, and 50+ macros
  • Seeds โ€” CSV files in seeds/ that dbt loads into your warehouse with dbt seed
  • ref() for seeds โ€” Seeds are referenced with ref('seed_name'), just like models
  • Seeds vs Sources โ€” Seeds = small, stable, version-controlled; Sources = large, frequently updated
  • 1,000-row rule โ€” If a CSV has more than ~1,000 rows, use a source instead of a seed
  • dbt_expectations โ€” Advanced testing package with Great Expectations-style tests
  • audit_helper โ€” Comparing model versions during refactoring
  • codegen โ€” Auto-generating YAML and SQL boilerplate
  • Building your own package โ€” Publishing reusable macros to Git or dbt Hub
  • Git packages โ€” Installing private packages from GitHub repos
{PW}