MODULE 2 OF 15

Your First DAG

Write, run, and monitor your very first Airflow pipeline from scratch!

Our Running Example: John's Rocket Launch Pipeline

Meet John! John loves space and wants to download pictures of every rocket launch. His pipeline does three things: (1) download_launches — fetches launch data from the Launch Library API, (2) get_pictures — extracts and downloads the cool rocket images, and (3) notify — sends him an email when it's done. We'll use this example throughout the module!

What is a DAG File?

A DAG file is a Python file that describes your entire pipeline. It lives in Airflow's dags/ folder, and the scheduler reads it to understand what tasks to run and in what order.

Explain Like I'm 5

Imagine you're writing a recipe for making a cake. The recipe (DAG file) doesn't bake the cake itself — it tells Mom what to do and in what order: mix flour, add eggs, bake for 30 minutes. The DAG file is your "recipe" for the pipeline. Mom (the Airflow worker) reads it and does the actual work!

The Recipe Analogy

DAG file = The recipe card. It says "Step 1: download launches, Step 2: get pictures, Step 3: notify." It's just text (Python code) until someone (Airflow) reads it and executes it. The file itself never runs—Airflow runs what the file describes!

Tasks vs Operators

An Operator is a blueprint (a recipe). A Task is the actual instance (the cake you baked from that recipe). When you write BashOperator(task_id="download_launches", bash_command="curl ..."), you're creating one task from the BashOperator blueprint.

Explain Like I'm 5

Think of a cookie cutter (Operator) and the actual cookies (Tasks). The cookie cutter is the BashOperator — it always makes the same shape (runs a shell command). But each cookie you cut out is different: one might say "download_launches," another "get_pictures." Those are your tasks. Same blueprint (BashOperator), different tasks (different task_ids and commands)!

Blueprint vs Instance

Operator = The cookie cutter, the mold, the template. BashOperator knows "I run shell commands." PythonOperator knows "I run Python functions." They define what kind of work gets done.
Task = The specific job. download_launches is a task. get_pictures is another task. Each task has a unique task_id and its own parameters. John's pipeline has 3 tasks: download_launches, get_pictures, notify.
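The cookie-cutter idea can be sketched in a few lines of plain Python. This is a toy class, not Airflow's real BashOperator — just the idea of one blueprint stamping out several distinct tasks:

```python
# Toy sketch (not Airflow's real class): one Operator "blueprint"
# producing several distinct tasks, each with its own task_id.
class BashOperator:
    def __init__(self, task_id, bash_command):
        self.task_id = task_id            # unique name of this task
        self.bash_command = bash_command  # the work this task does

# Three tasks from the same blueprint -- three cookies from one cutter
download = BashOperator("download_launches", "curl -o launches.json ...")
pictures = BashOperator("get_pictures", "python get_pictures.py")
notify = BashOperator("notify", 'echo "done"')

print([t.task_id for t in (download, pictures, notify)])
# → ['download_launches', 'get_pictures', 'notify']
```

Same class, three instances: that's the whole Operator-vs-Task distinction.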

BashOperator (Running Shell Commands)

BashOperator lets you run any shell command from your DAG. John uses it for download_launches — he runs curl to fetch launch data from the Launch Library API. Think of it as "run this command in a terminal."

Explain Like I'm 5

BashOperator is like asking Mom to press a button on a machine. "Press the 'download' button." Mom (the worker) does it. The machine (your command) does the actual work. You're just telling Airflow what command to run.

BashOperator Examples

1. curl (download from API): bash_command="curl -o launches.json https://ll.thespacedevs.com/2.2.0/launch/" — John uses this to fetch rocket launch data.

2. echo (print messages): bash_command='echo "Pipeline started!"' — Simple debugging or notifications.

3. File operations: bash_command="mkdir -p /tmp/launches && ls -la" — Create folders, move files, count lines with wc -l.

PythonOperator (Running Python Functions)

PythonOperator runs a Python function you define. John uses it for get_pictures — he writes a function that reads the launch JSON, finds image URLs, and downloads them. Way more flexible than a single shell command!

Explain Like I'm 5

BashOperator = "Press this one button." PythonOperator = "Follow this recipe step by step." With Python, you can use loops, if-statements, libraries like requests or pandas. Mom (the worker) runs your whole Python function, not just one command.

The python_callable Pattern

You write a normal Python function (e.g. def get_pictures():), then pass it to PythonOperator as python_callable=get_pictures. Airflow calls your function when the task runs. The function must be importable — defined at the top level of your DAG file or in a separate module.

python_callable Rule

Avoid lambdas! python_callable=my_func works; python_callable=lambda: 1+1 can break, because a lambda has no importable name and can't be pickled when Airflow needs to serialize the callable. Define your logic in a real, named function!
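You can see the problem directly with Python's pickle module: a lambda has no importable name, so pickling it fails, while a top-level named function (get_pictures here is just a stand-in) normally serializes fine:

```python
# Demo: why named functions beat lambdas as callables.
# A lambda cannot be pickled, so any serialization path that
# pickles your callable will break on it.
import pickle

def get_pictures():
    # stand-in for John's real download logic
    return "downloading images"

try:
    pickle.dumps(lambda: 1 + 1)   # no importable name -> pickle refuses
    lambda_picklable = True
except Exception:
    lambda_picklable = False

print("lambda picklable?", lambda_picklable)  # → lambda picklable? False
```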

Setting Dependencies with >> and <<

The >> and << operators set execution order. task_a >> task_b means "run task_a first, then task_b"; task_b << task_a expresses the same relationship in reverse. You can chain: download_launches >> get_pictures >> notify.

The Assembly Line

download_launches >> get_pictures >> notify is like a factory: first you get the parts (download), then you assemble (get pictures), then you ship (notify). Each step waits for the previous one to finish. The >> arrow points in the direction of flow!
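Under the hood, >> is plain Python operator overloading: Airflow's BaseOperator defines __rshift__ to record the dependency and return the right-hand task, which is what makes chaining work. A toy sketch of the mechanism (not the real class):

```python
# Toy model of Airflow's `>>`: __rshift__ records the edge and
# returns the right-hand task so chains like a >> b >> c keep going.
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []   # tasks that run after this one

    def __rshift__(self, other):
        self.downstream.append(other)
        return other           # returning `other` lets the chain continue

download = Task("download_launches")
pictures = Task("get_pictures")
notify = Task("notify")

download >> pictures >> notify

print(download.downstream[0].task_id)  # → get_pictures
print(pictures.downstream[0].task_id)  # → notify
```

The chain works because `download >> pictures` evaluates to `pictures`, so the next `>> notify` hangs the third task off the second, not the first.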

default_args (Reusable Defaults)

default_args is a dictionary you pass to the DAG. Every task in that DAG inherits these settings: retries, retry_delay, owner, email_on_failure, etc. No need to repeat the same values for every task!

Explain Like I'm 5

Like a class rule: "In this classroom, everyone gets 2 retries and 5 minutes between retries." You say it once. Every task (student) follows it automatically unless they have a special override.

DAG Context Manager: with DAG(...) as dag

Instead of dag = DAG(...) and passing dag=dag to every operator, use with DAG(...) as dag:. Everything inside the with block automatically belongs to that DAG. Cleaner and less error-prone!

Pro Tip

When you use with DAG(...) as dag:, you don't need to pass dag=dag to each operator in Airflow 2+. The operator picks up the DAG from context. But if you create tasks outside the block, you must pass dag=dag explicitly.
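Here's a toy model of how the with block pulls this off — roughly what Airflow's internal DagContext does. These DAG and Operator classes are stand-ins for illustration, not the real ones:

```python
# Toy model of `with DAG(...) as dag`: the DAG pushes itself onto a
# context stack on __enter__, and operators created inside the block
# pick it up automatically.
class DAG:
    _context = []  # stack of currently-open DAGs

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = []

    def __enter__(self):
        DAG._context.append(self)
        return self

    def __exit__(self, *exc_info):
        DAG._context.pop()

class Operator:
    def __init__(self, task_id, dag=None):
        # an explicit dag= wins; otherwise grab the enclosing `with` DAG
        self.dag = dag or (DAG._context[-1] if DAG._context else None)
        self.task_id = task_id
        if self.dag:
            self.dag.tasks.append(self)

with DAG("my_first_dag") as dag:
    hello = Operator("say_hello")   # no dag=dag needed inside the block

print(hello.dag.dag_id)  # → my_first_dag
print(len(dag.tasks))    # → 1
```

An Operator created outside the with block would get dag=None — which is exactly why tasks defined outside the block need an explicit dag=dag.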

Handling Task Failures

When a task fails: (1) Downstream tasks don't run — if get_pictures fails, notify never runs. (2) Retries — with retries=2, Airflow tries the task up to 3 times (initial + 2 retries) before marking it failed. (3) Clearing — In the UI, you can "clear" a failed task to reset it. Airflow will re-run it (and downstream tasks if you choose).

What Happens When a Task Fails

1. Retry — Airflow waits retry_delay (e.g. 5 minutes), then runs the task again. Repeats up to retries times.

2. Mark Failed — After all retries are used, the task is marked failed. Downstream tasks stay in "upstream failed" state.

3. Clear & Re-run — Fix the bug, then in the Airflow UI: click the task → Clear → confirm. Airflow re-runs that task. You can optionally clear downstream tasks too to re-run the whole chain.
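The retry arithmetic — retries=2 means three total attempts — can be sketched as a plain loop. This ignores retry_delay (real Airflow waits between attempts); flaky_download is a made-up stand-in that fails twice before succeeding:

```python
# Sketch of the retry rule: retries=2 means up to 3 total attempts
# (the initial run plus 2 retries) before the task is marked failed.
def run_with_retries(task, retries=2):
    last_error = None
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc  # real Airflow also waits retry_delay here
    raise last_error          # all attempts used: task is "failed"

attempts = []
def flaky_download():
    attempts.append("try")
    if len(attempts) < 3:
        raise RuntimeError("API timeout")  # fails twice, then succeeds
    return "launches.json"

print(run_with_retries(flaky_download))  # → launches.json
print(len(attempts))                     # → 3
```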

Always Check Logs

When a task fails, click it in the Graph or Grid view and open the Log tab. The logs show the exact error (Python traceback, command output, etc.). That's your first step to debugging!

John's Rocket Launch DAG (3 Tasks)

John's pipeline: download_launches → get_pictures → notify. Follow the arrows to see the execution order:

download_launches → get_pictures → notify

John's rocket launch pipeline: download → get pictures → notify

Parallel Tasks in a DAG

DAGs can have branches! Independent tasks can run at the same time (in parallel), saving time:

Branch 1 (parallel): Fetch Weather → Clean Weather
Branch 2 (parallel): Fetch Sales → Clean Sales
Both branches feed into: Join Data → Train Model

Weather and Sales branches run in parallel, then join together

Why Parallel Matters

If "Fetch Weather" takes 5 minutes and "Fetch Sales" takes 3 minutes, running them in parallel means the total is only 5 minutes (not 8). The Airflow scheduler automatically runs tasks in parallel whenever their dependencies allow it (up to your configured parallelism limits).
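You can see the timing math with two scaled-down "fetch" branches run on threads. This demonstrates the wall-clock arithmetic, not Airflow's actual executor; one "minute" is scaled to 50 ms so the demo finishes instantly:

```python
# Demo: two independent branches run concurrently take as long as the
# slower one (~5 "minutes"), not the sum of both (~8 "minutes").
import threading
import time

MINUTE = 0.05  # seconds; 1 "minute" scaled down to 50 ms for the demo

def fetch(name, minutes, results):
    time.sleep(minutes * MINUTE)
    results[name] = f"{name} done in {minutes} min"

results = {}
start = time.perf_counter()
branches = [
    threading.Thread(target=fetch, args=("weather", 5, results)),
    threading.Thread(target=fetch, args=("sales", 3, results)),
]
for t in branches:
    t.start()
for t in branches:
    t.join()
elapsed = time.perf_counter() - start

print(sorted(results))  # → ['sales', 'weather']
# elapsed is roughly 5 * MINUTE, well under the 8 * MINUTE a
# sequential run would need
```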

Airflow Architecture

How Airflow processes a DAG, component by component:

DAG File → Scheduler → Task Queue → Worker → Results in DB → Web UI

The Airflow Web UI

Airflow comes with a full-featured web interface. Here's what you'll see:

Key Views in the Airflow UI

DAGs List View

The home page. Shows all your DAGs with their status (running, paused, failed), schedule, owner, and recent task statuses as colored circles.

Graph View

Shows the structure of a DAG as a visual graph — tasks as boxes and dependencies as arrows. Color-coded by task state (green = success, red = failed, yellow = running).

Grid View (formerly Tree View)

Shows the history of all runs over time. Each column is a DAG run, each row is a task. This is the most powerful view for debugging — you can see patterns like "this task fails every Monday."

Code View

Shows the actual Python code of your DAG file directly in the browser. Great for quick debugging without opening your IDE.

Who Uses Airflow?

Airbnb

Created Airflow! Uses it for search ranking, pricing, and host analytics pipelines.

Spotify

Runs 20,000+ DAGs for recommendation engines and content analytics.

Uber

Orchestrates ML model training, pricing calculations, and driver analytics.

Netflix

Manages data pipelines that power content recommendations for 200M+ users.

Installing Airflow

Let's get Airflow running on your machine. There are two ways:

Option 1: pip install (simplest)

# Create a virtual environment first (always a good idea!)
python -m venv airflow_env
source airflow_env/bin/activate  # On Mac/Linux

# Install Airflow (pin versions with the official constraint file;
# the Python version in the URL — here 3.9 — should match your interpreter)
pip install apache-airflow==2.8.0 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt"

# Initialize the metadata database
# (on Airflow 2.7+, `airflow db migrate` is the preferred command)
airflow db init

# Create an admin user
airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com

# Start the webserver (in one terminal)
airflow webserver --port 8080

# Start the scheduler (in another terminal)
airflow scheduler

Option 2: Docker (recommended for beginners)

# Pull and run the official Airflow Docker image
docker run -ti \
  -p 8080:8080 \
  --name airflow \
  apache/airflow:2.8.0 \
  standalone

# That's it! Open http://localhost:8080
# Login: admin / admin

What Just Happened?

pip install = You downloaded Airflow onto your computer, like installing an app from the App Store.

db init = You set up a little database where Airflow will remember everything (like creating a notebook to keep notes).

users create = You created your login account for the Airflow website.

webserver = Starts the website you can see in your browser.

scheduler = Starts the brain that decides when to run your pipelines.

Your First Look at a DAG File

Here's the simplest possible DAG — don't worry, we'll break down every line:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Define the DAG
with DAG(
    dag_id="my_first_dag",           # The name of your pipeline
    start_date=datetime(2024, 1, 1),  # When to start scheduling
    schedule="@daily",                # Run once every day
    catchup=False,                    # Don't run for past dates
) as dag:

    # Task 1: Print hello
    hello = BashOperator(
        task_id="say_hello",
        bash_command='echo "Hello, Airflow! Today is $(date)"',
    )

    # Task 2: Print goodbye
    goodbye = BashOperator(
        task_id="say_goodbye",
        bash_command='echo "Goodbye! Pipeline complete."',
    )

    # Set the order: hello runs first, then goodbye
    hello >> goodbye

Line-by-Line Breakdown

1. Imports
from airflow import DAG — We import the DAG class. This is the container for your entire pipeline.
from airflow.operators.bash import BashOperator — We import BashOperator so we can run shell commands.

2. DAG Definition
with DAG(...) as dag: — This creates a DAG. Everything indented inside the with block belongs to this DAG.
dag_id = unique name shown in the UI.
start_date = Airflow won't schedule runs before this date.
schedule="@daily" = run once per day at midnight.

3. Tasks
Each BashOperator is a task. It runs a bash command. The task_id is the unique name of that task within the DAG.

4. Dependencies
hello >> goodbye means "run hello first, then goodbye." The >> operator sets the execution order. Think of it as an arrow: hello → goodbye.

Key Airflow CLI Commands

You'll use these commands all the time. Memorize them!

# List all your DAGs
airflow dags list

# Trigger a DAG manually
airflow dags trigger my_first_dag

# Test a specific task (without recording in DB)
airflow tasks test my_first_dag say_hello 2024-01-01

# List tasks in a DAG
airflow tasks list my_first_dag

# Check scheduler health
airflow jobs check

# View configuration
airflow config list

Pro Tip

airflow tasks test is your best friend during development! It runs a single task without the scheduler, without recording results, and shows you the output immediately. Perfect for debugging.

Practice Exercises

Try these exercises to solidify what you've learned. Don't peek at the answers until you've tried!

Exercise 1: Identify the DAG

Scenario

Your company needs a daily pipeline that does this:

1. Downloads sales data from an API

2. Downloads inventory data from a database

3. Merges sales and inventory data

4. Calculates daily KPIs

5. Sends an email report

Question: Draw this as a DAG. Which tasks can run in parallel?

Hint

Steps 1 and 2 don't depend on each other — they fetch from completely different sources. They can run at the same time!

Answer

Parallel branch 1: Download Sales Data
Parallel branch 2: Download Inventory Data
Both feed into: Merge Data → Calculate KPIs → Send Email

Tasks 1 and 2 run in parallel. Tasks 3, 4, and 5 run sequentially after both branches complete.

Exercise 2: Spot the Cycle

Scenario

Someone wrote this dependency chain:

Task A → Task B → Task C → Task A

Question: Is this a valid DAG? Why or why not?

Answer

No! This is NOT a valid DAG because it has a cycle (A → B → C → A). Task A can't run until Task C finishes, but Task C can't run until Task B finishes, and Task B can't run until Task A finishes. Nobody can start — deadlock!
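This deadlock is exactly what a DAG cycle check catches. Here's a sketch using Kahn's algorithm (one standard way to detect cycles — not necessarily Airflow's exact implementation): a task can only start when all its upstream tasks are done, and in a cycle no task ever reaches that state:

```python
# Cycle detection via Kahn's algorithm: repeatedly run any task whose
# upstream count has dropped to zero. If we can't run every task,
# the leftover tasks form a cycle -- a deadlock.
def find_runnable_order(edges, tasks):
    indegree = {t: 0 for t in tasks}
    for upstream, downstream in edges:
        indegree[downstream] += 1

    order = []
    ready = [t for t in tasks if indegree[t] == 0]  # can start right now
    while ready:
        task = ready.pop()
        order.append(task)
        for upstream, downstream in edges:
            if upstream == task:
                indegree[downstream] -= 1
                if indegree[downstream] == 0:
                    ready.append(downstream)

    return order if len(order) == len(tasks) else None  # None = cycle

cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
acyclic = [("A", "B"), ("B", "C")]

print(find_runnable_order(cyclic, ["A", "B", "C"]))   # → None (deadlock!)
print(find_runnable_order(acyclic, ["A", "B", "C"]))  # → ['A', 'B', 'C']
```

In the cyclic case, every task has at least one unfinished upstream task from the very start, so the ready list begins empty and nothing ever runs.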

Exercise 3: Write a DAG Skeleton

Challenge

Without looking at the code examples, try writing a DAG that:

1. Has dag_id = "practice_dag"

2. Starts from January 1, 2024

3. Runs weekly

4. Has 3 tasks: extract, transform, load

5. extract → transform → load

# Solution:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="practice_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command='echo "Extracting data..."',
    )

    transform = BashOperator(
        task_id="transform",
        bash_command='echo "Transforming data..."',
    )

    load = BashOperator(
        task_id="load",
        bash_command='echo "Loading data..."',
    )

    extract >> transform >> load

Module 2 Quiz

Test your understanding! Try to answer each question before moving on.

1. What does DAG stand for?

2. What is the role of the Airflow Scheduler?

3. Why must a DAG be "acyclic"?

4. Airflow is primarily a ___?

5. Which company originally created Apache Airflow?

6. What does the >> operator do in Airflow?

7. Which Airflow component actually executes the tasks?

8. Which is NOT a good use case for Airflow?