Write, run, and monitor your very first Airflow pipeline from scratch!
Meet John! John loves space and wants to download pictures of every rocket launch. His pipeline does three things: (1) download_launches — fetches launch data from the Launch Library API, (2) get_pictures — extracts and downloads the cool rocket images, and (3) notify — sends him an email when it's done. We'll use this example throughout the module!
A DAG file is a Python file that describes your entire pipeline. It lives in Airflow's dags/ folder, and the scheduler reads it to understand what tasks to run and in what order.
Imagine you're writing a recipe for making a cake. The recipe (DAG file) doesn't bake the cake itself — it tells Mom what to do and in what order: mix flour, add eggs, bake for 30 minutes. The DAG file is your "recipe" for the pipeline. Mom (the Airflow worker) reads it and does the actual work!
DAG file = The recipe card. It says "Step 1: download launches, Step 2: get pictures, Step 3: notify." It's just text (Python code) until someone (Airflow) reads it and executes it. The file itself never runs—Airflow runs what the file describes!
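To make the recipe concrete, here's a minimal sketch of what a DAG file's skeleton could look like, assuming a file saved as dags/rocket_pipeline.py (the dag_id and schedule are illustrative placeholders — we'll build the real pipeline step by step):

from datetime import datetime

from airflow import DAG

# The "recipe card": the scheduler reads this file to learn what the
# pipeline is called, when to run it, and which tasks it contains.
with DAG(
    dag_id="rocket_pipeline",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Task definitions go here -- see the operator examples below.
    ...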
An Operator is a blueprint (a recipe). A Task is the actual instance (the cake you baked from that recipe). When you write BashOperator(task_id="download_launches", bash_command="curl ..."), you're creating one task from the BashOperator blueprint.
Think of a cookie cutter (Operator) and the actual cookies (Tasks). The cookie cutter is the BashOperator — it always makes the same shape (runs a shell command). But each cookie you cut out is different: one might say "download_launches," another "get_pictures." Those are your tasks. Same blueprint (BashOperator), different tasks (different task_ids and commands)!
Operator = The cookie cutter, the mold, the template. BashOperator knows "I run shell commands." PythonOperator knows "I run Python functions." They define what kind of work gets done.
Task = The specific job. download_launches is a task. get_pictures is another task. Each task has a unique task_id and its own parameters. John's pipeline has 3 tasks: download_launches, get_pictures, notify.
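Here's the cookie-cutter idea as a sketch — one Operator class, two distinct tasks (the dag_id and echo commands are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="cookie_cutter_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    # Same cookie cutter (BashOperator), two different cookies (tasks):
    download_launches = BashOperator(
        task_id="download_launches",                    # unique per DAG
        bash_command='echo "fetching launch data..."',  # placeholder
    )
    get_pictures = BashOperator(
        task_id="get_pictures",
        bash_command='echo "downloading images..."',    # placeholder
    )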
BashOperator lets you run any shell command from your DAG. John uses it for download_launches — he runs curl to fetch launch data from the Launch Library API (see the task sketch after the examples below). Think of it as "run this command in a terminal."
BashOperator is like asking Mom to press a button on a machine. "Press the 'download' button." Mom (the worker) does it. The machine (your command) does the actual work. You're just telling Airflow what command to run.
bash_command="curl -o launches.json https://ll.thespacedevs.com/2.2.0/launch/" — John uses this to fetch rocket launch data.
bash_command='echo "Pipeline started!"' — Simple debugging or notifications.
bash_command="mkdir -p /tmp/launches && ls -la" — Create folders, move files, count lines with wc -l.
PythonOperator runs a Python function you define. John uses it for get_pictures — he writes a function that reads the launch JSON, finds image URLs, and downloads them. Way more flexible than a single shell command!
BashOperator = "Press this one button." PythonOperator = "Follow this recipe step by step." With Python, you can use loops, if-statements, libraries like requests or pandas. Mom (the worker) runs your whole Python function, not just one command.
You write a normal Python function (e.g. def get_pictures():), then pass it to PythonOperator as python_callable=get_pictures. Airflow calls your function when the task runs. The function must be importable — defined at the top level of your DAG file or in a separate module.
Avoid lambdas — use a named function. python_callable=my_func works. python_callable=lambda: 1+1 can break when the DAG is serialized, and it gives you an anonymous, hard-to-debug task in logs and tracebacks. Define your logic in a real, named function!
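Here's a minimal sketch of how John's get_pictures task might look. The function body, the file paths, and the JSON shape it assumes (an "image" URL on each result) are simplified stand-ins for the real logic:

import json
import urllib.request
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator


def _get_pictures():
    # A named, top-level function -- importable, unlike a lambda.
    Path("/tmp/images").mkdir(parents=True, exist_ok=True)
    with open("/tmp/launches.json") as f:   # written by download_launches
        launches = json.load(f)
    for url in (item["image"] for item in launches["results"]):
        filename = url.split("/")[-1]
        urllib.request.urlretrieve(url, f"/tmp/images/{filename}")


with DAG(dag_id="rocket_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    get_pictures = PythonOperator(
        task_id="get_pictures",
        python_callable=_get_pictures,   # pass the function itself, no ()
    )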
The >> and << operators set execution order. task_a >> task_b means "run task_a first, then task_b." You can chain: download_launches >> get_pictures >> notify.
download_launches >> get_pictures >> notify is like a factory: first you get the parts (download), then you assemble (get pictures), then you ship (notify). Each step waits for the previous one to finish. The >> arrow points in the direction of flow!
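In code, the ordering looks like this (a runnable sketch using EmptyOperator as a stand-in for John's real tasks):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="ordering_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    download_launches = EmptyOperator(task_id="download_launches")
    get_pictures = EmptyOperator(task_id="get_pictures")
    notify = EmptyOperator(task_id="notify")

    # >> points downstream; this one line builds the whole chain.
    download_launches >> get_pictures >> notify
    # Equivalent, with << pointing upstream:
    # notify << get_pictures << download_launches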
default_args is a dictionary you pass to the DAG. Every task in that DAG inherits these settings: retries, retry_delay, owner, email_on_failure, etc. No need to repeat the same values for every task!
Like a class rule: "In this classroom, everyone gets 2 retries and 5 minutes between retries." You say it once. Every task (student) follows it automatically unless they have a special override.
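A sketch of default_args in action (the owner and values are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# "Class rules" that every task in this DAG inherits automatically:
default_args = {
    "owner": "john",
    "retries": 2,                         # each task tries 3 times total
    "retry_delay": timedelta(minutes=5),  # wait between attempts
    "email_on_failure": False,
}

with DAG(
    dag_id="default_args_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # This task inherits retries=2 and retry_delay without repeating them:
    notify = BashOperator(task_id="notify", bash_command='echo "done!"')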
Instead of dag = DAG(...) and passing dag=dag to every operator, use with DAG(...) as dag:. Everything inside the with block automatically belongs to that DAG. Cleaner and less error-prone!
When you use with DAG(...) as dag:, you don't need to pass dag=dag to each operator in Airflow 2+. The operator picks up the DAG from context. But if you create tasks outside the block, you must pass dag=dag explicitly.
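The two styles side by side (dag_ids and commands are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Style 1: explicit -- every operator needs dag=dag
dag1 = DAG(dag_id="explicit_style", start_date=datetime(2024, 1, 1),
           schedule=None, catchup=False)
hello1 = BashOperator(task_id="say_hello", bash_command='echo "hi"',
                      dag=dag1)

# Style 2: context manager -- operators pick up the DAG automatically
with DAG(dag_id="context_style", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag2:
    hello2 = BashOperator(task_id="say_hello", bash_command='echo "hi"')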
When a task fails: (1) Downstream tasks don't run — if get_pictures fails, notify never runs. (2) Retries — with retries=2, Airflow tries the task up to 3 times (initial + 2 retries) before marking it failed. (3) Clearing — In the UI, you can "clear" a failed task to reset it. Airflow will re-run it (and downstream tasks if you choose).
Airflow waits retry_delay (e.g. 5 minutes), then runs the task again. Repeats up to retries times.
After all retries are used, the task is marked failed. Downstream tasks stay in "upstream failed" state.
Fix the bug, then in the Airflow UI: click the task → Clear → confirm. Airflow re-runs that task. You can optionally clear downstream tasks too to re-run the whole chain.
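A sketch of retry settings on a single task; set here, they override default_args for this task only (the dag_id, command, and URL are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    get_pictures = BashOperator(
        task_id="get_pictures",
        # curl -f exits non-zero on HTTP errors, so Airflow sees the failure
        bash_command="curl -f -o /tmp/rocket.jpg https://example.com/rocket.jpg",
        retries=2,                          # 1 initial try + 2 retries
        retry_delay=timedelta(minutes=5),   # wait 5 minutes between attempts
    )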
When a task fails, click it in the Graph or Grid view and open the Log tab. The logs show the exact error (Python traceback, command output, etc.). That's your first step to debugging!
John's pipeline: download_launches → get_pictures → notify. Follow the arrows to see the execution order:
John's rocket launch pipeline: download → get pictures → notify
DAGs can have branches! Independent tasks can run at the same time (in parallel), saving time:
Weather and Sales branches run in parallel, then join together
If "Fetch Weather" takes 5 minutes and "Fetch Sales" takes 3 minutes, running them in parallel means the total is only 5 minutes (not 8). DAGs automatically detect which tasks can run in parallel!
Watch how data flows through Airflow's components:
How Airflow Processes a DAG
Airflow comes with a stunning web interface. Here's what you'll see:
The DAGs view: the home page. Shows all your DAGs with their status (running, paused, failed), schedule, owner, and recent task statuses as colored circles.
The Graph view: shows the structure of a DAG as a visual graph — tasks as boxes and dependencies as arrows. Color-coded by task state (green = success, red = failed, yellow = running).
The Grid view: shows the history of all runs over time. Each column is a DAG run, each row is a task. This is the most powerful view for debugging — you can see patterns like "this task fails every Monday."
The Code view: shows the actual Python code of your DAG file directly in the browser. Great for quick debugging without opening your IDE.
Airbnb — Created Airflow! Uses it for search ranking, pricing, and host analytics pipelines.
Runs 20,000+ DAGs for recommendation engines and content analytics.
Orchestrates ML model training, pricing calculations, and driver analytics.
Manages data pipelines that power content recommendations for 200M+ users.
Let's get Airflow running on your machine. There are two ways:
# Create a virtual environment first (always a good idea!)
python -m venv airflow_env
source airflow_env/bin/activate  # On Mac/Linux

# Install Airflow (use the constraint file for stable versions)
pip install apache-airflow==2.8.0 \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt"

# Initialize the database
airflow db init

# Create an admin user
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com

# Start the webserver (in one terminal)
airflow webserver --port 8080

# Start the scheduler (in another terminal)
airflow scheduler
# Pull and run the official Airflow Docker image
docker run -ti \
    -p 8080:8080 \
    --name airflow \
    apache/airflow:2.8.0 \
    standalone

# That's it! Open http://localhost:8080
# Login: admin / the auto-generated password that `standalone` prints on startup
pip install = You downloaded Airflow onto your computer, like installing an app from the App Store.
db init = You set up a little database where Airflow will remember everything (like creating a notebook to keep notes).
users create = You created your login account for the Airflow website.
webserver = Starts the website you can see in your browser.
scheduler = Starts the brain that decides when to run your pipelines.
Here's the simplest possible DAG — don't worry, we'll break down every line:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Define the DAG
with DAG(
    dag_id="my_first_dag",              # The name of your pipeline
    start_date=datetime(2024, 1, 1),    # When to start scheduling
    schedule="@daily",                  # Run once every day
    catchup=False,                      # Don't run for past dates
) as dag:

    # Task 1: Print hello
    hello = BashOperator(
        task_id="say_hello",
        bash_command='echo "Hello, Airflow! Today is $(date)"',
    )

    # Task 2: Print goodbye
    goodbye = BashOperator(
        task_id="say_goodbye",
        bash_command='echo "Goodbye! Pipeline complete."',
    )

    # Set the order: hello runs first, then goodbye
    hello >> goodbye
from airflow import DAG — We import the DAG class. This is the container for your entire pipeline.
from airflow.operators.bash import BashOperator — We import BashOperator so we can run shell commands.
with DAG(...) as dag: — This creates a DAG. Everything indented inside the with block belongs to this DAG.
dag_id = unique name shown in the UI.
start_date = Airflow won't schedule runs before this date.
schedule="@daily" = run once per day at midnight.
Each BashOperator is a task. It runs a bash command. The task_id is the unique name of that task within the DAG.
hello >> goodbye means "run hello first, then goodbye." The >> operator sets the execution order. Think of it as an arrow: hello → goodbye.
You'll use these commands all the time. Memorize them!
# List all your DAGs
airflow dags list

# Trigger a DAG manually
airflow dags trigger my_first_dag

# Test a specific task (without recording in DB)
airflow tasks test my_first_dag say_hello 2024-01-01

# List tasks in a DAG
airflow tasks list my_first_dag

# Check scheduler health
airflow jobs check

# View configuration
airflow config list
airflow tasks test is your best friend during development! It runs a single task without the scheduler, without recording results, and shows you the output immediately. Perfect for debugging.
Try these exercises to solidify what you've learned. Don't peek at the answers until you've tried!
Your company needs a daily pipeline that does this:
1. Downloads sales data from an API
2. Downloads inventory data from a database
3. Merges sales and inventory data
4. Calculates daily KPIs
5. Sends an email report
Question: Draw this as a DAG. Which tasks can run in parallel?
Steps 1 and 2 don't depend on each other — they fetch from completely different sources. They can run at the same time!
Parallel branch 1: Download Sales Data
Parallel branch 2: Download Inventory Data
Both feed into: Merge Data → Calculate KPIs → Send Email
Tasks 1 and 2 run in parallel. Tasks 3, 4, and 5 run sequentially after both branches complete.
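As a sketch in code (EmptyOperator placeholders; all names illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="daily_sales_report", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    download_sales = EmptyOperator(task_id="download_sales")
    download_inventory = EmptyOperator(task_id="download_inventory")
    merge_data = EmptyOperator(task_id="merge_data")
    calculate_kpis = EmptyOperator(task_id="calculate_kpis")
    send_email = EmptyOperator(task_id="send_email")

    # Steps 1 and 2 in parallel; 3, 4, 5 run sequentially afterwards.
    [download_sales, download_inventory] >> merge_data >> calculate_kpis >> send_email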
Someone wrote this dependency chain:
Task A → Task B → Task C → Task A
Question: Is this a valid DAG? Why or why not?
No! This is NOT a valid DAG because it has a cycle (A → B → C → A). Task A can't run until Task C finishes, but Task C can't run until Task B finishes, and Task B can't run until Task A finishes. Nobody can start — deadlock!
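You can even check this in code — a sketch assuming Airflow 2.x, which ships a cycle checker in airflow.utils.dag_cycle_tester:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.dag_cycle_tester import check_cycle

with DAG(dag_id="cycle_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    a = EmptyOperator(task_id="task_a")
    b = EmptyOperator(task_id="task_b")
    c = EmptyOperator(task_id="task_c")
    a >> b >> c >> a   # the cycle: C feeds back into A

# Raises AirflowDagCycleException -- the scheduler would likewise reject
# this DAG when it parses the file.
check_cycle(dag)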
Without looking at the code examples, try writing a DAG that:
1. Has dag_id = "practice_dag"
2. Starts from January 1, 2024
3. Runs weekly
4. Has 3 tasks: extract, transform, load
5. extract → transform → load
# Solution:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="practice_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command='echo "Extracting data..."',
    )
    transform = BashOperator(
        task_id="transform",
        bash_command='echo "Transforming data..."',
    )
    load = BashOperator(
        task_id="load",
        bash_command='echo "Loading data..."',
    )

    extract >> transform >> load
Test your understanding! Click on the answer you think is correct.