MODULE 1 OF 15

Meet Apache Airflow

What is Airflow? Why do millions of data pipelines run on it? Let's find out, starting from scratch!

What is Data Engineering?

Before we talk about Airflow, let's understand the world it lives in. Data engineering is the art of taking any action involving data and turning it into a reliable, repeatable, and maintainable process.

Explain Like I'm 5

Imagine you have a toy factory. Every morning, trucks bring different parts (wheels, plastic, paint). Workers sort the parts, assemble toys, paint them, and pack them into boxes for delivery.

Data engineering is like being the factory manager — you don't make the toys yourself, but you make sure every step happens in the right order, at the right time, and nothing falls apart!

Companies like Netflix, Uber, Spotify, and Airbnb deal with billions of data points every day. Someone needs to make sure all that data flows smoothly from where it's collected to where it's used. That "someone" is data engineering, and Airflow is one of the most powerful tools for this job.

Real-World Example

Spotify's Discover Weekly: Every Monday, Spotify recommends 30 songs you might like. Behind the scenes, data pipelines collect your listening history, compare it with millions of other users, run machine learning models, and push personalized playlists to your account — all automatically, every single week. That's data engineering at scale!

What is a Data Pipeline?

A data pipeline is a set of steps to move and transform data from one place to another. Think of it as an assembly line for data.

The Restaurant Kitchen Analogy

A data pipeline is like a restaurant kitchen:

Step 1: Ingredients arrive (raw data from APIs, databases, files)

Step 2: Chef preps ingredients — washes, chops, seasons (clean & transform data)

Step 3: Chef cooks the dish (run calculations, models, aggregations)

Step 4: Dish is plated and served (load into dashboard, report, or database)

Each step depends on the previous one. You can't cook ingredients that haven't been prepped yet!

A Simple Pipeline Example

Let's say you want to build a weather dashboard. Here's what the pipeline would look like:

Weather Dashboard Pipeline

1. Fetch Weather Data

Call a weather API to get forecast data for your city

2. Clean & Transform

Convert temperatures from Fahrenheit to Celsius, remove incomplete records

3. Push to Dashboard

Send the cleaned data to your weather dashboard so users can see it

Key Takeaway

A data pipeline is just a series of steps that move data from A to B, with some processing in between. The order matters — you can't push data that hasn't been cleaned yet!
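The three steps above can be sketched as three plain Python functions chained together — no Airflow yet, just the pipeline idea. The weather records here are made-up stand-ins for a real API response:

```python
# A minimal sketch of the weather pipeline as plain Python functions.
# The records below are hypothetical stand-ins for a real weather API response.

def fetch_weather():
    """Step 1: pretend we called a weather API and got raw records."""
    return [
        {"city": "Berlin", "temp_f": 68.0},
        {"city": "Oslo", "temp_f": None},   # incomplete record
        {"city": "Madrid", "temp_f": 86.0},
    ]

def clean_and_transform(records):
    """Step 2: drop incomplete records and convert Fahrenheit to Celsius."""
    return [
        {"city": r["city"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in records
        if r["temp_f"] is not None
    ]

def push_to_dashboard(records):
    """Step 3: in real life this would write to a database or dashboard API."""
    for r in records:
        print(f"{r['city']}: {r['temp_c']} C")
    return records

# The order matters: each step consumes the previous step's output.
raw = fetch_weather()
clean = clean_and_transform(raw)
result = push_to_dashboard(clean)
```

Notice that each function takes the previous function's return value as input — that's exactly the dependency structure Airflow will later manage for us.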

Pipelines as Graphs (DAGs)

Here's where things get interesting. We can draw a pipeline as a graph where:

  • Nodes = Tasks (each step in your pipeline)
  • Arrows (Edges) = Dependencies (which task must finish before the next starts)

This type of graph is called a DAG — Directed Acyclic Graph. Let's break that jargon down:

Directed

The arrows have a direction. Task A points to Task B, meaning "A must happen before B." It's a one-way street.

Acyclic

No loops! You can never come back to where you started. If Task A leads to B, and B leads to C, then C can never lead back to A. That would create an infinite loop!

Graph

The whole picture — all the tasks (nodes) and connections (edges) together form a graph. It's a visual map of your entire pipeline.

Explain Like I'm 5

Imagine you're getting dressed in the morning:

1. Put on underwear

2. Put on pants (can't do this before step 1!)

3. Put on shirt (can happen at the same time as step 2)

4. Put on jacket (needs shirt to be on first)

5. Put on shoes (needs pants to be on first)

This is a DAG! Each step has a direction (order), there are no loops (you don't go back and take off your underwear), and together it forms a graph of all your tasks.

Why "Acyclic" Matters

If Task B needs Task C to finish, and Task C needs Task B to finish — neither can ever run! This is called a deadlock. DAGs prevent this by not allowing any circular dependencies.
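You can detect a cycle mechanically with a topological sort (Kahn's algorithm) — conceptually what any DAG engine must do before running anything. A small self-contained sketch, with made-up task names:

```python
from collections import deque

def has_cycle(deps):
    """deps maps each task to the tasks it depends on.
    Kahn's algorithm: repeatedly remove tasks with no unmet dependencies;
    if any tasks are left over at the end, they form a cycle."""
    remaining = {task: set(upstream) for task, upstream in deps.items()}
    ready = deque(t for t, up in remaining.items() if not up)
    done = 0
    while ready:
        task = ready.popleft()
        done += 1
        for t, up in remaining.items():
            if task in up:
                up.remove(task)
                if not up:          # all dependencies met -> task is ready
                    ready.append(t)
    return done < len(remaining)    # leftover tasks => cycle => deadlock

valid = {"A": [], "B": ["A"], "C": ["B"]}       # A -> B -> C
broken = {"A": ["C"], "B": ["A"], "C": ["B"]}   # A -> B -> C -> A (cycle!)

print(has_cycle(valid))    # False: a valid DAG
print(has_cycle(broken))   # True: circular dependency, nobody can start
```

In the broken graph no task ever becomes "ready", which is exactly the deadlock described above.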

Enter Apache Airflow

Apache Airflow is an open-source platform for building, scheduling, and monitoring data pipelines (workflows). It was created by engineers at Airbnb in 2014 and later donated to the Apache Software Foundation.

The Origin Story

Airbnb was growing fast and had hundreds of data pipelines running every day. Engineers were using cron jobs and custom scripts, but it was getting chaotic — nobody could see which pipelines were running, failing, or stuck. Maxime Beauchemin built Airflow as an internal tool to solve this mess. It worked so well that they open-sourced it, and now thousands of companies worldwide use it.

What Makes Airflow Special?

The 5 Superpowers of Airflow

1. Python-Based

Write your pipelines as Python code. If you can code it in Python, Airflow can run it. No XML, no YAML, no drag-and-drop — pure code that you can version-control with Git.

2. Scheduling

Tell Airflow "run this pipeline every day at midnight" or "every Monday at 9am" and it just does it. Automatically. Forever. Like a very reliable alarm clock for your data.

3. Beautiful Web UI

A built-in web dashboard where you can see all your pipelines, their status, logs, history, and more. No more guessing — everything is visible.

4. Rich Integrations

Out-of-the-box connections to AWS, GCP, Azure, PostgreSQL, MySQL, Slack, email, S3, BigQuery, Snowflake, and hundreds more.

5. Failure Handling

If a task fails, Airflow can automatically retry it. If it keeps failing, it alerts you via email or Slack. You can even restart from the point of failure without re-running everything.

Think of Airflow as a Spider in a Web

Airflow doesn't process your data itself — it orchestrates the processing. Imagine a spider sitting in the center of its web. The spider (Airflow) doesn't catch the flies directly — it built the web (pipeline structure) and knows exactly when something lands on it (data arrives). It coordinates everything from the center.

Airflow tells your database: "run this query." It tells your Python script: "process this file." It tells AWS: "copy this data." But Airflow itself doesn't do the heavy lifting — it makes sure everything happens in the right order at the right time.

Airflow Architecture

Airflow has 4 main components that work together like a well-oiled machine:

Scheduler

The brain. Reads your DAG files, checks if it's time to run, and assigns tasks to workers. It runs in a continuous loop.

Web Server

The dashboard. Shows you all your DAGs, their status, logs, and history in a beautiful web interface at port 8080.

Worker(s)

The muscles. Actually execute the tasks the scheduler tells them to. Can be one machine or hundreds in parallel.

Metadata DB

The memory. Stores everything: task statuses, DAG history, run times, logs. Usually PostgreSQL or MySQL.

Explain Like I'm 5

Think of a school:

Scheduler = The Principal. Decides who teaches what and when.

Web Server = The Notice Board. Everyone can see the schedule and announcements.

Workers = The Teachers. They actually teach the classes (do the work).

Metadata DB = The Office Files. Records of everything — grades, attendance, history.

How They Work Together

The Airflow Workflow Cycle

1. You Write DAG Files

You write Python files that describe your pipeline (tasks + dependencies + schedule). These go in the dags/ folder.

2. Scheduler Reads & Parses

The scheduler continuously scans the dags/ folder, reads your Python files, and figures out each DAG's structure and schedule.

3. Scheduler Queues Tasks

When the schedule says it's time, the scheduler checks task dependencies and adds ready tasks to the execution queue.

4. Workers Execute

Workers pick up tasks from the queue and run them. Results (success/failure) are stored in the metadata database.

5. Web Server Displays

The web server reads from the database and shows you a beautiful dashboard with status, logs, and history for every task.
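The core rule the scheduler enforces — "a task may run only once everything upstream of it has succeeded" — can be sketched as a toy loop in pure Python. This is NOT how Airflow is implemented internally (no real queue, database, or schedule here), just the idea:

```python
# A toy orchestration loop illustrating the scheduler's core rule:
# a task becomes "ready" when all of its upstream tasks have succeeded.
# Hugely simplified compared to real Airflow.

def run_dag(tasks, deps):
    """tasks: task_id -> callable; deps: task_id -> list of upstream task_ids."""
    succeeded = set()                    # stands in for the metadata DB
    while len(succeeded) < len(tasks):
        ready = [
            t for t in tasks
            if t not in succeeded and all(up in succeeded for up in deps[t])
        ]
        if not ready:
            raise RuntimeError("Deadlock: remaining tasks have unmet dependencies")
        for t in ready:                  # in real Airflow, workers run these
            tasks[t]()
            succeeded.add(t)
    return succeeded

order = []
tasks = {
    "fetch": lambda: order.append("fetch"),
    "clean": lambda: order.append("clean"),
    "push": lambda: order.append("push"),
}
deps = {"fetch": [], "clean": ["fetch"], "push": ["clean"]}
run_dag(tasks, deps)
print(order)   # ['fetch', 'clean', 'push']
```

Real Airflow adds persistence (the metadata DB), retries, a distributed queue, and a clock — but the dependency-checking loop above is the heart of it.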

When to Use Airflow (and When NOT To)

Airflow is GREAT for:

  • Batch data processing: daily ETL jobs, weekly reports
  • ML model training pipelines: train a model every night with new data
  • Data warehouse loading: load data from 10 sources into Snowflake daily
  • Scheduled reports: generate and email a sales report every Monday
  • Complex multi-step processes: Fetch → Clean → Transform → Load → Validate → Notify

Airflow is NOT ideal for:

  • Real-time streaming data: better served by Apache Kafka, Apache Flink, or Spark Streaming
  • Pipelines that change structure every run: custom orchestration tools
  • Teams with zero Python experience: visual tools like Azure Data Factory or SSIS
  • Sub-second latency requirements: event-driven architectures

Remember

Airflow is an orchestrator, not a data processing engine. It tells other tools what to do and when. It's the conductor of the orchestra, not the violinist.

Airflow vs Other Tools

Feature          | Airflow       | Luigi            | Prefect   | Dagster
Language         | Python        | Python           | Python    | Python
Web UI           | Excellent     | Basic            | Excellent | Good
Scheduling       | Built-in      | None (need cron) | Built-in  | Built-in
Community        | Massive       | Medium           | Growing   | Growing
Learning Curve   | Medium        | Easy             | Easy      | Medium
Backfilling      | Yes           | No               | Yes       | Yes
Production Ready | Battle-tested | Limited          | Yes       | Yes

Why Airflow Wins for Most Teams

Airflow has the largest community, the most integrations, and is battle-tested at companies like Airbnb, Google, Amazon, Netflix, and Uber. When you Google a problem, you'll almost always find an answer. That matters more than any feature comparison!

What a DAG Looks Like

Here's a simple weather dashboard pipeline drawn as a DAG. Follow the arrows to see the execution order:

Fetch Weather Data → Clean & Transform → Push to Dashboard

A simple 3-task DAG: data flows from left to right

Parallel Tasks in a DAG

DAGs can have branches! Independent tasks can run at the same time (in parallel), saving time:

Branch 1 (parallel): Fetch Weather → Clean Weather
Branch 2 (parallel): Fetch Sales → Clean Sales
Both branches → Join Data → Train Model

Weather and Sales branches run in parallel, then join together

Why Parallel Matters

If "Fetch Weather" takes 5 minutes and "Fetch Sales" takes 3 minutes, running them in parallel means the total is only 5 minutes (not 8). Airflow automatically detects which tasks can run in parallel from the dependencies in your DAG!
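You can see this timing effect in plain Python with a thread pool — here the two fetches are simulated with short sleeps (0.5s and 0.3s standing in for 5 and 3 minutes):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated fetch tasks: sleeps stand in for slow network calls.

def fetch_weather():
    time.sleep(0.5)
    return "weather data"

def fetch_sales():
    time.sleep(0.3)
    return "sales data"

# Sequential: total time is the SUM of both tasks (~0.8s)
start = time.perf_counter()
fetch_weather()
fetch_sales()
sequential = time.perf_counter() - start

# Parallel: total time is roughly the SLOWEST task (~0.5s)
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fetch_weather), pool.submit(fetch_sales)]
    results = [f.result() for f in futures]
parallel = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

Airflow's workers do the same thing at pipeline scale: any tasks whose dependencies are all met can be picked up and run at the same time.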

Airflow Architecture — Animated

Watch how data flows through Airflow's components:

How Airflow Processes a DAG


DAG File → Scheduler → Task Queue → Worker → Results in DB → Web UI

The Airflow Web UI

Airflow comes with a stunning web interface. Here's what you'll see:

Key Views in the Airflow UI

DAGs List View

The home page. Shows all your DAGs with their status (running, paused, failed), schedule, owner, and recent task statuses as colored circles.

Graph View

Shows the structure of a DAG as a visual graph — tasks as boxes and dependencies as arrows. Color-coded by task state (green = success, red = failed, yellow = running).

Grid View (formerly Tree View)

Shows the history of all runs over time. Each column is a DAG run, each row is a task. This is the most powerful view for debugging — you can see patterns like "this task fails every Monday."

Code View

Shows the actual Python code of your DAG file directly in the browser. Great for quick debugging without opening your IDE.

Who Uses Airflow?

Airbnb

Created Airflow! Uses it for search ranking, pricing, and host analytics pipelines.

Spotify

Runs 20,000+ DAGs for recommendation engines and content analytics.

Uber

Orchestrates ML model training, pricing calculations, and driver analytics.

Netflix

Manages data pipelines that power content recommendations for 200M+ users.

Installing Airflow

Let's get Airflow running on your machine. There are two ways:

Option 1: pip install (simplest)

# Create a virtual environment first (always a good idea!)
python -m venv airflow_env
source airflow_env/bin/activate  # On Mac/Linux

# Install Airflow (pin versions with the official constraints file;
# match the constraints file to your Python version, 3.9 here)
pip install apache-airflow==2.8.0 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt"

# Initialize the metadata database
# (on Airflow 2.7+, "airflow db migrate" is the preferred command)
airflow db init

# Create an admin user
airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com

# Start the webserver (in one terminal)
airflow webserver --port 8080

# Start the scheduler (in another terminal)
airflow scheduler

Option 2: Docker (recommended for beginners)

# Pull and run the official Airflow Docker image
docker run -ti \
  -p 8080:8080 \
  --name airflow \
  apache/airflow:2.8.0 \
  standalone

# That's it! Open http://localhost:8080
# Login: admin / the password that "standalone" prints in its startup logs
# (it's also saved to standalone_admin_password.txt in your Airflow home)

What Just Happened?

pip install = You downloaded Airflow onto your computer, like installing an app from the App Store.

db init = You set up a little database where Airflow will remember everything (like creating a notebook to keep notes).

users create = You created your login account for the Airflow website.

webserver = Starts the website you can see in your browser.

scheduler = Starts the brain that decides when to run your pipelines.

Your First Look at a DAG File

Here's the simplest possible DAG — don't worry, we'll break down every line:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Define the DAG
with DAG(
    dag_id="my_first_dag",           # The name of your pipeline
    start_date=datetime(2024, 1, 1),  # When to start scheduling
    schedule="@daily",                # Run once every day
    catchup=False,                    # Don't run for past dates
) as dag:

    # Task 1: Print hello
    hello = BashOperator(
        task_id="say_hello",
        bash_command='echo "Hello, Airflow! Today is $(date)"',
    )

    # Task 2: Print goodbye
    goodbye = BashOperator(
        task_id="say_goodbye",
        bash_command='echo "Goodbye! Pipeline complete."',
    )

    # Set the order: hello runs first, then goodbye
    hello >> goodbye

Line-by-Line Breakdown

1. Imports

from airflow import DAG — We import the DAG class. This is the container for your entire pipeline.
from airflow.operators.bash import BashOperator — We import BashOperator so we can run shell commands.

2. DAG Definition

with DAG(...) as dag: — This creates a DAG. Everything indented inside the with block belongs to this DAG.
dag_id = unique name shown in the UI.
start_date = Airflow won't schedule runs before this date.
schedule="@daily" = run once per day at midnight.

3. Tasks

Each BashOperator is a task. It runs a bash command. The task_id is the unique name of that task within the DAG.

4. Dependencies

hello >> goodbye means "run hello first, then goodbye." The >> operator sets the execution order. Think of it as an arrow: hello → goodbye.
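There's no magic in >> — it's ordinary Python operator overloading (the __rshift__ method). Here's a toy re-implementation of the idea; this is not Airflow's actual code, just an illustration of the mechanism:

```python
class Task:
    """A toy task that mimics how Airflow's >> operator records dependencies.
    (Illustration only -- Airflow's real Operator classes are more involved.)"""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []       # tasks that must run AFTER this one

    def __rshift__(self, other):
        """a >> b records b as downstream of a."""
        self.downstream.append(other)
        return other               # returning `other` lets you chain: a >> b >> c

    def __repr__(self):
        return self.task_id

hello = Task("say_hello")
goodbye = Task("say_goodbye")

hello >> goodbye               # same shape as in the DAG file above

print(hello.downstream)        # [say_goodbye]
```

In real Airflow, >> also accepts lists — task1 >> [task2, task3] fans out to two downstream tasks — and << works in the opposite direction.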

Key Airflow CLI Commands

You'll use these commands all the time. Memorize them!

# List all your DAGs
airflow dags list

# Trigger a DAG manually
airflow dags trigger my_first_dag

# Test a specific task (without recording in DB)
airflow tasks test my_first_dag say_hello 2024-01-01

# List tasks in a DAG
airflow tasks list my_first_dag

# Check scheduler health
airflow jobs check

# View configuration
airflow config list

Pro Tip

airflow tasks test is your best friend during development! It runs a single task without the scheduler, without recording results, and shows you the output immediately. Perfect for debugging.

Practice Exercises

Try these exercises to solidify what you've learned. Don't peek at the answers until you've tried!

Exercise 1: Identify the DAG

Scenario

Your company needs a daily pipeline that does this:

1. Downloads sales data from an API

2. Downloads inventory data from a database

3. Merges sales and inventory data

4. Calculates daily KPIs

5. Sends an email report

Question: Draw this as a DAG. Which tasks can run in parallel?

Hint

Steps 1 and 2 don't depend on each other — they fetch from completely different sources. They can run at the same time!

Answer

Parallel branch 1: Download Sales Data
Parallel branch 2: Download Inventory Data
Both feed into: Merge Data → Calculate KPIs → Send Email

Tasks 1 and 2 run in parallel. Tasks 3, 4, and 5 run sequentially after both branches complete.
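You can check parallelism mechanically: two tasks can run side by side exactly when neither is an (even indirect) ancestor of the other. A small pure-Python sketch of the exercise pipeline — the task names are made up for illustration:

```python
def ancestors(task, deps, seen=None):
    """All tasks that must finish (directly or indirectly) before `task` starts."""
    seen = set() if seen is None else seen
    for up in deps[task]:
        if up not in seen:
            seen.add(up)
            ancestors(up, deps, seen)
    return seen

def can_run_in_parallel(a, b, deps):
    """Two tasks may run side by side iff neither depends on the other."""
    return a not in ancestors(b, deps) and b not in ancestors(a, deps)

# The exercise pipeline as a dependency map (task -> upstream tasks)
deps = {
    "download_sales": [],
    "download_inventory": [],
    "merge": ["download_sales", "download_inventory"],
    "calculate_kpis": ["merge"],
    "send_email": ["calculate_kpis"],
}

print(can_run_in_parallel("download_sales", "download_inventory", deps))  # True
print(can_run_in_parallel("merge", "download_sales", deps))               # False
```

The two downloads share no ancestry, so they can run in parallel; everything after merge forms a strict chain.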

Exercise 2: Spot the Cycle

Scenario

Someone wrote this dependency chain:

Task A → Task B → Task C → Task A

Question: Is this a valid DAG? Why or why not?

Answer

No! This is NOT a valid DAG because it has a cycle (A → B → C → A). Task A can't run until Task C finishes, but Task C can't run until Task B finishes, and Task B can't run until Task A finishes. Nobody can start — deadlock!

Exercise 3: Write a DAG Skeleton

Challenge

Without looking at the code examples, try writing a DAG that:

1. Has dag_id = "practice_dag"

2. Starts from January 1, 2024

3. Runs weekly

4. Has 3 tasks: extract, transform, load

5. extract → transform → load

# Solution:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="practice_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command='echo "Extracting data..."',
    )

    transform = BashOperator(
        task_id="transform",
        bash_command='echo "Transforming data..."',
    )

    load = BashOperator(
        task_id="load",
        bash_command='echo "Loading data..."',
    )

    extract >> transform >> load

Module 1 Quiz

Test your understanding! Click on the answer you think is correct.

1. What does DAG stand for?

2. What is the role of the Airflow Scheduler?

3. Why must a DAG be "acyclic"?

4. Airflow is primarily a ___?

5. Which company originally created Apache Airflow?

6. What does the >> operator do in Airflow?

7. Which Airflow component actually executes the tasks?

8. Which is NOT a good use case for Airflow?