What is Airflow? Why do millions of data pipelines run on it? Let's find out from scratch!
Before we talk about Airflow, let's understand the world it lives in. Data engineering is the art of turning anything you do with data into a reliable, repeatable, and maintainable process.
Imagine you have a toy factory. Every morning, trucks bring different parts (wheels, plastic, paint). Workers sort the parts, assemble toys, paint them, and pack them into boxes for delivery.
Data engineering is like being the factory manager — you don't make the toys yourself, but you make sure every step happens in the right order, at the right time, and nothing falls apart!
Companies like Netflix, Uber, Spotify, and Airbnb deal with billions of data points every day. Someone needs to make sure all that data flows smoothly from where it's collected to where it's used. That "someone" is data engineering, and Airflow is one of the most powerful tools for this job.
Spotify's Discover Weekly: Every Monday, Spotify recommends 30 songs you might like. Behind the scenes, data pipelines collect your listening history, compare it with millions of other users, run machine learning models, and push personalized playlists to your account — all automatically, every single week. That's data engineering at scale!
A data pipeline is a set of steps to move and transform data from one place to another. Think of it as an assembly line for data.
A data pipeline is like a restaurant kitchen:
Step 1: Ingredients arrive (raw data from APIs, databases, files)
Step 2: Chef preps ingredients — washes, chops, seasons (clean & transform data)
Step 3: Chef cooks the dish (run calculations, models, aggregations)
Step 4: Dish is plated and served (load into dashboard, report, or database)
Each step depends on the previous one. You can't cook ingredients that haven't been prepped yet!
Let's say you want to build a weather dashboard. Here's what the pipeline would look like:
Call a weather API to get forecast data for your city
Convert temperatures from Fahrenheit to Celsius, remove incomplete records
Send the cleaned data to your weather dashboard so users can see it
A data pipeline is just a series of steps that move data from A to B, with some processing in between. The order matters — you can't push data that hasn't been cleaned yet!
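To make this concrete, here's a minimal sketch of that weather pipeline as plain Python functions (no Airflow yet). The API URL and the field names in the response are made up for illustration:

```python
import requests

API_URL = "https://api.example.com/forecast"  # hypothetical endpoint


def fetch_forecast(city: str) -> list[dict]:
    """Step 1: call the weather API and return the raw forecast records."""
    response = requests.get(API_URL, params={"city": city}, timeout=10)
    response.raise_for_status()
    return response.json()["records"]  # assumed response shape


def clean_forecast(records: list[dict]) -> list[dict]:
    """Step 2: convert Fahrenheit to Celsius and drop incomplete records."""
    cleaned = []
    for record in records:
        if record.get("temp_f") is None:  # incomplete record: skip it
            continue
        record["temp_c"] = round((record["temp_f"] - 32) * 5 / 9, 1)
        cleaned.append(record)
    return cleaned


def load_to_dashboard(records: list[dict]) -> None:
    """Step 3: hand the cleaned data to the dashboard (stubbed out here)."""
    print(f"Publishing {len(records)} cleaned records to the dashboard")


# Each step consumes the previous step's output, so the order matters
load_to_dashboard(clean_forecast(fetch_forecast("Berlin")))
```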
Here's where things get interesting. We can draw a pipeline as a graph where each task is a node and each dependency is an arrow (edge) pointing from the task that must run first to the one that runs after it.
This type of graph is called a DAG — Directed Acyclic Graph. Let's break that jargon down:
Directed: The arrows have a direction. Task A points to Task B, meaning "A must happen before B." It's a one-way street.
Acyclic: No loops! You can never come back to where you started. If Task A leads to B, and B leads to C, then C can never lead back to A. That would create an infinite loop!
Graph: The whole picture — all the tasks (nodes) and connections (edges) together form a graph. It's a visual map of your entire pipeline.
Imagine you're getting dressed in the morning:
1. Put on underwear
2. Put on pants (can't do this before step 1!)
3. Put on shirt (can happen at the same time as step 2)
4. Put on jacket (needs shirt to be on first)
5. Put on shoes (needs pants to be on first)
This is a DAG! Each step has a direction (order), there are no loops (you don't go back and take off your underwear), and together it forms a graph of all your tasks.
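Here's that morning routine as a tiny Python sketch: the DAG is just a dictionary mapping each task to the tasks that must finish first, and a topological sort produces a valid order.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Each task maps to the set of tasks that must finish before it
getting_dressed = {
    "underwear": set(),
    "pants": {"underwear"},
    "shirt": set(),
    "jacket": {"shirt"},
    "shoes": {"pants"},
}

# static_order() raises CycleError if the graph contains a loop
print(list(TopologicalSorter(getting_dressed).static_order()))
# e.g. ['underwear', 'shirt', 'pants', 'jacket', 'shoes']
```

Notice that pants and shirt don't depend on each other, so either order is valid. That freedom is exactly what lets a scheduler like Airflow run independent tasks in parallel.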
If Task B needs Task C to finish, and Task C needs Task B to finish — neither can ever run! This is called a deadlock. DAGs prevent this by not allowing any circular dependencies.
Apache Airflow is an open-source platform for building, scheduling, and monitoring data pipelines (workflows). It was created by engineers at Airbnb in 2014 and later donated to the Apache Software Foundation.
Airbnb was growing fast and had hundreds of data pipelines running every day. Engineers were using cron jobs and custom scripts, but it was getting chaotic — nobody could see which pipelines were running, failing, or stuck. Maxime Beauchemin built Airflow as an internal tool to solve this mess. It worked so well that they open-sourced it, and now thousands of companies worldwide use it.
Pipelines as code: Write your pipelines as Python code. If you can code it in Python, Airflow can run it. No XML, no YAML, no drag-and-drop — pure code that you can version-control with Git.
Built-in scheduling: Tell Airflow "run this pipeline every day at midnight" or "every Monday at 9am" and it just does it. Automatically. Forever. Like a very reliable alarm clock for your data.
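In code, scheduling is a single argument: a preset like "@daily" or any cron expression. A minimal sketch (the dag_id values and dates are placeholders):

```python
from datetime import datetime
from airflow import DAG

# Preset: run once a day at midnight
with DAG(dag_id="daily_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    ...

# Cron expression: run every Monday at 9am
with DAG(dag_id="monday_report", start_date=datetime(2024, 1, 1),
         schedule="0 9 * * 1", catchup=False):
    ...
```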
Monitoring UI: A built-in web dashboard where you can see all your pipelines, their status, logs, history, and more. No more guessing — everything is visible.
Integrations: Out-of-the-box connections to AWS, GCP, Azure, PostgreSQL, MySQL, Slack, email, S3, BigQuery, Snowflake, and hundreds more.
Retries and alerting: If a task fails, Airflow can automatically retry it. If it keeps failing, it alerts you via email or Slack. You can even restart from the point of failure without re-running everything.
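Retries are configured right on the task. A minimal sketch with illustrative numbers and a made-up URL (email alerts also require SMTP settings in your Airflow config):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    flaky_call = BashOperator(
        task_id="call_flaky_api",
        bash_command="curl --fail https://api.example.com/data",  # hypothetical endpoint
        retries=3,                         # retry up to 3 times on failure
        retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
        email_on_failure=True,             # alert once all retries are exhausted
    )
```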
Airflow doesn't process your data itself — it orchestrates the processing. Imagine a spider sitting in the center of its web. The spider (Airflow) doesn't catch the flies directly — it built the web (pipeline structure) and knows exactly when something lands on it (data arrives). It coordinates everything from the center.
Airflow tells your database: "run this query." It tells your Python script: "process this file." It tells AWS: "copy this data." But Airflow itself doesn't do the heavy lifting — it makes sure everything happens in the right order at the right time.
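For example, a task that runs a SQL query is just a thin instruction; the database does the heavy lifting. A sketch, assuming the apache-airflow-providers-postgres package is installed and a connection named my_db is configured in Airflow:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(dag_id="orchestration_demo", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    # Airflow only dispatches the query and records the result;
    # Postgres does the actual work
    build_summary = PostgresOperator(
        task_id="build_daily_summary",
        postgres_conn_id="my_db",            # assumed connection ID
        sql="SELECT count(*) FROM events;",  # illustrative query
    )
```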
Airflow has 4 main components that work together like a well-oiled machine:
The Scheduler: the brain. Reads your DAG files, checks if it's time to run, and assigns tasks to workers. It runs in a continuous loop.
The Web Server: the dashboard. Shows you all your DAGs, their status, logs, and history in a beautiful web interface at port 8080.
The Workers: the muscles. They actually execute the tasks the scheduler tells them to. Can be one machine or hundreds in parallel.
The Metadata Database: the memory. Stores everything: task statuses, DAG history, run times, logs. Usually PostgreSQL or MySQL.
Think of a school:
Scheduler = The Principal. Decides who teaches what and when.
Web Server = The Notice Board. Everyone can see the schedule and announcements.
Workers = The Teachers. They actually teach the classes (do the work).
Metadata DB = The Office Files. Records of everything — grades, attendance, history.
1. You write Python files that describe your pipeline (tasks + dependencies + schedule). These go in the dags/ folder.
2. The scheduler continuously scans the dags/ folder, reads your Python files, and figures out the DAG structure and schedule.
3. When the schedule says it's time, the scheduler checks task dependencies and adds ready tasks to the execution queue.
4. Workers pick up tasks from the queue and run them. Results (success/failure) are stored in the metadata database.
5. The web server reads from the database and shows you a beautiful dashboard with status, logs, and history for every task.
| Use Case | Example |
|---|---|
| Batch data processing | Daily ETL jobs, weekly reports |
| ML model training pipelines | Train a model every night with new data |
| Data warehouse loading | Load data from 10 sources into Snowflake daily |
| Scheduled reports | Generate and email a sales report every Monday |
| Complex multi-step processes | Fetch → Clean → Transform → Load → Validate → Notify |
| Use Case | Better Alternative |
|---|---|
| Real-time streaming data | Apache Kafka, Apache Flink, Spark Streaming |
| Pipelines that change structure every run | Custom orchestration tools |
| Teams with zero Python experience | Azure Data Factory, SSIS (visual tools) |
| Sub-second latency requirements | Event-driven architectures |
Airflow is an orchestrator, not a data processing engine. It tells other tools what to do and when. It's the conductor of the orchestra, not the violinist.
| Feature | Airflow | Luigi | Prefect | Dagster |
|---|---|---|---|---|
| Language | Python | Python | Python | Python |
| Web UI | Excellent | Basic | Excellent | Good |
| Scheduling | Built-in | None (need cron) | Built-in | Built-in |
| Community | Massive | Medium | Growing | Growing |
| Learning Curve | Medium | Easy | Easy | Medium |
| Backfilling | Yes | No | Yes | Yes |
| Production Ready | Battle-tested | Limited | Yes | Yes |
Airflow has the largest community, the most integrations, and is battle-tested at companies like Airbnb, Google, Amazon, Netflix, and Uber. When you Google a problem, you'll almost always find an answer. That matters more than any feature comparison!
Here's a simple weather dashboard pipeline drawn as a DAG. Follow the arrows to see the execution order:
[Diagram: a simple 3-task DAG, with data flowing from left to right]
DAGs can have branches! Independent tasks can run at the same time (in parallel), saving time:
[Diagram: the Weather and Sales branches run in parallel, then join together]
If "Fetch Weather" takes 5 minutes and "Fetch Sales" takes 3 minutes, running them in parallel means the total is only 5 minutes (not 8). DAGs automatically detect which tasks can run in parallel!
[Animated diagram: How Airflow Processes a DAG, showing data flowing through the scheduler, queue, workers, web server, and metadata database]
Airflow comes with a stunning web interface. Here's what you'll see:
DAGs view: The home page. Shows all your DAGs with their status (running, paused, failed), schedule, owner, and recent task statuses as colored circles.
Graph view: Shows the structure of a DAG as a visual graph — tasks as boxes and dependencies as arrows. Color-coded by task state (green = success, red = failed, yellow = running).
Grid view: Shows the history of all runs over time. Each column is a DAG run, each row is a task. This is the most powerful view for debugging — you can see patterns like "this task fails every Monday."
Code view: Shows the actual Python code of your DAG file directly in the browser. Great for quick debugging without opening your IDE.
Airbnb: Created Airflow! Uses it for search ranking, pricing, and host analytics pipelines.
Spotify: Runs 20,000+ DAGs for recommendation engines and content analytics.
Uber: Orchestrates ML model training, pricing calculations, and driver analytics.
Netflix: Manages data pipelines that power content recommendations for 200M+ users.
Let's get Airflow running on your machine. There are two ways:
Option 1: Local installation with pip

```bash
# Create a virtual environment first (always a good idea!)
python -m venv airflow_env
source airflow_env/bin/activate  # On Mac/Linux

# Install Airflow (use the constraint file for stable versions)
pip install apache-airflow==2.8.0 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.9.txt"

# Initialize the database
airflow db init

# Create an admin user
airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com

# Start the webserver (in one terminal)
airflow webserver --port 8080

# Start the scheduler (in another terminal)
airflow scheduler
```
Option 2: Run with Docker

```bash
# Pull and run the official Airflow Docker image
docker run -ti \
  -p 8080:8080 \
  --name airflow \
  apache/airflow:2.8.0 \
  standalone

# That's it! Open http://localhost:8080
# Login: admin / (standalone generates a password and prints it in the container logs)
```
pip install = You downloaded Airflow onto your computer, like installing an app from the App Store.
db init = You set up a little database where Airflow will remember everything (like creating a notebook to keep notes).
users create = You created your login account for the Airflow website.
webserver = Starts the website you can see in your browser.
scheduler = Starts the brain that decides when to run your pipelines.
Here's the simplest possible DAG — don't worry, we'll break down every line:
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Define the DAG
with DAG(
    dag_id="my_first_dag",            # The name of your pipeline
    start_date=datetime(2024, 1, 1),  # When to start scheduling
    schedule="@daily",                # Run once every day
    catchup=False,                    # Don't run for past dates
) as dag:

    # Task 1: Print hello
    hello = BashOperator(
        task_id="say_hello",
        bash_command='echo "Hello, Airflow! Today is $(date)"',
    )

    # Task 2: Print goodbye
    goodbye = BashOperator(
        task_id="say_goodbye",
        bash_command='echo "Goodbye! Pipeline complete."',
    )

    # Set the order: hello runs first, then goodbye
    hello >> goodbye
```
from airflow import DAG — We import the DAG class. This is the container for your entire pipeline.
from airflow.operators.bash import BashOperator — We import BashOperator so we can run shell commands.
with DAG(...) as dag: — This creates a DAG. Everything indented inside the with block belongs to this DAG.
dag_id = unique name shown in the UI.
start_date = Airflow won't schedule runs before this date.
schedule="@daily" = run once per day at midnight.
Each BashOperator is a task. It runs a bash command. The task_id is the unique name of that task within the DAG.
hello >> goodbye means "run hello first, then goodbye." The >> operator sets the execution order. Think of it as an arrow: hello → goodbye.
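The >> operator also accepts lists, which is how you write the parallel branches from earlier. A small sketch using EmptyOperator (a built-in no-op task that's handy for laying out structure; the task names are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="branching_demo", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    fetch_weather = EmptyOperator(task_id="fetch_weather")
    fetch_sales = EmptyOperator(task_id="fetch_sales")
    merge = EmptyOperator(task_id="merge")

    # A list fans out / fans in: fetch_weather and fetch_sales
    # run in parallel, and merge waits for both to succeed
    [fetch_weather, fetch_sales] >> merge
```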
You'll use these commands all the time. Memorize them!
```bash
# List all your DAGs
airflow dags list

# Trigger a DAG manually
airflow dags trigger my_first_dag

# Test a specific task (without recording in DB)
airflow tasks test my_first_dag say_hello 2024-01-01

# List tasks in a DAG
airflow tasks list my_first_dag

# Check scheduler health
airflow jobs check

# View configuration
airflow config list
```
airflow tasks test is your best friend during development! It runs a single task without the scheduler, without recording results, and shows you the output immediately. Perfect for debugging.
Try these exercises to solidify what you've learned. Don't peek at the answers until you've tried!
Your company needs a daily pipeline that does this:
1. Downloads sales data from an API
2. Downloads inventory data from a database
3. Merges sales and inventory data
4. Calculates daily KPIs
5. Sends an email report
Question: Draw this as a DAG. Which tasks can run in parallel?
Steps 1 and 2 don't depend on each other — they fetch from completely different sources. They can run at the same time!
Parallel branch 1: Download Sales Data
Parallel branch 2: Download Inventory Data
Both feed into: Merge Data → Calculate KPIs → Send Email
Tasks 1 and 2 run in parallel. Tasks 3, 4, and 5 run sequentially after both branches complete.
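Here's one way to write that solution in Airflow syntax. A hedged sketch: the echo commands stand in for the real download, merge, and email logic.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="daily_sales_report", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    download_sales = BashOperator(
        task_id="download_sales",
        bash_command='echo "downloading sales data..."',
    )
    download_inventory = BashOperator(
        task_id="download_inventory",
        bash_command='echo "downloading inventory data..."',
    )
    merge_data = BashOperator(
        task_id="merge_data",
        bash_command='echo "merging sales and inventory..."',
    )
    calculate_kpis = BashOperator(
        task_id="calculate_kpis",
        bash_command='echo "calculating daily KPIs..."',
    )
    send_report = BashOperator(
        task_id="send_email_report",
        bash_command='echo "sending email report..."',
    )

    # Tasks 1 and 2 in parallel, then the sequential tail
    [download_sales, download_inventory] >> merge_data >> calculate_kpis >> send_report
```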
Someone wrote this dependency chain:
Task A → Task B → Task C → Task A
Question: Is this a valid DAG? Why or why not?
No! This is NOT a valid DAG because it has a cycle (A → B → C → A). Task A can't run until Task C finishes, but Task C can't run until Task B finishes, and Task B can't run until Task A finishes. Nobody can start — deadlock!
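You can verify this with the same standard-library tool from the getting-dressed example: a cycle makes the topological sort fail.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to its prerequisites: A -> B -> C -> A
cyclic = {"B": {"A"}, "C": {"B"}, "A": {"C"}}

try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError as err:
    print("Not a valid DAG:", err)  # every task is waiting on another
```

Airflow performs the same kind of check when it parses your DAG file and refuses to load any DAG that contains a cycle.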
Without looking at the code examples, try writing a DAG that:
1. Has dag_id = "practice_dag"
2. Starts from January 1, 2024
3. Runs weekly
4. Has 3 tasks: extract, transform, load
5. extract → transform → load
```python
# Solution:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="practice_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command='echo "Extracting data..."',
    )

    transform = BashOperator(
        task_id="transform",
        bash_command='echo "Transforming data..."',
    )

    load = BashOperator(
        task_id="load",
        bash_command='echo "Loading data..."',
    )

    extract >> transform >> load
```