Master cron expressions, execution dates, incremental processing, and backfilling!
Scheduling means running your DAGs automatically at fixed intervals, without you having to click "Trigger DAG" every time. Instead of manual runs, Airflow's scheduler decides when each DAG should run based on the schedule you define.
Imagine you have a pet hamster that needs feeding every morning at 8am. You could wake up every day and feed it yourself — or you could set an alarm clock. The alarm schedules the feeding for you automatically!
Airflow scheduling is like that alarm clock — you tell it "run this pipeline every day at midnight" or "every hour" and it wakes up and runs your DAG at the right time, every time.
A company tracks user events (clicks, sign-ups, purchases) in a database. Every day at 6am, they need a pipeline to process yesterday's events, aggregate them, and update their analytics dashboard. Without scheduling, someone would have to log in at 6am every single day. With Airflow scheduling, the DAG runs automatically at 6am every day — no human needed!
In your DAG definition, you specify when the DAG runs using either schedule_interval (older Airflow) or schedule (Airflow 2.2+). They do the same thing!
schedule_interval was deprecated in favor of schedule. Use schedule in new DAGs. Both work, but schedule is the modern choice.
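Here's a minimal sketch of the two spellings side by side (the dag_ids are invented for illustration):

```python
from datetime import datetime
from airflow import DAG

# Older spelling (deprecated, but still accepted):
legacy = DAG(
    dag_id="legacy_style",  # hypothetical id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
)

# Modern spelling (Airflow 2.2+), preferred in new DAGs:
modern = DAG(
    dag_id="modern_style",  # hypothetical id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
)
```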
You can use three types of schedules:
- Presets: @daily, @weekly
- Cron expressions: 0 9 * * MON (every Monday at 9am)
- Timedelta objects: timedelta(hours=3)

Airflow provides convenient preset strings for common schedules. No need to learn cron syntax for these!
| Preset | Meaning | Equivalent Cron |
|---|---|---|
| @once | Run once (no automatic repeats) | N/A |
| @hourly | Every hour at minute 0 | 0 * * * * |
| @daily | Every day at midnight (00:00) | 0 0 * * * |
| @weekly | Every Sunday at midnight | 0 0 * * 0 |
| @monthly | First day of every month at midnight | 0 0 1 * * |
| @yearly | January 1st at midnight | 0 0 1 1 * |
Use presets when your schedule matches exactly. Need "every Monday at 9am"? Presets won't help — you'll need a cron expression!
A cron expression has 5 fields separated by spaces. Each field controls one part of the schedule:
minute hour day-of-month month day-of-week
0 9 * * MON
(every Monday at 9:00 AM)
| Field | Valid Values | Example |
|---|---|---|
| minute | 0–59 | 0 = at :00, 30 = at :30 |
| hour | 0–23 | 9 = 9am, 0 = midnight |
| day-of-month | 1–31 | 1 = 1st, 15 = 15th |
| month | 1–12 or JAN–DEC | 1 = Jan, 12 = Dec |
| day-of-week | 0–7 (0 and 7 = Sunday) or MON–SUN | MON = Monday, 5 = Friday |
- * (wildcard) — Every value. * in minute = every minute
- 1-5 (range) — From 1 to 5 inclusive
- */15 (step) — Every 15 units. */15 in minute = every 15 minutes
- 1,15,30 (list) — Only 1, 15, and 30

| Expression | Meaning |
|---|---|
| 0 * * * * | Every hour (at minute 0) |
| 0 0 * * * | Every day at midnight |
| 0 9 * * MON | Every Monday at 9am |
| */15 * * * * | Every 15 minutes |
| 0 0 1 * * | First day of every month at midnight |
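Not sure what a cron expression will actually do? You can preview its upcoming fire times with the croniter package, which Airflow itself depends on (assuming it's importable in your environment):

```python
from datetime import datetime
from croniter import croniter

# Preview the next three fire times for "every Monday at 9am",
# starting from Wednesday, March 13, 2024.
it = croniter("0 9 * * MON", datetime(2024, 3, 13))
for _ in range(3):
    print(it.get_next(datetime))
# 2024-03-18 09:00:00
# 2024-03-25 09:00:00
# 2024-04-01 09:00:00
```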
This is the most confusing part of Airflow! Once you get it, everything clicks. Let's break it down.
Imagine you run a restaurant. Every day, you "close the books" for the previous day. On Tuesday morning, you review Monday's sales. You're not reviewing Tuesday's data — Tuesday isn't over yet! You're reviewing the data for the period that just ended (Monday).
Airflow works the same way. A DAG run that starts on Tuesday at 6am is processing Monday's data. The "execution date" (logical_date) points to the beginning of that data period — midnight Monday — even though the run actually happens on Tuesday.
logical_date (formerly execution_date) — The start of the data interval. For a daily DAG, it's midnight of the day whose data you're processing.

A DAG run for a period STARTS at the END of that period. You process Monday's data when Monday is over — i.e., on Tuesday. The logical_date = Monday 00:00 (start of Monday), even though you actually run on Tuesday.
Airflow 2.2+ uses data_interval_start and data_interval_end. The old execution_date is deprecated. New code should use the interval terms!
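To make the relationship concrete, here's the arithmetic for one daily run in plain Python (no Airflow API involved; the dates are just an example):

```python
from datetime import datetime, timedelta

# The run that processes Monday's data fires only once Monday is over.
data_interval_start = datetime(2024, 3, 11)                  # Mon 00:00 (the logical_date)
data_interval_end = data_interval_start + timedelta(days=1)  # Tue 00:00
run_after = data_interval_end                                # the run actually triggers here

print(data_interval_start, data_interval_end, run_after)
```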
Incremental processing means you process only new data since the last run, instead of reprocessing everything. This saves time and resources.
Imagine you have a giant toy box. Every day you get 5 new toys. You could either: (A) count ALL the toys in the box every day, or (B) just add the 5 new ones to your count. Option B is incremental — you only process what's new!
Your app logs user events (clicks, page views) to a database. A daily DAG with schedule="@daily" runs every morning. Instead of processing all events ever, you use {{ data_interval_start }} and {{ data_interval_end }} in your SQL:
WHERE event_time >= '{{ data_interval_start }}' AND event_time < '{{ data_interval_end }}'
Each run processes only that day's events. Monday's run gets Monday's events. Tuesday's run gets Tuesday's events. No duplicates, no full-table scans!
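Wired into an operator, that clause might look like the sketch below. It assumes the common SQL provider is installed, a connection id events_db exists, and there's a daily_event_counts target table (all three are assumptions for illustration):

```python
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Each run renders the template variables to its own interval,
# so no two runs process the same rows.
aggregate_events = SQLExecuteQueryOperator(
    task_id="aggregate_events",
    conn_id="events_db",  # assumed connection id
    sql="""
        INSERT INTO daily_event_counts  -- assumed target table
        SELECT DATE(event_time) AS day, COUNT(*) AS events
        FROM user_events
        WHERE event_time >= '{{ data_interval_start }}'
          AND event_time < '{{ data_interval_end }}'
        GROUP BY DATE(event_time);
    """,
)
```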
Backfilling means running your DAG for past dates. Maybe you just created the DAG and want to process historical data, or you fixed a bug and need to reprocess last week.
You built a pipeline and want to backfill data from Jan 1 to today.
A task had a bug; you fixed it and want to reprocess affected dates.
Source data was corrupted and restored; you need to re-run the pipeline.
Use the CLI: airflow dags backfill -s START_DATE -e END_DATE dag_id
start_date tells Airflow: "Don't schedule any runs before this date." The first run will be for the first schedule slot on or after start_date.
end_date (optional) tells Airflow: "Stop scheduling after this date." Useful for one-time or temporary DAGs.
Don't use datetime.now() or dynamic dates for start_date! Airflow parses the DAG file once and caches it. Use a fixed date like datetime(2024, 1, 1).
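A quick sketch of the difference (the dag_id is made up):

```python
from datetime import datetime
from airflow import DAG

# BAD: datetime.now() changes every time the scheduler re-parses the file,
# so Airflow can never pin down a stable first schedule slot.
# DAG(dag_id="events", start_date=datetime.now(), schedule="@daily")

# GOOD: a fixed date gives the scheduler a stable anchor.
with DAG(
    dag_id="events",  # hypothetical id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ...
```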
catchup controls what happens when there's a gap between start_date and "now."
With catchup=True, Airflow will create DAG runs for every schedule slot between start_date and now. If you deploy a daily DAG with start_date 30 days ago, it will queue 30 runs!
With catchup=False, Airflow only schedules from "now" onward. No past runs. Use this when you don't need historical backfill and want to avoid a flood of runs.
Most production DAGs use catchup=False. Do backfills explicitly with airflow dags backfill when needed, rather than catching up automatically.
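In code, the whole difference is one flag; a sketch with a hypothetical dag_id:

```python
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="user_events_daily",  # hypothetical id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    # catchup=True would queue a run for every day since 2024-01-01 on first deploy;
    # catchup=False (the usual production choice) starts from "now" instead.
    catchup=False,
) as dag:
    ...
```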
In Airflow 2.2+, timetables offer a more powerful, programmatic way to define schedules. Instead of cron, you write a Python class that returns when the next run should be.
Use timetables when you need schedules that cron can't express, such as skipping public holidays or following an irregular business calendar. For most use cases, presets and cron expressions are enough. Timetables are for advanced scenarios.
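For a taste of the API, here is a minimal sketch loosely modeled on the workday-timetable example in the Airflow docs. The class name and the "skip weekends" rule are invented for illustration (weekday-only schedules are actually expressible in cron; real timetables earn their keep on rules cron can't handle, like holiday calendars):

```python
from datetime import timedelta

from pendulum import UTC, Date, DateTime, Time

from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable


class SkipWeekendsTimetable(Timetable):
    """Daily midnight-to-midnight intervals, skipping Saturdays and Sundays."""

    def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
        # Treat a manual trigger as covering the previous full day.
        start = DateTime.combine(run_after.date() - timedelta(days=1), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=start + timedelta(days=1))

    def next_dagrun_info(self, *, last_automated_data_interval, restriction: TimeRestriction):
        if last_automated_data_interval is not None:
            next_start = last_automated_data_interval.end
        else:
            if restriction.earliest is None:
                return None  # no start_date: never schedule
            next_start = DateTime.combine(restriction.earliest.date(), Time.min).replace(tzinfo=UTC)
            if not restriction.catchup:
                today = DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC)
                next_start = max(next_start, today)
        while next_start.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            next_start += timedelta(days=1)
        if restriction.latest is not None and next_start > restriction.latest:
            return None
        return DagRunInfo.interval(start=next_start, end=next_start + timedelta(days=1))


# To use it: register the class in an AirflowPlugin's `timetables` list,
# then pass schedule=SkipWeekendsTimetable() in your DAG definition.
```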
See how a daily DAG's logical_date differs from when it actually runs:
For Monday's data: logical_date is Mon 00:00, but the run happens Tuesday morning
The 5 fields of a cron expression with examples:
0 9 * * MON decoded: minute=0, hour=9, every day, every month, Mondays only
When you backfill, Airflow queues multiple runs in order:
airflow dags backfill -s 2024-01-01 -e 2024-03-15 my_dag
For a daily DAG processing Monday's data:
Monday's data = events from Mon 00:00 (inclusive) to Tue 00:00 (exclusive)
catchup=True: deploy a DAG with start_date 30 days ago → 30 runs queue immediately (Jan 1–30)
catchup=False: deploy the same DAG → only 1 run queues (today). No past runs.
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="daily_user_events",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    process_events = BashOperator(
        task_id="process_events",
        bash_command='echo "Processing {{ ds }}"',
    )
```
with DAG( dag_id="weekly_report", start_date=datetime(2024, 1, 1), schedule="0 9 * * MON", # Every Monday at 9:00 AM catchup=False, ) as dag: # ... tasks ...
```python
from datetime import datetime, timedelta

with DAG(
    dag_id="hourly_sync",
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(hours=3),
    catchup=False,
) as dag:
    ...  # Runs every 3 hours
```
with DAG( dag_id="campaign_2024_q1", start_date=datetime(2024, 1, 1), end_date=datetime(2024, 3, 31), schedule="@daily", ) as dag: # Will stop scheduling after March 31, 2024
Use Jinja templating to pass dates into your tasks:
```python
# {{ ds }} = logical_date as YYYY-MM-DD string
# {{ data_interval_start }} = start of data period
# {{ data_interval_end }} = end of data period
process = BashOperator(
    task_id="process",
    bash_command='''
        echo "Processing data for {{ ds }}"
        echo "Interval: {{ data_interval_start }} to {{ data_interval_end }}"
    ''',
)

# In a PythonOperator, access the same values via the context:
def my_task(**context):
    ds = context["ds"]
    interval_start = context["data_interval_start"]
    interval_end = context["data_interval_end"]
    return f"Processed {ds}"
```
```bash
# Backfill from Jan 1 to Mar 15, 2024
airflow dags backfill -s 2024-01-01 -e 2024-03-15 daily_user_events

# Backfill from a single date (no -e)
airflow dags backfill -s 2024-01-01 daily_user_events

# Backfill and re-run tasks even if they already succeeded
airflow dags backfill -s 2024-01-01 --reset-dagruns daily_user_events
```
Write the cron expression for: "Every day at 6:30 AM"
30 6 * * * — minute=30, hour=6, every day/month/dow
A daily DAG with schedule="@daily" runs at 6am on Wednesday, March 13, 2024. What is the logical_date (data_interval_start) for this run?
2024-03-12 00:00:00 (Tuesday midnight). The run on Wednesday 6am processes Tuesday's data. The data interval is Tuesday 00:00 → Wednesday 00:00.
Write the cron expression for: "Every 15 minutes"
*/15 * * * * — step of 15 in the minute field, every hour, every day, etc.
You have a table user_events with an event_time timestamp. Write the WHERE clause for a daily incremental DAG so each run processes only that day's events. Use Airflow template variables.
WHERE event_time >= '{{ data_interval_start }}' AND event_time < '{{ data_interval_end }}'