MODULE 3 OF 15

Scheduling Mastery

Master cron expressions, execution dates, incremental processing, and backfilling!

What is Scheduling?

Scheduling means running your DAGs automatically at fixed intervals, without you having to click "Trigger DAG" every time. Instead of manual runs, Airflow's scheduler decides when each DAG should run based on the schedule you define.

Explain Like I'm 5

Imagine you have a pet hamster that needs feeding every morning at 8am. You could wake up every day and feed it yourself — or you could set an alarm clock. The alarm schedules the feeding for you automatically!

Airflow scheduling is like that alarm clock — you tell it "run this pipeline every day at midnight" or "every hour" and it wakes up and runs your DAG at the right time, every time.

Real-World Example

A company tracks user events (clicks, sign-ups, purchases) in a database. Every day at 6am, they need a pipeline to process yesterday's events, aggregate them, and update their analytics dashboard. Without scheduling, someone would have to log in at 6am every single day. With Airflow scheduling, the DAG runs automatically at 6am every day — no human needed!

schedule_interval / schedule Parameter

In your DAG definition, you specify when the DAG runs using either schedule_interval (older Airflow) or schedule (Airflow 2.2+). They do the same thing!

Naming Change

schedule_interval was deprecated in favor of schedule. Use schedule in new DAGs. Both work, but schedule is the modern choice.

You can use three types of schedules:

  • Presets — Simple words like @daily, @weekly
  • Cron expressions — Flexible like 0 9 * * MON (every Monday at 9am)
  • timedelta — Python intervals like timedelta(hours=3)

Airflow Presets

Airflow provides convenient preset strings for common schedules. No need to learn cron syntax for these!

Preset      Meaning                                 Equivalent Cron
@once       Run once (no automatic repeats)         N/A
@hourly     Every hour at minute 0                  0 * * * *
@daily      Every day at midnight (00:00)           0 0 * * *
@weekly     Every Sunday at midnight                0 0 * * 0
@monthly    First day of every month at midnight    0 0 1 * *
@yearly     January 1st at midnight                 0 0 1 1 *

When to Use Presets

Use presets when your schedule matches exactly. Need "every Monday at 9am"? Presets won't help — you'll need a cron expression!

Cron Expressions — Deep Dive

A cron expression has 5 fields separated by spaces. Each field controls one part of the schedule:

Cron Format: 5 Fields

minute   hour   day-of-month   month   day-of-week
0        9      *              *       MON

(every Monday at 9:00 AM)

Field          Valid Values                          Example
minute         0–59                                  0 = at :00, 30 = at :30
hour           0–23                                  9 = 9am, 0 = midnight
day-of-month   1–31                                  1 = 1st, 15 = 15th
month          1–12 or JAN–DEC                       1 = Jan, 12 = Dec
day-of-week    0–7 (0 and 7 = Sunday) or MON–SUN     MON = Monday, 5 = Friday

Special Characters

  • * (wildcard) — Every value. * in minute = every minute
  • 1-5 (range) — From 1 to 5 inclusive
  • */15 (step) — Every 15 units. */15 in minute = every 15 minutes
  • 1,15,30 (list) — Only 1, 15, and 30
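The special characters above can be illustrated with a small helper that expands a single cron field into the set of values it matches. This is a simplified sketch for illustration (the function name is made up, and it is not a full cron parser):

```python
def expand_field(expr: str, lo: int, hi: int) -> set[int]:
    """Expand one cron field (e.g. '*/15', '1-5', '1,15,30', '*') into values."""
    values: set[int] = set()
    for part in expr.split(","):
        step = 1
        if "/" in part:  # step syntax: base/step
            part, step_str = part.split("/")
            step = int(step_str)
        if part == "*":  # wildcard: full range for this field
            start, end = lo, hi
        elif "-" in part:  # range: inclusive on both ends
            a, b = part.split("-")
            start, end = int(a), int(b)
        else:  # single value
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values

# Expanding the minute field (valid values 0-59):
print(sorted(expand_field("*/15", 0, 59)))    # [0, 15, 30, 45]
print(sorted(expand_field("1-5", 0, 59)))     # [1, 2, 3, 4, 5]
print(sorted(expand_field("1,15,30", 0, 59))) # [1, 15, 30]
```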

Cron Examples

Expression      Meaning
0 * * * *       Every hour (at minute 0)
0 0 * * *       Every day at midnight
0 9 * * MON     Every Monday at 9am
*/15 * * * *    Every 15 minutes
0 0 1 * *       First day of every month at midnight

Execution Dates Explained (The Tricky Part!)

This is the most confusing part of Airflow! Once you get it, everything clicks. Let's break it down.

The Restaurant Analogy

Imagine you run a restaurant. Every day, you "close the books" for the previous day. On Tuesday morning, you review Monday's sales. You're not reviewing Tuesday's data — Tuesday isn't over yet! You're reviewing the data for the period that just ended (Monday).

Airflow works the same way. A DAG run that starts on Tuesday at 6am is processing Monday's data. The "execution date" (logical_date) points to the beginning of that data period — midnight Monday — even though the run actually happens on Tuesday.

Key Terms

  • logical_date (formerly execution_date) — The start of the data interval. For a daily DAG, it's midnight of the day whose data you're processing.
  • data_interval_start — Same as logical_date. Start of the period.
  • data_interval_end — End of the period. For daily, it's midnight of the next day.

Critical Insight

A DAG run for a period STARTS at the END of that period. You process Monday's data when Monday is over — i.e., on Tuesday. The logical_date = Monday 00:00 (start of Monday), even though you actually run on Tuesday.

Use data_interval_start / data_interval_end

Airflow 2.2+ uses data_interval_start and data_interval_end. The old execution_date is deprecated. New code should use the interval terms!
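The relationships between these terms can be shown with plain datetime arithmetic — illustrative values only, no Airflow API involved:

```python
from datetime import datetime, timedelta

# A daily DAG processing Monday 2024-01-08 (dates chosen for illustration).
logical_date = datetime(2024, 1, 8)           # Mon 00:00 -- start of the data period
data_interval_start = logical_date            # same moment, newer name
data_interval_end = data_interval_start + timedelta(days=1)  # Tue 00:00

# The run is only triggered once the interval has fully elapsed,
# so the earliest possible actual run time is the interval's end:
earliest_trigger = data_interval_end          # Tue 00:00 at the earliest
```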

Incremental Processing

Incremental processing means you process only new data since the last run, instead of reprocessing everything. This saves time and resources.

Explain Like I'm 5

Imagine you have a giant toy box. Every day you get 5 new toys. You could either: (A) count ALL the toys in the box every day, or (B) just add the 5 new ones to your count. Option B is incremental — you only process what's new!

User Events Example (Ch. 3)

Your app logs user events (clicks, page views) to a database. A daily DAG with schedule="@daily" runs every morning. Instead of processing all events ever, you use {{ data_interval_start }} and {{ data_interval_end }} in your SQL:

WHERE event_time >= '{{ data_interval_start }}' AND event_time < '{{ data_interval_end }}'

Each run processes only that day's events. Monday's run gets Monday's events. Tuesday's run gets Tuesday's events. No duplicates, no full-table scans!
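At run time, Airflow substitutes those template variables with concrete timestamps. A rough sketch of what the rendered query looks like for one daily run (the dates are hypothetical, and the string formatting here just mimics the substitution):

```python
from datetime import datetime, timedelta

# One daily run's interval (hypothetical values for Monday 2024-01-08).
data_interval_start = datetime(2024, 1, 8)
data_interval_end = data_interval_start + timedelta(days=1)

# Roughly what the templated WHERE clause renders to for this run:
query = (
    "SELECT * FROM user_events "
    f"WHERE event_time >= '{data_interval_start}' "
    f"AND event_time < '{data_interval_end}'"
)
print(query)
```

Note the half-open interval: `>=` on the start and `<` on the end, so midnight events land in exactly one run.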

Backfilling

Backfilling means running your DAG for past dates. Maybe you just created the DAG and want to process historical data, or you fixed a bug and need to reprocess last week.

When to Backfill

1. New DAG: You built a pipeline and want to backfill data from Jan 1 to today.

2. Bug Fix: A task had a bug; you fixed it and want to reprocess affected dates.

3. Data Recovery: Source data was corrupted and restored; you need to re-run the pipeline.

Use the CLI: airflow dags backfill -s START_DATE -e END_DATE dag_id

start_date and end_date Behavior

start_date tells Airflow: "Don't schedule any runs before this date." The first run will be for the first schedule slot on or after start_date.

end_date (optional) tells Airflow: "Stop scheduling after this date." Useful for one-time or temporary DAGs.

start_date Gotcha

Don't use datetime.now() or other dynamic dates for start_date! Airflow re-parses the DAG file regularly, so a dynamic start_date shifts on every parse — the scheduler can never pin down the first schedule slot, and runs may never trigger. Use a fixed date like datetime(2024, 1, 1).

catchup Parameter

catchup controls what happens when there's a gap between start_date and "now."

catchup=True (default)

Airflow will create DAG runs for every schedule slot between start_date and now. If you deploy a daily DAG with start_date 30 days ago, it will queue 30 runs!

catchup=False

Airflow only schedules from "now" onward. No past runs. Use this when you don't need historical backfill and want to avoid a flood of runs.
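The "flood of runs" under catchup=True is just interval counting. A small sketch (the helper name is made up) of how many runs would queue:

```python
from datetime import datetime, timedelta

def queued_catchup_runs(start_date, now, interval):
    """Count completed schedule intervals between start_date and now.

    Each fully elapsed interval gets one DAG run under catchup=True.
    """
    runs = 0
    interval_end = start_date + interval
    while interval_end <= now:
        runs += 1
        interval_end += interval
    return runs

# Daily DAG deployed 30 days after its start_date -> 30 runs queue:
print(queued_catchup_runs(datetime(2024, 1, 1), datetime(2024, 1, 31), timedelta(days=1)))  # 30
```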

Recommendation

Most production DAGs use catchup=False. Do backfills explicitly with airflow dags backfill when needed, rather than catching up automatically.

Timetables (Airflow 2.x)

In Airflow 2.2+, timetables offer a more powerful, programmatic way to define schedules. Instead of cron, you write a Python class that returns when the next run should be.

Use timetables when you need:

  • Custom logic (e.g., "run on the last business day of each month")
  • Schedules that depend on external factors

For most use cases, presets and cron expressions are enough. Timetables are for advanced scenarios.
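The date logic behind a rule like "last business day of each month" is plain Python — a custom timetable class would wrap a computation like this sketch (the function name is an assumption for illustration, not part of the Airflow API):

```python
import calendar
from datetime import date, timedelta

def last_business_day(year: int, month: int) -> date:
    """Return the last weekday (Mon-Fri) of the given month."""
    # Start from the last calendar day of the month...
    d = date(year, month, calendar.monthrange(year, month)[1])
    # ...and walk backward past any weekend days.
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

print(last_business_day(2024, 3))  # 2024-03-29 (Mar 31 is a Sunday)
```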

Execution Dates vs Actual Run Time

See how a daily DAG's logical_date differs from when it actually runs:

data_interval: Mon 00:00 → Tue 00:00  |  logical_date = Mon 00:00  |  actual run: Tue 06:00

For Monday's data: logical_date is Mon 00:00, but the run happens Tuesday morning

Cron Expression Visual Decoder

The 5 fields of a cron expression with examples:

minute   hour   day-of-month   month    day-of-week
0        9      *              *        MON
(0–59)   (0–23) (1–31)         (1–12)   (0–7)

0 9 * * MON decoded: minute=0, hour=9, every day, every month, Mondays only

Backfilling Timeline

When you backfill, Airflow queues multiple runs in order:

Jan 1 → Jan 2 → Jan 3 → ... → Today (one DAG run per schedule slot, executed in date order)

airflow dags backfill -s 2024-01-01 -e 2024-03-15 my_dag

data_interval_start and data_interval_end

For a daily DAG processing Monday's data:

data_interval_start = Mon 2024-01-08 00:00   →   data_interval_end = Tue 2024-01-09 00:00

Monday's data = events from Mon 00:00 (inclusive) to Tue 00:00 (exclusive)

catchup=True vs catchup=False

catchup=True

Deploy the DAG with start_date 30 days ago → 30 runs queue immediately (Run 1, Run 2, ..., Run 30).

catchup=False

Deploy the same DAG → only one run queues (today's). No past runs.

DAG with @daily Schedule

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="daily_user_events",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    process_events = BashOperator(
        task_id="process_events",
        bash_command='echo "Processing {{ ds }}"',
    )

DAG with Cron (Every Monday 9am)

with DAG(
    dag_id="weekly_report",
    start_date=datetime(2024, 1, 1),
    schedule="0 9 * * MON",  # Every Monday at 9:00 AM
    catchup=False,
) as dag:
    # ... tasks ...

DAG with timedelta (Every 3 Hours)

from datetime import datetime, timedelta

with DAG(
    dag_id="hourly_sync",
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(hours=3),
    catchup=False,
) as dag:
    # Runs every 3 hours

DAG with start_date and end_date

with DAG(
    dag_id="campaign_2024_q1",
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 3, 31),
    schedule="@daily",
) as dag:
    # Will stop scheduling after March 31, 2024

Execution Context in Tasks

Use Jinja templating to pass dates into your tasks:

# {{ ds }} = logical_date as YYYY-MM-DD string
# {{ data_interval_start }} = start of data period
# {{ data_interval_end }} = end of data period

process = BashOperator(
    task_id="process",
    bash_command='''
        echo "Processing data for {{ ds }}"
        echo "Interval: {{ data_interval_start }} to {{ data_interval_end }}"
    ''',
)

# In PythonOperator, access via context:
def my_task(**context):
    ds = context["ds"]
    interval_start = context["data_interval_start"]
    interval_end = context["data_interval_end"]
    return f"Processed {ds}"

Backfill Command

# Backfill from Jan 1 to Mar 15, 2024
airflow dags backfill -s 2024-01-01 -e 2024-03-15 daily_user_events

# Backfill from a single date (no -e)
airflow dags backfill -s 2024-01-01 daily_user_events

# Backfill and run tasks even if they already succeeded (re-run)
airflow dags backfill -s 2024-01-01 --reset-dagruns daily_user_events

Practice Exercises

Exercise 1: Write a Cron Expression

Write the cron expression for: "Every day at 6:30 AM"

Answer

30 6 * * * — minute=30, hour=6, every day/month/dow

Exercise 2: Predict the Execution Date

A daily DAG with schedule="@daily" runs at 6am on Wednesday, March 13, 2024. What is the logical_date (data_interval_start) for this run?

Answer

2024-03-12 00:00:00 (Tuesday midnight). The run on Wednesday 6am processes Tuesday's data. The data interval is Tuesday 00:00 → Wednesday 00:00.

Exercise 3: Another Cron

Write the cron expression for: "Every 15 minutes"

Answer

*/15 * * * * — step of 15 in the minute field, every hour, every day, etc.

Exercise 4: Incremental DAG Setup

You have a table user_events with an event_time timestamp column. Write the WHERE clause for a daily incremental DAG so each run processes only that day's events. Use Airflow template variables.

Answer

WHERE event_time >= '{{ data_interval_start }}' AND event_time < '{{ data_interval_end }}'

Module 3 Quiz

1. What does scheduling mean in Airflow?

2. What cron expression means "every hour"?

3. A daily DAG runs on Tuesday at 6am. What data does it process?

4. What is logical_date (formerly execution_date)?

5. What does catchup=False do?

6. What command do you use to backfill a DAG?

7. Which template variable gives you the logical date as YYYY-MM-DD?

8. Incremental processing means: