Executors, monitoring, logging, and scaling!
The executor is the component that decides where and how your tasks run. Choosing the right executor is critical for production: it affects parallelism, reliability, and cost.
Imagine a restaurant kitchen. The executor is the head chef who decides who does what:
Sequential = One chef does every dish, one at a time. Slow but simple.
Local = Several chefs in the same kitchen, each doing one dish at a time.
Celery = Many kitchens (workers) in different places, all taking orders from one central board.
Kubernetes = The kitchen can grow or shrink automatically: more chefs when it's busy, fewer when it's quiet.
| Executor | Use Case | Parallelism | Production? |
|---|---|---|---|
| SequentialExecutor | Learning, local dev | 1 task at a time | No |
| LocalExecutor | Single machine, small teams | Multiple tasks (multiprocessing) | Yes (small scale) |
| CeleryExecutor | Distributed workers, high throughput | Many workers, horizontal scale | Yes |
| KubernetesExecutor | Cloud-native, auto-scaling | Pods per task, scale to zero | Yes |
SequentialExecutor runs one task after another in the same process. No parallelism — good only for trying Airflow on your laptop.
LocalExecutor runs tasks as parallel subprocesses on the same machine as the scheduler, with task state tracked in the metadata database. Simple to set up; limited by one box.
CeleryExecutor uses Celery workers and a message broker (e.g. Redis or RabbitMQ). You add more worker machines to run more tasks — classic production choice.
KubernetesExecutor creates a new Kubernetes Pod for each task and tears it down when done. Perfect for cloud, isolation, and scaling with cluster size.
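With KubernetesExecutor you can also tailor the Pod per task through executor_config. The snippet below is a minimal sketch, assuming Airflow 2.x with the cncf.kubernetes provider and the kubernetes Python client installed; the DAG name and resource values are illustrative.

```python
# Sketch: per-task Pod customization under KubernetesExecutor (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# Ask Kubernetes for more CPU/memory for this one task only.
heavy_resources = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the task container is always named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="k8s_executor_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=lambda: print("transforming"),
        executor_config=heavy_resources,  # only this task gets the bigger Pod
    )
```

Every task still gets its own Pod; executor_config only overrides that Pod's spec.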
Airflow's metadata database (metastore) stores DAG definitions, task states, run history, connections, variables, and more. In production it must be robust and backed up.
SQLite is only for development. Production needs PostgreSQL (preferred) or MySQL for concurrent scheduler and worker access.
Use PgBouncer or similar to avoid exhausting DB connections when you have many workers.
Schedule regular backups. Restoring from a backup is the main recovery path after a metastore failure.
The metastore is the single source of truth. If it's slow or down, the whole Airflow deployment is affected. Treat it like a critical production database.
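As a concrete illustration, the connection is a single setting; a minimal sketch assuming PostgreSQL with placeholder credentials (the key sits under [database] in newer 2.x releases and under [core] in older ones):
# airflow.cfg [database] (older 2.x: [core])
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@your-db-host:5432/airflow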
Task logs tell you what happened during a run. In production you need to know where logs go and how to keep them when workers or pods disappear.
By default, logs are written to the local filesystem of the worker that ran the task (e.g. AIRFLOW_HOME/logs/dag_id/task_id/date/run.log). That's fine for a single machine; it's a problem when you have many workers or ephemeral pods.
Configure remote logging so logs are stored in a central, durable place:
In airflow.cfg you set the logging section to use a handler that writes to your chosen backend (e.g. remote_base_log_folder for S3/GCS). The UI then fetches logs from that location instead of the local disk.
Local logs are like sticky notes on each chef's station — if the chef leaves, the notes are gone. Remote logging is like writing everything in a shared notebook in the office: anyone can look up what happened, even after the kitchen is reorganized.
Production Airflow needs visibility into health and performance. The Airflow UI is a start; for serious ops you add metrics and dashboards.
Airflow can expose Prometheus metrics (task success/failure counts, DAG parse times, scheduler loop duration, etc.). Prometheus scrapes these; Grafana displays them in dashboards. You can track:
Scheduler lag: how far behind "now" the scheduler is. Growing lag means tasks are being delayed.
Queued tasks: number of tasks waiting for a worker. Consistently high queue = need more workers or faster tasks.
Failure rate: failures per DAG or globally. Spikes often indicate upstream or config issues.
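One common way to wire this up is to turn on Airflow's StatsD metrics and point them at a statsd_exporter that Prometheus scrapes. A minimal sketch; the host and port are placeholders and exact keys can vary by version:
# airflow.cfg [metrics] (Airflow 2.x; older versions use statsd_* keys under [scheduler])
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow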
When something goes wrong, you want to know immediately. Airflow supports alerting on failure (and other events) so you're notified instead of discovering broken pipelines later.
You can configure:
At the DAG or task level you set on_failure_callback (and optionally on_success_callback) to call a function that sends the alert. You can also use SLA misses: define a time limit for a task or DAG; if it's exceeded, Airflow can trigger an SLA miss callback (e.g. email or webhook).
In production, pipelines run unattended. Without alerting, a single failing task or stuck scheduler can go unnoticed until users or downstream systems complain. Set up at least email or Slack on failure for critical DAGs.
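Below is a minimal sketch of a DAG with a failure callback and an SLA, assuming Airflow 2.x; the notification bodies are placeholders you would replace with your Slack, email, or PagerDuty integration.

```python
# Sketch: failure callback + SLA for a critical DAG (Airflow 2.x).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_failure(context):
    # context includes the task instance, dag_run, exception, log URL, etc.
    ti = context["task_instance"]
    # Placeholder: swap in a Slack webhook, email, or PagerDuty call.
    print(f"ALERT: {ti.dag_id}.{ti.task_id} failed on {context['ds']}")


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Placeholder: called when tasks exceed their SLA.
    print(f"SLA missed in {dag.dag_id}: {task_list}")


with DAG(
    dag_id="critical_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
    default_args={
        "on_failure_callback": notify_failure,
        "sla": timedelta(hours=1),  # tasks should finish within 1h of the scheduled run time
        "retries": 1,
    },
) as dag:
    BashOperator(task_id="extract", bash_command="echo extracting")
```

Setting the callback and SLA in default_args applies them to every task in the DAG; you can also set them per task.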
Scaling means handling more DAGs and more tasks without everything slowing down. You scale workers and, in some setups, schedulers.
With CeleryExecutor or LocalExecutor, you run multiple worker processes (or machines). More workers = more tasks running at once. Tune the number based on queue depth and CPU/memory. Avoid running more workers than your queue typically needs — it wastes resources and can overload the metastore.
Airflow 2.x supports HA (high availability) schedulers: you can run several scheduler processes at once, coordinated through database row-level locks. If one scheduler dies, the others keep scheduling. The main win is availability (plus lower scheduling latency under load); it does not by itself let you run more tasks, which is still capped by your workers and parallelism settings.
With KubernetesExecutor, each task runs in its own Pod. Scaling is handled by the cluster: you add nodes or use cluster autoscaler. No need to manually add Airflow workers — the cluster grows when the queue grows.
More workers = more chefs in the kitchen. Multiple schedulers = having a second manager who shares the work and keeps assigning dishes if the other is sick. Kubernetes scaling = the restaurant can open extra pop-up kitchens when there's a rush and close them when it's quiet.
Small config changes can have a big impact on throughput and stability.
Start with defaults, then adjust based on metrics: if the queue is always full, increase parallelism or add workers; if the scheduler is CPU-bound, increase min_file_process_interval or reduce DAG complexity.
Change one thing at a time and watch metrics. Tuning is iterative — what works for 10 DAGs may not work for 500.
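For example, two of the knobs mentioned above live in different sections of airflow.cfg; illustrative values only, adjust based on your metrics:
# airflow.cfg tuning knobs (illustrative values)
[core]
parallelism = 64
[scheduler]
# Seconds between re-parses of each DAG file; raising it lowers scheduler CPU load
min_file_process_interval = 60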
How each executor runs tasks: from one at a time to many machines.
Top: Executor types. Bottom: Typical monitoring flow (Airflow → Prometheus → Grafana → Alerts)
In production, Airflow feeds metrics into Prometheus; Grafana visualizes them; alerting notifies on thresholds.
Airflow (scheduler, workers) exposes /metrics → Prometheus scrapes and stores time-series → Grafana queries Prometheus and shows dashboards → Alertmanager (or Grafana alerts) sends to Slack, email, or PagerDuty when rules fire.
Logs can go in parallel: task logs → remote storage (S3/GCS) or to Elasticsearch for search. The Airflow UI reads logs from the configured backend.
Key sections for production: executor, parallelism, and logging.
# airflow.cfg — [core]
# Use Celery for distributed workers (or LocalExecutor for a single machine)
executor = CeleryExecutor
# Max task instances running across the whole deployment
parallelism = 32
# Max concurrent task instances per DAG
dag_concurrency = 16
# Max active DAG runs per DAG (prevents one DAG from filling the queue)
max_active_runs_per_dag = 3
# airflow.cfg — [celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres/airflow
# Concurrency per worker (tasks per worker process)
worker_concurrency = 4
# airflow.cfg — [logging]
remote_logging = True
remote_base_log_folder = s3://your-bucket/airflow-logs/
remote_log_conn_id = aws_default
encrypt_s3_logs = False
# Enable Prometheus-style metrics (Airflow 2.x)
# Configure the [metrics] section in airflow.cfg (or AIRFLOW__METRICS__* env vars).
# Airflow emits StatsD metrics; a statsd_exporter (or a community Airflow
# Prometheus exporter) exposes them at /metrics for Prometheus to scrape.
# Depending on your exporter's mapping, you get series such as:
# - airflow_task_fail_total
# - airflow_dag_processing_total_parse_time
# - airflow_scheduler_heartbeat
Exact config keys can vary by Airflow version. Always check the official airflow.cfg template for your version. Use environment variables (e.g. AIRFLOW__CORE__EXECUTOR) in containers instead of editing the file directly when possible.
Try these to reinforce production operations. Use the Learn and Code tabs as reference.
Your team has 50 DAGs, about 500 tasks per day, and runs everything on a single beefy server. You want parallelism but don't want to introduce Redis or Kubernetes yet.
Question: Which executor would you choose and why?
You need more than one task at a time (so not Sequential). You're on one machine and don't want extra infrastructure (no Celery broker, no K8s).
LocalExecutor. It runs multiple tasks in parallel on the same machine as subprocesses, with task state tracked in the metadata DB. No Redis or Celery workers needed. When you outgrow the box, you can move to CeleryExecutor.
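Switching is a one-line change, for example:
# airflow.cfg [core]
executor = LocalExecutor
# or, in a container: AIRFLOW__CORE__EXECUTOR=LocalExecutor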
You use KubernetesExecutor. A task fails; you open the Airflow UI and click "Log" for that task instance. Where does the log content come from?
With default config, logs are written inside the task's Pod. When the Pod is removed after the run, those logs are gone. So you must configure remote logging (e.g. S3 or GCS). The UI then fetches logs from that remote storage. Without remote logging, you may see "log not found" for completed or failed tasks whose Pods are already deleted.
Your company enables "HA Scheduler" with two scheduler processes. A colleague says: "Now we can run twice as many DAGs." Is that correct?
Not really. In Airflow 2.x, both schedulers are active and share the scheduling work (coordinated through database row locks), so scheduling latency can improve and, if one scheduler crashes, the other keeps going. But how many DAGs and tasks you can actually run is still capped by your workers, parallelism settings, and the metastore, so adding a scheduler does not double capacity.
Your Grafana dashboard shows that the "scheduler" queue often has 200+ tasks waiting. You have 10 Celery workers with concurrency 4 (40 slots). What are two things you could do?
1) Add more workers (or increase worker_concurrency) so more tasks run at once. 2) Increase parallelism in airflow.cfg so the scheduler is allowed to queue more tasks for execution (if the cap is currently lower than 40, raising it lets workers use all slots). Also check that dag_concurrency and max_active_runs_per_dag aren’t limiting specific DAGs too tightly.
Test your understanding of production operations. Click the answer you think is correct.