Executors, monitoring, logging, and scaling!
The executor is the component that decides where and how your tasks run. Choosing the right executor is critical for production: it affects parallelism, reliability, and cost.
Imagine a restaurant kitchen. The executor is the head chef who decides who does what:
Sequential = One chef does every dish, one at a time. Slow but simple.
Local = Several chefs in the same kitchen, each doing one dish at a time.
Celery = Many kitchens (workers) in different places, all taking orders from one central board.
Kubernetes = The kitchen can grow or shrink automatically: more chefs when it's busy, fewer when it's quiet.
| Executor | Use Case | Parallelism | Production? |
|---|---|---|---|
| SequentialExecutor | Learning, local dev | 1 task at a time | No |
| LocalExecutor | Single machine, small teams | Multiple tasks (multiprocessing) | Yes (small scale) |
| CeleryExecutor | Distributed workers, high throughput | Many workers, horizontal scale | Yes |
| KubernetesExecutor | Cloud-native, auto-scaling | Pods per task, scale to zero | Yes |
SequentialExecutor runs one task after another in the same process. No parallelism — good only for trying Airflow on your laptop.
LocalExecutor runs tasks as parallel subprocesses on the same machine as the scheduler, with task state tracked in the metadata database. Simple to set up; limited by one box.
CeleryExecutor uses Celery workers and a message broker (e.g. Redis or RabbitMQ). You add more worker machines to run more tasks — classic production choice.
KubernetesExecutor creates a new Kubernetes Pod for each task and tears it down when done. Perfect for cloud, isolation, and scaling with cluster size.
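With KubernetesExecutor you can also tailor the Pod per task through executor_config. The snippet below is a minimal sketch, assuming Airflow 2.x with the cncf.kubernetes provider and the kubernetes Python client installed; the DAG name and resource values are illustrative.

```python
# Sketch: per-task Pod customization under KubernetesExecutor (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# Ask Kubernetes for more CPU/memory for this one task only.
heavy_resources = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the task container is always named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="k8s_executor_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=lambda: print("transforming"),
        executor_config=heavy_resources,  # only this task gets the bigger Pod
    )
```

Every task still gets its own Pod; executor_config only overrides that Pod's spec.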
Airflow's metadata database (metastore) stores DAG definitions, task states, run history, connections, variables, and more. In production it must be robust and backed up.
SQLite is only for development. Production needs PostgreSQL (preferred) or MySQL for concurrent scheduler and worker access.
Use PgBouncer or similar to avoid exhausting DB connections when you have many workers.
Schedule regular backups. Restoring from a backup is the main recovery path after a metastore failure.
The metastore is the single source of truth. If it's slow or down, the whole Airflow deployment is affected. Treat it like a critical production database.
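As a concrete illustration, the connection is a single setting; a minimal sketch assuming PostgreSQL with placeholder credentials (the key sits under [database] in newer 2.x releases and under [core] in older ones):
# airflow.cfg [database] (older 2.x: [core])
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@your-db-host:5432/airflow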
Task logs tell you what happened during a run. In production you need to know where logs go and how to keep them when workers or pods disappear.
By default, logs are written to the local filesystem of the worker that ran the task (e.g. AIRFLOW_HOME/logs/dag_id/task_id/date/run.log). That's fine for a single machine; it's a problem when you have many workers or ephemeral pods.
Configure remote logging so logs are stored in a central, durable place:
In airflow.cfg you set the logging section to use a handler that writes to your chosen backend (e.g. remote_base_log_folder for S3/GCS). The UI then fetches logs from that location instead of the local disk.
Local logs are like sticky notes on each chef's station — if the chef leaves, the notes are gone. Remote logging is like writing everything in a shared notebook in the office: anyone can look up what happened, even after the kitchen is reorganized.
Production Airflow needs visibility into health and performance. The Airflow UI is a start; for serious ops you add metrics and dashboards.
Airflow can expose Prometheus metrics (task success/failure counts, DAG parse times, scheduler loop duration, etc.). Prometheus scrapes these; Grafana displays them in dashboards. You can track:
Scheduler lag: how far behind "now" the scheduler is. Growing lag means tasks are being delayed.
Queued tasks: number of tasks waiting for a worker. Consistently high queue = need more workers or faster tasks.
Failure rate: failures per DAG or globally. Spikes often indicate upstream or config issues.
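One common way to wire this up is to turn on Airflow's StatsD metrics and point them at a statsd_exporter that Prometheus scrapes. A minimal sketch; the host and port are placeholders and exact keys can vary by version:
# airflow.cfg [metrics] (Airflow 2.x; older versions use statsd_* keys under [scheduler])
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow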
When something goes wrong, you want to know immediately. Airflow supports alerting on failure (and other events) so you're notified instead of discovering broken pipelines later.
You can configure:
At the DAG or task level you set on_failure_callback (and optionally on_success_callback) to call a function that sends the alert. You can also use SLA misses: define a time limit for a task or DAG; if it's exceeded, Airflow can trigger an SLA miss callback (e.g. email or webhook).
In production, pipelines run unattended. Without alerting, a single failing task or stuck scheduler can go unnoticed until users or downstream systems complain. Set up at least email or Slack on failure for critical DAGs.
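Below is a minimal sketch of a DAG with a failure callback and an SLA, assuming Airflow 2.x; the notification bodies are placeholders you would replace with your Slack, email, or PagerDuty integration.

```python
# Sketch: failure callback + SLA for a critical DAG (Airflow 2.x).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_failure(context):
    # context includes the task instance, dag_run, exception, log URL, etc.
    ti = context["task_instance"]
    # Placeholder: swap in a Slack webhook, email, or PagerDuty call.
    print(f"ALERT: {ti.dag_id}.{ti.task_id} failed on {context['ds']}")


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Placeholder: called when tasks exceed their SLA.
    print(f"SLA missed in {dag.dag_id}: {task_list}")


with DAG(
    dag_id="critical_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
    default_args={
        "on_failure_callback": notify_failure,
        "sla": timedelta(hours=1),  # tasks should finish within 1h of the scheduled run time
        "retries": 1,
    },
) as dag:
    BashOperator(task_id="extract", bash_command="echo extracting")
```

Setting the callback and SLA in default_args applies them to every task in the DAG; you can also set them per task.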
Scaling means handling more DAGs and more tasks without everything slowing down. You scale workers and, in some setups, schedulers.
With CeleryExecutor or LocalExecutor, you run multiple worker processes (or machines). More workers = more tasks running at once. Tune the number based on queue depth and CPU/memory. Avoid running more workers than your queue typically needs — it wastes resources and can overload the metastore.
Airflow 2.x supports HA (high availability) schedulers: you can run several scheduler processes at once, coordinated through database row-level locks. If one scheduler dies, the others keep scheduling. The main win is availability (plus lower scheduling latency under load); it does not by itself let you run more tasks, which is still capped by your workers and parallelism settings.
With KubernetesExecutor, each task runs in its own Pod. Scaling is handled by the cluster: you add nodes or use cluster autoscaler. No need to manually add Airflow workers — the cluster grows when the queue grows.
More workers = more chefs in the kitchen. Multiple schedulers = having a second manager who shares the work and keeps assigning dishes if the other is sick. Kubernetes scaling = the restaurant can open extra pop-up kitchens when there's a rush and close them when it's quiet.
Small config changes can have a big impact on throughput and stability.
Start with defaults, then adjust based on metrics: if the queue is always full, increase parallelism or add workers; if the scheduler is CPU-bound, increase min_file_process_interval or reduce DAG complexity.
Change one thing at a time and watch metrics. Tuning is iterative — what works for 10 DAGs may not work for 500.
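For example, two of the knobs mentioned above live in different sections of airflow.cfg; illustrative values only, adjust based on your metrics:
# airflow.cfg tuning knobs (illustrative values)
[core]
parallelism = 64
[scheduler]
# Seconds between re-parses of each DAG file; raising it lowers scheduler CPU load
min_file_process_interval = 60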
How each executor runs tasks: from one at a time to many machines.
Top: Executor types. Bottom: Typical monitoring flow (Airflow → Prometheus → Grafana → Alerts)
In production, Airflow feeds metrics into Prometheus; Grafana visualizes them; alerting notifies on thresholds.
Airflow (scheduler, workers) exposes /metrics → Prometheus scrapes and stores time-series → Grafana queries Prometheus and shows dashboards → Alertmanager (or Grafana alerts) sends to Slack, email, or PagerDuty when rules fire.
Logs can go in parallel: task logs → remote storage (S3/GCS) or to Elasticsearch for search. The Airflow UI reads logs from the configured backend.
Key sections for production: executor, parallelism, and logging.
# airflow.cfg — [core]
# Use Celery for distributed workers (or LocalExecutor for a single machine)
executor = CeleryExecutor
# Max task instances running across the whole deployment
parallelism = 32
# Max concurrent task instances per DAG
dag_concurrency = 16
# Max active DAG runs per DAG (prevents one DAG from filling the queue)
max_active_runs_per_dag = 3
# airflow.cfg — [celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres/airflow
# Concurrency per worker (tasks per worker process)
worker_concurrency = 4
# airflow.cfg — [logging]
remote_logging = True
remote_base_log_folder = s3://your-bucket/airflow-logs/
remote_log_conn_id = aws_default
encrypt_s3_logs = False
# Enable Prometheus-style metrics (Airflow 2.x)
# Configure the [metrics] section in airflow.cfg (or AIRFLOW__METRICS__* env vars).
# Airflow emits StatsD metrics; a statsd_exporter (or a community Airflow
# Prometheus exporter) exposes them at /metrics for Prometheus to scrape.
# Depending on your exporter's mapping, you get series such as:
# - airflow_task_fail_total
# - airflow_dag_processing_total_parse_time
# - airflow_scheduler_heartbeat
Exact config keys can vary by Airflow version. Always check the official airflow.cfg template for your version. Use environment variables (e.g. AIRFLOW__CORE__EXECUTOR) in containers instead of editing the file directly when possible.
Try these to reinforce production operations. Use the Learn and Code tabs as reference.
Your team has 50 DAGs, about 500 tasks per day, and runs everything on a single beefy server. You want parallelism but don't want to introduce Redis or Kubernetes yet.
Question: Which executor would you choose and why?
You need more than one task at a time (so not Sequential). You're on one machine and don't want extra infrastructure (no Celery broker, no K8s).
LocalExecutor. It runs multiple tasks in parallel on the same machine as subprocesses, with task state tracked in the metadata DB. No Redis or Celery workers needed. When you outgrow the box, you can move to CeleryExecutor.
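Switching is a one-line change, for example:
# airflow.cfg [core]
executor = LocalExecutor
# or, in a container: AIRFLOW__CORE__EXECUTOR=LocalExecutor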
You use KubernetesExecutor. A task fails; you open the Airflow UI and click "Log" for that task instance. Where does the log content come from?
With default config, logs are written inside the task's Pod. When the Pod is removed after the run, those logs are gone. So you must configure remote logging (e.g. S3 or GCS). The UI then fetches logs from that remote storage. Without remote logging, you may see "log not found" for completed or failed tasks whose Pods are already deleted.
Your company enables "HA Scheduler" with two scheduler processes. A colleague says: "Now we can run twice as many DAGs." Is that correct?
Not really. In Airflow 2.x, both schedulers are active and share the scheduling work (coordinated through database row locks), so scheduling latency can improve and, if one scheduler crashes, the other keeps going. But how many DAGs and tasks you can actually run is still capped by your workers, parallelism settings, and the metastore, so adding a scheduler does not double capacity.
Your Grafana dashboard shows that the "scheduler" queue often has 200+ tasks waiting. You have 10 Celery workers with concurrency 4 (40 slots). What are two things you could do?
1) Add more workers (or increase worker_concurrency) so more tasks run at once. 2) Increase parallelism in airflow.cfg so the scheduler is allowed to queue more tasks for execution (if the cap is currently lower than 40, raising it lets workers use all slots). Also check that dag_concurrency and max_active_runs_per_dag aren’t limiting specific DAGs too tightly.
Test your understanding of production operations. Click the answer you think is correct.