Run your Airflow tasks in Docker and Kubernetes for isolation, reproducibility, and scale.
Your Airflow tasks normally run directly on a worker machine (or on a fixed set of machines). But sometimes a task needs a completely isolated environment, with its own Python version, libraries, or even a different OS. That's where containers shine.
Imagine Airflow is a teacher assigning homework. Usually kids do homework at their desk (same room = same worker). But sometimes the teacher says: "This assignment must be done in the science lab" or "in the computer room." Containers are like those special rooms — each has exactly the tools and environment needed for that one task, and when the task is done, the room is cleaned up. No mess left on the main desk!
Containers give you:
- **Isolation:** each task runs with exactly the Python version, libraries, and OS packages it needs.
- **Reproducibility:** the same image runs the same way on a laptop, a worker, or a cluster.
- **Scale:** many containers can run in parallel across many machines.
Docker lets you package an app and its dependencies into a single unit called an image. You run that image as a container — a lightweight, isolated process. For Airflow, you only need a few ideas:
**Image:** a read-only template (like a recipe). Examples: `python:3.9-slim` or `mycompany/etl-job:latest`. You build or pull images; you don't run them directly.

**Container:** a running instance of an image. When Airflow runs a Docker task, it starts a container from an image, runs a command inside it, then stops the container. One task = one (or more) containers.

**Daemon:** the Docker engine that runs on the host. Airflow's worker must have access to Docker (or to a Kubernetes cluster) to run container-based operators.
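To make these three ideas concrete, here is a minimal sketch using the docker Python SDK (an assumption on top of this section: `pip install docker`, with a Docker daemon running locally). It does by hand roughly what DockerOperator automates: pull an image, run a short-lived container from it, and clean up.

```python
import docker  # the Docker SDK for Python (pip install docker)

client = docker.from_env()  # connect to the local Docker daemon

# Pull the read-only template (the image)...
client.images.pull("python", tag="3.9-slim")

# ...then run a short-lived container from it and capture its output.
output = client.containers.run(
    "python:3.9-slim",
    command=["python", "-c", "print('hello from a container')"],
    remove=True,  # delete the container once it exits
)
print(output.decode())  # -> hello from a container
```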
For Airflow you don't need to be a Docker expert. You need: an image name, an optional command, and (for DockerOperator) a way for the worker to talk to Docker. That's enough to run tasks in containers.
The DockerOperator runs a Docker container as an Airflow task. You specify which image to use and what command to run inside the container. Airflow starts the container, waits for it to finish, and marks the task success or failure based on the container's exit code.
The Airflow worker must have the Docker Python library installed and permission to talk to the Docker daemon (e.g. the worker runs in a container that mounts the Docker socket, or Docker is installed on the worker host).
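Before relying on DockerOperator, it's worth confirming that prerequisite. A quick hedged check you can run on the worker (again assuming the docker SDK is installed):

```python
import docker

try:
    docker.from_env().ping()  # raises if the daemon is unreachable
    print("Docker daemon reachable")
except docker.errors.DockerException as exc:
    print(f"Cannot reach Docker daemon: {exc}")
```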
The KubernetesPodOperator creates a Pod (one or more containers) in a Kubernetes cluster, runs your task inside it, and then tears the Pod down. Kubernetes schedules the Pod onto any available node; you don't manage servers yourself.
DockerOperator is like asking your neighbor to bake one cake in their kitchen: you hand over the recipe (image + command) and they give you the result. KubernetesPodOperator is like a cloud kitchen service: you send the same recipe to "the cloud kitchen" and they find any free kitchen (node), bake the cake, and then close that kitchen. You don't care which kitchen it was; you just get the cake.
| Scenario | Use | Why |
|---|---|---|
| Local dev, single server | DockerOperator | Simple; no K8s cluster needed |
| Airflow on Kubernetes (e.g. Astronomer, Helm) | KubernetesPodOperator | Native; Pods run in same cluster |
| Heavy or variable workloads | KubernetesPodOperator | K8s scales nodes and schedules Pods |
| Strict dependency isolation per task | Either | Both run tasks in isolated containers |
| No Docker/K8s in your stack | BashOperator / PythonOperator | Containers are optional |
If you're already on Kubernetes, prefer KubernetesPodOperator. If you're on VMs or a single server with Docker, use DockerOperator. Both give you isolation and reproducibility; K8s adds orchestration and scaling.
Here's how a single Airflow task flows with each operator:

**DockerOperator flow:** worker → Docker daemon → container. The task runs in a container on the same host (or on a host the worker can reach).

**KubernetesPodOperator flow:** worker → Kubernetes API → Pod on some node. The task runs in a Pod; Kubernetes picks the node and cleans up when done.

In short: with Docker, the worker talks to the Docker daemon on one host. With Kubernetes, the worker talks to the API server and the cluster can run the Pod on any node.
Run a Python script inside a container using the official Python image:
```python
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from datetime import datetime

with DAG(
    dag_id="docker_example_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_in_docker = DockerOperator(
        task_id="run_in_docker",
        image="python:3.9-slim",
        command="python -c \"print('Hello from inside Docker!')\"",
        auto_remove=True,
        docker_url="unix:///var/run/docker.sock",
    )
```
Parameters:
- `image`: which image to run.
- `command`: what to run inside the container.
- `auto_remove=True`: delete the container after it finishes.
- `docker_url`: how the worker reaches Docker (the default is the local socket).
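Two more parameters come up constantly in practice: `environment` to pass variables into the container (it is Jinja-templated, so Airflow macros like `{{ ds }}` work) and `mounts` to share host data. A hedged variant of the task above; the mount path and variable name are illustrative:

```python
from docker.types import Mount
from airflow.providers.docker.operators.docker import DockerOperator

run_etl = DockerOperator(
    task_id="run_etl",
    image="python:3.9-slim",
    command='python -c "import os; print(os.environ[\'RUN_DATE\'])"',
    environment={"RUN_DATE": "{{ ds }}"},  # templated: the task's logical date
    mounts=[Mount(source="/data/input", target="/data", type="bind")],  # illustrative path
    auto_remove=True,
)
```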
Run the same logic as a Pod in Kubernetes:
```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from datetime import datetime

with DAG(
    dag_id="k8s_example_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_in_k8s = KubernetesPodOperator(
        task_id="run_in_k8s",
        namespace="default",
        image="python:3.9-slim",
        cmds=["python", "-c"],
        arguments=["print('Hello from inside Kubernetes!')"],
        name="airflow-k8s-task",
        get_logs=True,
        is_delete_operator_pod=True,
    )
```
Parameters:
- `namespace`: where to create the Pod.
- `image`: the container image.
- `cmds` + `arguments`: the command and its args for the container.
- `get_logs=True`: stream the Pod's logs back to Airflow.
- `is_delete_operator_pod=True`: delete the Pod when the task finishes.
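One capability worth showing, since it motivates the "heavy or variable workloads" row in the table above: per-task resource requests and limits, which Kubernetes enforces. A hedged sketch; `container_resources` is available in recent versions of the cncf-kubernetes provider, and the values are illustrative:

```python
from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

heavy_task = KubernetesPodOperator(
    task_id="heavy_transform",
    namespace="default",
    image="python:3.9-slim",
    cmds=["python", "-c"],
    arguments=["print('crunching...')"],
    name="airflow-heavy-task",
    container_resources=k8s.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},  # guaranteed minimum
        limits={"cpu": "1", "memory": "1Gi"},         # hard ceiling
    ),
    get_logs=True,
    is_delete_operator_pod=True,
)
```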
Airflow must run with a Kubernetes connection configured (e.g. in-cluster config or a kubeconfig file) so the worker can create Pods. Install the provider: `pip install apache-airflow-providers-cncf-kubernetes`.
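How the operator reaches the cluster depends on where Airflow itself runs. A hedged sketch of the two common setups (the kubeconfig path is illustrative):

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Airflow running inside the cluster: authenticate via the Pod's service account.
inside = KubernetesPodOperator(
    task_id="inside_cluster",
    in_cluster=True,
    namespace="default",
    image="python:3.9-slim",
    cmds=["python", "-c"],
    arguments=["print('in-cluster')"],
    name="inside-pod",
)

# Airflow running outside the cluster: point at a kubeconfig file instead.
outside = KubernetesPodOperator(
    task_id="outside_cluster",
    in_cluster=False,
    config_file="/opt/airflow/.kube/config",  # illustrative path
    namespace="default",
    image="python:3.9-slim",
    cmds=["python", "-c"],
    arguments=["print('via kubeconfig')"],
    name="outside-pod",
)
```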
Try these exercises to solidify how containers work with Airflow.
Your DAG has Task A (needs Python 3.8 + pandas 1.2) and Task B (needs Python 3.11 + pandas 2.0). Both run on the same Airflow worker. What problem can this cause, and how do containers help?
Two kids need different toys in the same room: one needs Legos, the other needs Play-Doh. If you only have one table, they fight. Containers are like giving each kid their own room with exactly their toys — no fighting!
Problem: Conflicting dependencies on one worker (different Python or library versions) can cause import errors or wrong behavior.
Containers help: Run Task A in a container with Python 3.8 + pandas 1.2, and Task B in another container with Python 3.11 + pandas 2.0. Each task gets an isolated environment.
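In code, the fix might look like this (a sketch; the image names and scripts are hypothetical):

```python
from airflow.providers.docker.operators.docker import DockerOperator

# Task A: an image with Python 3.8 + pandas 1.2 baked in.
task_a = DockerOperator(
    task_id="task_a",
    image="mycompany/etl-py38-pandas12:latest",  # hypothetical image
    command="python run_task_a.py",
    auto_remove=True,
)

# Task B: a different image with Python 3.11 + pandas 2.0; no conflict possible.
task_b = DockerOperator(
    task_id="task_b",
    image="mycompany/etl-py311-pandas20:latest",  # hypothetical image
    command="python run_task_b.py",
    auto_remove=True,
)
```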
Your company runs Airflow on a single VM with Docker installed. You need one task to run a custom ETL image. Should you use DockerOperator or KubernetesPodOperator? Why?
Use DockerOperator. You have Docker on the VM and no Kubernetes cluster. DockerOperator talks directly to the Docker daemon on that host. KubernetesPodOperator would require a K8s cluster and extra configuration with no benefit in this setup.
Write a single task that uses DockerOperator to run the image alpine:latest with the command echo "Containers are cool". Task id: alpine_echo.
```python
# Solution
from airflow.providers.docker.operators.docker import DockerOperator

alpine_echo = DockerOperator(
    task_id="alpine_echo",
    image="alpine:latest",
    command='echo "Containers are cool"',
    auto_remove=True,
)
```
List two situations where KubernetesPodOperator is a better fit than DockerOperator.
DockerOperator = one kitchen in your house. KubernetesPodOperator = a cloud kitchen service that has many kitchens and picks one for you. Use the cloud when you have lots of orders or many different "recipes" and don't want to manage the kitchens yourself.
1. Airflow is already running on Kubernetes (e.g. Helm, Astronomer) — Pods run natively in the same cluster.
2. You need automatic scaling or resource limits (CPU/memory) per task, or many concurrent tasks that shouldn't overload a single Docker host. Kubernetes schedules Pods across nodes and enforces limits.