MODULE 12 OF 15

Containers & Airflow

Run your Airflow tasks in Docker and Kubernetes for isolation, reproducibility, and scale.

Why Containers with Airflow?

Your Airflow workers usually run on the same machine (or a fixed set of machines). But sometimes you need a task to run in a completely isolated environment — with its own Python version, libraries, or even a different OS. That's where containers shine.

Explain Like I'm 5

Imagine Airflow is a teacher assigning homework. Usually kids do homework at their desk (same room = same worker). But sometimes the teacher says: "This assignment must be done in the science lab" or "in the computer room." Containers are like those special rooms — each has exactly the tools and environment needed for that one task, and when the task is done, the room is cleaned up. No mess left on the main desk!

Containers give you:

  • Isolation — Task A's dependencies don't clash with Task B's.
  • Reproducibility — "Runs the same everywhere" — dev, staging, prod.
  • Flexibility — Run a data science image here, a Node.js script there, all from one DAG.

Docker Basics (What You Need to Know)

Docker lets you package an app and its dependencies into a single unit called an image. You run that image as a container — a lightweight, isolated process. For Airflow, you only need a few ideas:

Docker Concepts in 3 Steps

1. Image: A read-only template (like a recipe). Example: python:3.9-slim or mycompany/etl-job:latest. You build or pull images; you don't run them directly.

2. Container: A running instance of an image. When Airflow runs a Docker task, it starts a container from an image, runs a command inside it, then stops the container. One task = one (or more) containers.

3. Docker daemon: The Docker engine that runs on the host. Airflow's worker must have access to Docker (or to a Kubernetes cluster) to run container-based operators.

Key Takeaway

For Airflow you don't need to be a Docker expert. You need: an image name, an optional command, and (for DockerOperator) a way for the worker to talk to Docker. That's enough to run tasks in containers.

DockerOperator

The DockerOperator runs a Docker container as an Airflow task. You specify which image to use and what command to run inside the container. Airflow starts the container, waits for it to finish, and marks the task as success or failure based on the container's exit code (zero means success).

  • Use when: You have Docker on the same host as your Airflow worker (or the worker can reach a Docker daemon).
  • Good for: Single-machine setups, dev/staging, or when you don't need Kubernetes.
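The exit-code rule above is worth internalizing: the container's process is the task. A stdlib-only sketch of the same convention, using a plain subprocess to stand in for a container (the function name is ours, not Airflow's):

```python
import subprocess
import sys

def state_from_exit_code(argv: list[str]) -> str:
    """Run a 'task' and map its exit code to an Airflow-style state,
    the same convention DockerOperator applies to a container."""
    result = subprocess.run(argv)
    return "success" if result.returncode == 0 else "failed"

# A process that exits 0 maps to success; a non-zero exit maps to failed.
print(state_from_exit_code([sys.executable, "-c", "print('hello')"]))      # success
print(state_from_exit_code([sys.executable, "-c", "import sys; sys.exit(1)"]))  # failed
```

This is why container entrypoints should exit non-zero on error instead of swallowing exceptions: otherwise Airflow will mark the task green even though the work failed.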

Requirement

The Airflow worker must have the Docker Python library installed and permission to talk to the Docker daemon (e.g. the worker runs in a container that mounts the Docker socket, or Docker is installed on the worker host).
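A typical setup on a pip-managed worker host might look like the following (package name is the official Docker provider; the socket check assumes the default local daemon — verify both against your installation):

```shell
# Install the Docker provider (pulls in the docker Python SDK)
pip install apache-airflow-providers-docker

# Verify the worker's user can reach the Docker daemon;
# this should print daemon info, not a permission error
docker info
```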

KubernetesPodOperator

The KubernetesPodOperator creates a Pod (one or more containers) in a Kubernetes cluster, runs your task inside it, and then tears the Pod down. Kubernetes schedules the Pod onto any available node; you don't manage servers yourself.

  • Use when: You're running Airflow on Kubernetes (e.g. Astronomer, Helm chart) or your workers can authenticate to a K8s cluster.
  • Good for: Scale, resource limits, multi-team isolation, and "run anywhere" portability.

Explain Like I'm 5

DockerOperator is like asking your neighbor's kitchen to bake one cake: you send the recipe (image + command) and they give you the result. KubernetesPodOperator is like a cloud kitchen: you send the same recipe to "the cloud kitchen" and they find any free kitchen (node), bake the cake, and then close that kitchen. You don't care which kitchen it was — you just get the cake.

When to Use Which?

Scenario | Use | Why
Local dev, single server | DockerOperator | Simple; no K8s cluster needed
Airflow on Kubernetes (e.g. Astronomer, Helm) | KubernetesPodOperator | Native; Pods run in the same cluster
Heavy or variable workloads | KubernetesPodOperator | K8s scales nodes and schedules Pods
Strict dependency isolation per task | Either | Both run tasks in isolated containers
No Docker/K8s in your stack | BashOperator / PythonOperator | Containers are optional

Rule of Thumb

If you're already on Kubernetes, prefer KubernetesPodOperator. If you're on VMs or a single server with Docker, use DockerOperator. Both give you isolation and reproducibility; K8s adds orchestration and scaling.

Docker vs Kubernetes Flow

How a single Airflow task flows when using DockerOperator vs KubernetesPodOperator:

DockerOperator flow

Airflow Scheduler → Worker → Docker daemon → Container (task)

Task runs in a container on the same host (or a host reachable by the worker).

KubernetesPodOperator flow

Airflow Scheduler → Worker (K8s client) → Kubernetes API → Pod on any node

Task runs in a Pod; K8s picks the node and cleans up when done.


Docker: worker talks to Docker on one host. Kubernetes: worker talks to API; cluster runs Pod anywhere.

DockerOperator Example

Run a Python script inside a container using the official Python image:

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from datetime import datetime

with DAG(
    dag_id="docker_example_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    run_in_docker = DockerOperator(
        task_id="run_in_docker",
        image="python:3.9-slim",
        command="python -c \"print('Hello from inside Docker!')\"",
        auto_remove=True,
        docker_url="unix:///var/run/docker.sock",
    )

Parameters:

  • image: which image to run.
  • command: what to run inside the container.
  • auto_remove=True: delete the container after it finishes.
  • docker_url: how the worker reaches Docker (default is the local socket).

KubernetesPodOperator Example

Run the same logic as a Pod in Kubernetes:

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from datetime import datetime

with DAG(
    dag_id="k8s_example_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    run_in_k8s = KubernetesPodOperator(
        task_id="run_in_k8s",
        namespace="default",
        image="python:3.9-slim",
        cmds=["python", "-c"],
        arguments=["print('Hello from inside Kubernetes!')"],
        name="airflow-k8s-task",
        get_logs=True,
        is_delete_operator_pod=True,
    )

Parameters:

  • namespace: where to create the Pod.
  • image: container image.
  • cmds + arguments: command and args for the container.
  • get_logs=True: stream Pod logs into the Airflow task log.
  • is_delete_operator_pod=True: delete the Pod when the task finishes.

Note

Airflow must run with a Kubernetes connection configured (e.g. in-cluster config or kubeconfig) so the worker can create Pods. Install the provider: pip install apache-airflow-providers-cncf-kubernetes.

Practice Exercises

Work through these to solidify running containers from Airflow; an answer follows each exercise.

Exercise 1: Why Containers?

Scenario

Your DAG has Task A (needs Python 3.8 + pandas 1.2) and Task B (needs Python 3.11 + pandas 2.0). Both run on the same Airflow worker. What problem can this cause, and how do containers help?

Explain Like I'm 5

Two kids need different toys in the same room: one needs Legos, the other needs Play-Doh. If you only have one table, they fight. Containers are like giving each kid their own room with exactly their toys — no fighting!

Answer

Problem: Conflicting dependencies on one worker (different Python or library versions) can cause import errors or wrong behavior.
Containers help: Run Task A in a container with Python 3.8 + pandas 1.2, and Task B in another container with Python 3.11 + pandas 2.0. Each task gets an isolated environment.

Exercise 2: Docker vs K8s Choice

Scenario

Your company runs Airflow on a single VM with Docker installed. You need one task to run a custom ETL image. Should you use DockerOperator or KubernetesPodOperator? Why?

Answer

Use DockerOperator. You have Docker on the VM and no Kubernetes cluster. DockerOperator talks directly to the Docker daemon on that host. KubernetesPodOperator would require a K8s cluster and extra configuration with no benefit in this setup.

Exercise 3: Write a DockerOperator Task

Challenge

Write a single task that uses DockerOperator to run the image alpine:latest with the command echo "Containers are cool". Task id: alpine_echo.

# Solution:
from airflow.providers.docker.operators.docker import DockerOperator

alpine_echo = DockerOperator(
    task_id="alpine_echo",
    image="alpine:latest",
    command='echo "Containers are cool"',
    auto_remove=True,
)

Exercise 4: When to Use KubernetesPodOperator

Scenario

List two situations where KubernetesPodOperator is a better fit than DockerOperator.

Explain Like I'm 5

DockerOperator = one kitchen in your house. KubernetesPodOperator = a cloud kitchen service that has many kitchens and picks one for you. Use the cloud when you have lots of orders or many different "recipes" and don't want to manage the kitchens yourself.

Answer

1. Airflow is already running on Kubernetes (e.g. Helm, Astronomer) — Pods run natively in the same cluster.
2. You need automatic scaling or resource limits (CPU/memory) per task, or many concurrent tasks that shouldn't overload a single Docker host. Kubernetes schedules Pods across nodes and enforces limits.

Module 12 Quiz

Test your understanding!

1. What is the main benefit of running an Airflow task in a container?

2. DockerOperator runs the task by ___?

3. KubernetesPodOperator runs the task by ___?

4. When is DockerOperator a better choice than KubernetesPodOperator?

5. In Docker, an "image" is best described as ___?

6. What must the Airflow worker have to use DockerOperator?

7. KubernetesPodOperator is a better fit when ___?

8. What does "is_delete_operator_pod=True" do in KubernetesPodOperator?