Data Pipelines Made Simple

Distributed Computing for Pipelines

Ever wondered how data travels from your PostgreSQL database to Redshift dashboards? It’s not magic: it’s distributed computing. We’ll explain it like you’re five.

PostgreSQL → Airflow → DBT → S3 → Redshift → Metabase
1. What is Distributed Computing?

Think of a big kitchen with many chefs: each works on their own dish, but together they create a feast. That's distributed computing.

The Kitchen Analogy

📖 Classic Definition (from the textbook)
"You know you are using [a distributed system] when the crash of a computer you have never heard of prevents you from doing work." – Leslie Lamport
👨‍🍳
One chef (single machine): Makes one dish at a time. If the oven breaks, everything stops.

Many chefs (distributed): Chef A makes salad, Chef B grills meat, Chef C bakes dessert, all at once. If Chef B's grill fails, the salad and dessert still get done. The feast continues!

In data pipelines: PostgreSQL holds raw data, Airflow coordinates who does what, DBT transforms on one or many workers, S3 stores intermediate data, Redshift runs queries across many nodes. Each "chef" is a separate computer or service; they don't share a single kitchen (memory). They pass "dishes" (data) via the network.

Three Rules of Distributed Systems

  1. No shared memory: Each machine has its own RAM and disk. Data moves by sending messages (files, API calls, streams) over the network. That's why we use S3, Kafka, or databases: they're the "messengers."
  2. No single clock: Machines can't perfectly sync time. We use "logical" ordering (who ran first?) instead of "what time is it?". Airflow's task IDs and timestamps help with this.
  3. Geographic separation: Your source DB might be in Mumbai, S3 in us-east-1, Redshift somewhere else. Data travels across cables and clouds. That's distribution!
🌊

Data Flows Like a River

Picture each droplet as a batch of data moving through the pipeline:

Source → Airflow → DBT → Warehouse → Analytics

Each droplet = a batch of records flowing through the distributed pipeline

2. Why Pipelines Need Distributed Computing

Data doesn't fit on one machine. And even if it did, you'd want speed and resilience.

Data is Too Big for One Box

Layman Example

Imagine a library with 10 million books. One librarian can't read and sort them all. So you have:

  • Librarian A: Reads from Shelves 1–100
  • Librarian B: Reads from Shelves 101–200
  • Librarian C: Writes the sorted catalog

Same idea in pipelines: Redshift and Spark split data across nodes. Each node processes a chunk. Results are combined. That's distributed processing.
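The librarian idea can be sketched in a few lines. This is a toy scatter/gather, where an in-process thread pool stands in for real cluster nodes and the `count_titles` helper is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def count_titles(shelf_chunk):
    # One "librarian": catalogs only its own shelves.
    return {title: len(title) for title in shelf_chunk}

def catalog(shelves, workers=3):
    # Scatter: split the shelves into one chunk per librarian.
    chunks = [shelves[i::workers] for i in range(workers)]
    combined = {}
    # Gather: process the chunks in parallel, then merge the partial catalogs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_titles, chunks):
            combined.update(partial)
    return combined
```

Redshift and Spark apply the same split-process-combine pattern across many machines; here the thread pool is just a stand-in for the cluster.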

Single Machine vs Distributed Pipeline:

  • Single machine: everything in one box (Extract, Transform). ❌ One failure = all stops.
  • Distributed pipeline: many nodes, coordinated (Postgres, Airflow, DBT, Redshift). ✅ One fails → others continue.

Distributed = resilience + speed

Scalability & Modularity (From the Textbook)

The PDF lists key benefits of distributed systems. Here's how they show up in pipelines:

  • Scalability: Add more Airflow workers or Redshift nodes; no single bottleneck. Data volume grows? Scale out.
  • Modularity: Swap PostgreSQL for MySQL, or Redshift for Snowflake. As long as they speak the same "language" (SQL, JDBC), the rest of the pipeline keeps working.
  • Resource sharing: One S3 bucket or Redshift cluster serves many pipelines. You don't replicate everything everywhere.
  • Inherently distributed: Money transfers, CDC from multiple regions; the work is naturally spread across locations.
⏱️

Logical Time & Ordering

Machines don't share a clock. So how do we know "what happened before what"? Enter logical time.

The "Happens-Before" Idea

📬
If you send a letter, it must have been written before it arrived. We don't need wall clocks; we use causal order: "A happened before B" means A could have influenced B.

In pipelines: Airflow uses execution_date and task instance IDs to order runs. Kafka uses offsets within a partition: message 5 comes before message 6. DBT uses model dependencies (staging runs before marts). We care about order, not "what time is it on server X."
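The happens-before idea is exactly what a Lamport clock captures. A minimal sketch, not tied to any particular tool:

```python
class LamportClock:
    """Logical time: order events by causality, not by wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: just advance our own counter.
        self.time += 1
        return self.time

    def send(self):
        # Stamp the outgoing message with our current logical time.
        return self.tick()

    def receive(self, msg_time):
        # On receive, jump past the sender's stamp: the send
        # "happened before" this receive.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

However far apart two machines' wall clocks drift, the receiver's logical time always lands after the sender's, which is the same guarantee Kafka offsets and Airflow run ordering give you.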

📸

Global Snapshots & Consistent Reads

How do you get a "picture" of the whole system at one moment? The Chandy–Lamport snapshot algorithm is the classic answer.

What's a Consistent Snapshot?

A snapshot is consistent if it looks like everything froze at one instant: no "message half in flight" or "partially applied" state. In pipelines:

  • Bronze layer: Raw data landed from the source at a specific time. It's a point-in-time "photo" of the source.
  • CDC (Change Data Capture): Reads the source's transaction log at a specific LSN (Log Sequence Number); that's a consistent snapshot.
  • Redshift / Snowflake: Time-travel queries let you "go back" to a consistent state.
🔑 From the PDF
The Chandy–Lamport algorithm uses special "marker" messages to record the state of each process and the messages in flight. Our pipelines achieve similar guarantees via transaction boundaries, LSNs, and immutable Bronze loads.
📋

Message Ordering

FIFO, causal, total order, and why Kafka partitions matter.

Ordering Guarantees in Pipelines

The PDF defines several message ordering paradigms. Here's the pipeline version:

Order  | Meaning                       | Pipeline Example
FIFO   | Same sender → same order      | Kafka: messages in one partition are FIFO
Causal | If A caused B, B sees A first | Event ordering: click → add_to_cart → purchase
Total  | Everyone agrees on one order  | Single-partition Kafka topic (global order, no parallelism)
Layman Example

Imagine a queue at a coffee shop. FIFO = you get served in the order you arrived. Causal = your friend's order (which you inspired) comes after yours. Total = everyone agrees "person 3 was served before person 4." Kafka partitions give you FIFO per key (e.g. per customer_id).
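"FIFO per key" fits in a tiny sketch. Here a CRC32 hash stands in for Kafka's real partitioner, and the event tuples are invented for the example:

```python
import zlib
from collections import defaultdict

def partition_for(key, num_partitions=4):
    # Same key -> same hash -> same partition, every time.
    return zlib.crc32(key.encode()) % num_partitions

def route(events, num_partitions=4):
    # events: (customer_id, event_name) tuples, in arrival order.
    # Within one partition, arrival order is preserved (FIFO).
    partitions = defaultdict(list)
    for key, event in events:
        partitions[partition_for(key, num_partitions)].append((key, event))
    return partitions
```

Because every event for a given customer_id lands in the same partition, that customer's events always replay in the order they were produced, even while other partitions are consumed in parallel.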

Message-Passing vs Shared Memory

The PDF compares two paradigms. Our pipelines use message-passing:

  • Shared memory: All processes read/write the same RAM. (Think: one big whiteboard everyone can see.)
  • Message-passing: Data is sent as messages: files to S3, rows to a DB, events to Kafka. (Think: passing notes.)

We use message-passing because our nodes are geographically separate. S3 PUT, Kafka produce, PostgreSQL INSERT: these are all "send" operations. The receiver (next stage) does a "receive" (S3 GET, Kafka consume, SELECT). Blocking vs non-blocking? Airflow tasks are blocking: a task doesn't finish until its work is done. Kafka consumers can be non-blocking (poll, process async).
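The "passing notes" model in miniature. A sketch using in-process queues as stand-ins for S3 or Kafka; the `str.upper` transform is arbitrary:

```python
import queue
import threading

def run_stage(inbox, outbox, transform):
    # One pipeline stage: blocking receive -> process -> blocking send.
    # Stages share no state; only messages cross the boundary.
    while True:
        record = inbox.get()            # blocks until a message arrives
        if record is None:              # sentinel: upstream is finished
            outbox.put(None)
            return
        outbox.put(transform(record))

# Wire an "extract" queue into a transform stage feeding a "load" queue.
extract_q, load_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=run_stage,
                          args=(extract_q, load_q, str.upper))
worker.start()
for row in ["alpha", "beta", None]:     # None terminates the stage
    extract_q.put(row)
worker.join()
```

Swap the queues for an S3 bucket or a Kafka topic and the shape is the same: the stage only ever sees messages, never another stage's memory.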

Granularity: Coarse vs Fine

The PDF defines granularity as the ratio of computation to communication. For pipelines:

  • Coarse-grained: Each task does a lot of work and talks to others rarely. Example: "Extract 1M rows from Postgres, load to S3": one big batch, minimal back-and-forth.
  • Fine-grained: Lots of small messages, frequent sync. Example: Real-time stream processing where every event triggers a micro-update.

Loosely coupled systems (our pipelines: Postgres in Mumbai, S3 in us-east-1) work best with coarse-grained tasks. Fine-grained would drown in network latency. That's why we batch: daily DBT runs, hourly Airflow DAGs, not per-row RPCs.
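The batching argument, in code. A sketch where `send` stands in for any network call (an S3 upload, an INSERT); the function names are made up:

```python
def ship_fine_grained(rows, send):
    # Fine-grained: one network call per row. Chatty and latency-bound.
    for row in rows:
        send([row])

def ship_coarse_grained(rows, send, batch_size=1000):
    # Coarse-grained: one network call per batch. Computation dominates
    # communication, which suits loosely coupled, far-apart nodes.
    for start in range(0, len(rows), batch_size):
        send(rows[start:start + batch_size])
```

With 2,500 rows and, say, 100 ms of network latency per call, the fine-grained version pays that latency 2,500 times; the coarse-grained version pays it 3 times.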

3. Your Tools in Action

Every tool in our pipeline-designer is part of the distributed system. Click to see each one's role!

  • 🌀 Airflow: Orchestrator; tells everyone what to do and when
  • 🔄 DBT: Transform; cleans and shapes data (runs on the warehouse)
  • 🪣 S3: Storage; holds data between stages (distributed by design)
  • 📡 Kafka: Streaming; moves events in real time across many consumers
  • 🏗️ Redshift: Warehouse; queries run in parallel across nodes
  • 🐘 PostgreSQL: Source; where raw data lives (can be one or many)
  • 📊 Metabase: Analytics; reads from the warehouse (distributed queries)
4. How Data Flows: Step by Step

Interactive Flow: click "Start" to animate the flow and watch data move through the pipeline!

PostgreSQL → Airflow → DBT → S3 / Bronze → Silver → Gold / Redshift → Metabase

Each step runs on different machines or services. Data is passed via network calls, file uploads, or SQL queries.

5. When Things Break: Fault Tolerance

In distributed systems, something will eventually fail. The good news: the pipeline keeps going.

What Happens When a Node Fails?

🔑 Key Idea
Airflow retries failed tasks. DBT can re-run from the last successful model. S3 stores redundant copies of every object. Redshift keeps multiple copies of data across nodes. Distributed = built-in redundancy.

Click to simulate a failure and see how the system responds:

✅ Recovery: Airflow detected the failure, retried the DBT task, and it succeeded on attempt 2. Data continued flowing to Redshift. No manual intervention needed!
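The retry behaviour above, reduced to a sketch. The parameter names echo Airflow's `retries` and `retry_delay` concepts, but this is illustrative code, not Airflow itself:

```python
import time

def run_with_retries(task, retries=3, delay=1.0, backoff=2.0):
    # Try the task up to `retries` times, sleeping a little longer
    # after each failure (exponential backoff).
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise              # out of attempts: surface the error
            time.sleep(delay)
            delay *= backoff
```

A transient failure (network blip, warehouse lock) simply costs one extra attempt; only a persistent failure bubbles up for a human to look at.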

Termination Detection: When is the DAG "Done"?

The PDF has a whole chapter on termination detection: how do we know all distributed processes have finished? In pipelines:

  • Airflow: A DAG run is "success" only when all downstream tasks have completed. The scheduler tracks task states (queued, running, success, failed). No orphan tasks = terminated.
  • DBT: Runs models in dependency order. Exit code 0 means all models ran successfully β€” that's termination.
  • Kafka consumers: "Lag" = messages not yet processed. Lag = 0 means the consumer has caught up (termination for that run).
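The Airflow rule above fits in a few lines. A sketch of the decision the scheduler makes; the state names mirror Airflow's, but the function is ours:

```python
def dag_run_state(task_states):
    # task_states: task_id -> "queued" | "running" | "success" | "failed"
    # Not terminated while anything is still pending or in flight.
    if any(s in ("queued", "running") for s in task_states.values()):
        return "running"
    # Terminated: the run succeeds only if every task succeeded.
    return "failed" if "failed" in task_states.values() else "success"
```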

Checkpointing & Rollback Recovery

The PDF covers checkpoint-based and log-based recovery. In pipelines:

  • Checkpointing: DBT incremental models use a unique_key plus a max(updated_at) filter to process only new rows. The checkpoint is "we've processed up to this point."
  • Rollback: Airflow retries a failed task from scratch. DBT can re-run from a specific model. Redshift/Snowflake support time-travel: restore to a prior state.
  • Idempotency: Running the same pipeline twice should produce the same result. That's the pipeline version of "recoverable": no duplicate side effects.
🔑 From the PDF
Koo–Toueg coordinated checkpointing ensures a consistent global state before saving. Our pipelines achieve idempotency via unique keys, upserts, and "replace" semantics so re-runs don't corrupt data.
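Checkpoint plus idempotent upsert, in miniature. A sketch assuming rows are dicts with `id` and `updated_at` fields (the field names are invented; DBT expresses the same logic in SQL):

```python
def incremental_load(target, source_rows, checkpoint):
    """Process only rows newer than the checkpoint; upsert by key."""
    for row in source_rows:
        if row["updated_at"] > checkpoint["max_updated_at"]:
            # Upsert keyed on id: re-running never duplicates a row.
            target[row["id"]] = row
    # Advance the checkpoint only after the batch has been applied.
    if source_rows:
        checkpoint["max_updated_at"] = max(
            checkpoint["max_updated_at"],
            max(r["updated_at"] for r in source_rows))
    return target, checkpoint
```

Running it twice with the same rows changes nothing, which is exactly the "recoverable, no duplicate side effects" property described above.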
6. Quick Quiz

Test what you've learned. Pick the best answer!

In a distributed pipeline, why don't we use a "single shared memory"?

Machines are in different locations and can't share RAM; data must move over the network.
We could, but it would be too fast and overwhelm users.
Shared memory is illegal in data engineering.

What is Airflow's main job in a distributed pipeline?

To store the raw data in S3.
To orchestrate β€” decide when each task runs and in what order.
To run SQL transformations itself.

Why does Kafka offer FIFO order only within a partition?

Because total order across all partitions would be too slow.
Because partitioning allows parallel consumption β€” each partition is a separate ordered stream. Total order would require a single bottleneck.
Kafka doesn't offer FIFO; it's random order.

Design Your Own Distributed Pipeline

Use our Pipeline Designer to pick sources, destinations, and compute strategies. See how your choices map to real distributed architecture.

Open Pipeline Designer