Ever wondered how data travels from your PostgreSQL database to Redshift dashboards? It's not magic; it's distributed computing. We'll explain it like you're five.
Think of a big kitchen with many chefs: each works on their own dish, but together they create a feast. That's distributed computing.
In data pipelines: PostgreSQL holds raw data, Airflow coordinates who does what, DBT transforms on one or many workers, S3 stores intermediate data, and Redshift runs queries across many nodes. Each "chef" is a separate computer or service; they don't share a single kitchen (memory). They pass "dishes" (data) via the network.
Watch the animation below: each droplet is a batch of data moving through the pipeline.
Data doesn't fit on one machine. And even if it did, you'd want speed and resilience.
Imagine a library with 10 million books. One librarian can't read and sort them all. So you hire many librarians: each takes one section, sorts it in parallel, and a shared catalog combines their work.
Same idea in pipelines: Redshift and Spark split data across nodes. Each node processes a chunk. Results are combined. That's distributed processing.
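The split-process-combine pattern can be sketched in a few lines. This is a toy: threads stand in for nodes, while Redshift and Spark really split data across separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "node" aggregates its own chunk independently.
    return sum(chunk)

def distributed_sum(records, n_nodes=4):
    # Split the data: every node gets a slice, like a distribution key.
    chunks = [records[i::n_nodes] for i in range(n_nodes)]
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partials)

print(distributed_sum(list(range(1_000_000))))  # 499999500000
```

The important part is the shape: no chunk depends on another, so all four "nodes" can work at once, and the combine step is cheap.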
Distributed = resilience + speed
The PDF lists key benefits of distributed systems. Here's how they show up in pipelines:
Machines don't share a clock. So how do we know "what happened before what"? Enter logical time.
In pipelines: Airflow uses execution_date and task instance IDs to order runs. Kafka uses offsets within a partition: message 5 comes before message 6. DBT uses model dependencies (staging runs before marts). We care about order, not "what time is it on server X."
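Logical time can be sketched with a Lamport clock. This is a minimal illustration of the idea, not how Airflow or Kafka actually implement ordering.

```python
class LamportClock:
    """Minimal Lamport logical clock (a sketch of the idea)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the logical clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Stamp the outgoing message with the sender's clock.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so causality is preserved.
        self.time = max(self.time, msg_time) + 1
        return self.time

producer, consumer = LamportClock(), LamportClock()
t_send = producer.send()           # producer emits a record: t_send == 1
t_recv = consumer.receive(t_send)  # consumer sees it:        t_recv == 2
assert t_send < t_recv             # order without a shared wall clock
```

No matter how skewed the two machines' wall clocks are, the receive is always stamped after the send, which is all the ordering a pipeline needs.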
How do you get a "picture" of the whole system at one moment? The Chandy–Lamport snapshot algorithm is the classic answer.
A snapshot is consistent if it looks like everything froze at one instant: no "message half in flight" or "partially applied" state. In pipelines, the same rule applies to any point-in-time view: a backup is only restorable if the table states and the queue offsets it captures belong to the same moment, otherwise replay loses or duplicates records.
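A toy demonstration of why in-flight messages matter, with hypothetical account balances. Recording the channel alongside the node states is exactly what Chandy–Lamport's channel recording adds.

```python
# Two "nodes" hold balances; a transfer is a message in flight between them.
accounts = {"A": 100, "B": 50}
in_flight = []  # the "channel"

def send_transfer(src, dst, amount):
    accounts[src] -= amount
    in_flight.append((dst, amount))  # left A, not yet arrived at B

def deliver_one():
    dst, amount = in_flight.pop(0)
    accounts[dst] += amount

send_transfer("A", "B", 30)

# Naive snapshot: read node states only, ignore the channel.
naive_total = sum(accounts.values())
assert naive_total == 120  # 30 is "half in flight": inconsistent!

# Consistent snapshot: record node states AND in-flight messages.
consistent_total = sum(accounts.values()) + sum(a for _, a in in_flight)
assert consistent_total == 150

deliver_one()
assert sum(accounts.values()) == 150  # money conserved once delivered
```

The naive snapshot "loses" 30 units because it froze the two nodes at different logical moments; the consistent one accounts for the message on the wire.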
FIFO, causal, total order, and why Kafka partitions matter.
The PDF defines several message ordering paradigms. Here's the pipeline version:
| Order | Meaning | Pipeline Example |
|---|---|---|
| FIFO | Same sender → same order | Kafka: messages in one partition are FIFO |
| Causal | If A caused B, B sees A first | Event ordering: click → add_to_cart → purchase |
| Total | Everyone agrees on one global order | A single Kafka partition: every consumer sees the same sequence |
Imagine a queue at a coffee shop. FIFO = you get served in the order you arrived. Causal = your friend's order (which you inspired) comes after yours. Total = everyone agrees "person 3 was served before person 4." Kafka partitions give you FIFO per key (e.g. per customer_id).
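The "FIFO per key" guarantee comes from routing: the same key always hashes to the same partition, and each partition is an ordered log. A toy sketch of that routing (not the real Kafka client, which uses murmur2 hashing):

```python
from collections import defaultdict

NUM_PARTITIONS = 3
partitions = defaultdict(list)  # partition id -> ordered log of messages

def produce(key, value):
    # Same key -> same partition, so events for one key stay in order.
    p = hash(key) % NUM_PARTITIONS
    partitions[p].append((key, value))
    return p

for event in ["click", "add_to_cart", "purchase"]:
    p = produce("customer_42", event)

# All of customer_42's events sit in one partition, in send order.
assert [v for k, v in partitions[p] if k == "customer_42"] == [
    "click", "add_to_cart", "purchase"
]
```

Events for different customers may land in different partitions with no order between them, and that's fine: per-customer order is the guarantee the funnel analysis needs.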
The PDF compares two communication paradigms: shared memory and message passing. Our pipelines use message passing:
We use message-passing because our nodes are geographically separate. S3 PUT, Kafka produce, PostgreSQL INSERT: these are all "send" operations. The receiver (next stage) does a "receive" (S3 GET, Kafka consume, SELECT). Blocking vs non-blocking? Airflow tasks are blocking: a task doesn't finish until its work is done. Kafka consumers can be non-blocking (poll, process async).
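The blocking vs non-blocking distinction can be shown with Python's in-process queue standing in for a topic. This is a toy analogue, not the Kafka API:

```python
import queue

q = queue.Queue()  # stands in for a Kafka topic / S3 bucket

# Non-blocking receive: return immediately if nothing is there.
try:
    q.get(block=False)
    got_message = True
except queue.Empty:
    got_message = False  # nothing yet; go do other work
assert got_message is False

# "Send" (produce), then a blocking receive that waits for the message.
q.put({"order_id": 1})
msg = q.get(block=True, timeout=1)
assert msg["order_id"] == 1
```

A blocking consumer is simpler to reason about (like an Airflow task that runs to completion); a non-blocking one keeps the process free to poll several sources at once.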
The PDF defines granularity as the ratio of computation to communication. For pipelines:
Loosely coupled systems (our pipelines: Postgres in Mumbai, S3 in us-east-1) work best with coarse-grained tasks. Fine-grained would drown in network latency. That's why we batch: daily DBT runs, hourly Airflow DAGs, not per-row RPCs.
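The batching argument is just arithmetic. A sketch with assumed numbers (100 ms of network latency per call, one hour of total computation):

```python
def total_time(n_tasks, work_per_task_s, network_latency_s):
    # Each task pays one round of network overhead on top of its work.
    return n_tasks * (work_per_task_s + network_latency_s)

WORK = 3600.0   # one hour of total computation, in seconds (assumed)
LATENCY = 0.1   # 100 ms per network round trip (assumed)

fine = total_time(1_000_000, WORK / 1_000_000, LATENCY)  # per-row RPCs
coarse = total_time(24, WORK / 24, LATENCY)              # hourly batches

assert coarse < fine  # ~3,602 s vs ~103,600 s: latency dominates fine-grained
```

Same hour of real work either way; the fine-grained version spends nearly 28 extra hours just waiting on the network. That's the whole case for coarse-grained, batched tasks in loosely coupled systems.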
Every tool in our pipeline-designer is part of the distributed system. Click to see each one's role!
Click the "Play" button to animate the flow. Watch data move through the pipeline!
Each step runs on different machines or services. Data is passed via network calls, file uploads, or SQL queries.
In distributed systems, something will eventually fail. The good news: the pipeline keeps going.
Click to simulate a failure and see how the system responds:
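Most of this resilience boils down to one pattern: retry with exponential backoff. A minimal sketch (the FlakyExtract task is hypothetical; in Airflow the `retries` and `retry_delay` task settings do this for you):

```python
import time

class FlakyExtract:
    """Hypothetical task that fails twice, then succeeds."""

    def __init__(self):
        self.attempts = 0

    def __call__(self):
        self.attempts += 1
        if self.attempts < 3:
            raise ConnectionError("network blip")
        return "rows"

def run_with_retries(task, max_retries=5, base_delay=0.01):
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except ConnectionError:
            if attempt == max_retries:
                raise  # give up: surface the failure to the scheduler
            # Wait longer after each failure (exponential backoff).
            time.sleep(base_delay * 2 ** (attempt - 1))

task = FlakyExtract()
assert run_with_retries(task) == "rows"
assert task.attempts == 3  # two failures, one success
```

Backoff matters because retrying instantly against an overloaded service just makes the overload worse; spacing retries out gives the failing component room to recover.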
The PDF has a whole chapter on termination detection: how do we know all distributed processes have finished? In pipelines: Airflow marks a DAG run complete only when every task instance has reached a terminal state, and DBT finishes when every model in the dependency graph has been built.
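The "is everything done?" check can be sketched as an outstanding-work counter. This is a simplification of what a scheduler tracks, not Airflow's actual implementation; the subtlety is that a completing task may spawn more work:

```python
class Scheduler:
    """Toy termination detector: count outstanding task instances."""

    def __init__(self):
        self.outstanding = 0
        self.done = []

    def submit(self, name):
        self.outstanding += 1
        return name

    def complete(self, name):
        self.outstanding -= 1
        self.done.append(name)
        # A finishing task can submit new work, so "counter hit zero
        # once" is not enough; it must be zero with nothing spawnable.
        if name == "transform":
            self.submit("load")

    def terminated(self):
        return self.outstanding == 0

sched = Scheduler()
for t in ("extract", "transform"):
    sched.submit(t)

sched.complete("extract")
sched.complete("transform")    # completing "transform" spawns "load"
assert not sched.terminated()  # still one task outstanding
sched.complete("load")
assert sched.terminated()      # now the run is truly finished
```

This is why "all the tasks I launched are done" is weaker than real termination: the run is only over when no finished task can create more work.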
The PDF covers checkpoint-based and log-based recovery. In pipelines: DBT incremental models use a unique_key plus a max(updated_at) filter to only process new rows. The checkpoint is "we've processed up to this point," so a restart resumes from there instead of reloading everything.

Test what you've learned. Pick the best answer!