Module 4 — Core Mechanism

The Attention Mechanism

The single most important idea behind modern LLMs. Learn how models decide what to focus on — with analogies, math, and code.

Why Attention?

Before Attention was invented, sequence models (RNNs) had a serious flaw: they forgot things. The longer the sentence, the worse they performed.

ELI5

Imagine reading a whole book, but you can only remember the last sentence you read. That's an RNN. Now imagine you can flip back to any page whenever you need to — that's Attention!

The Keyhole Analogy

An RNN looks at a whole room through a tiny keyhole — it only sees what's directly in front. Attention gives you a wide-open door: you can see the entire room at once and choose where to look.

RNN vs Attention — How Information Flows

Diagram: RNN (serial bottleneck) processes "The cat sat down" one word at a time, and its memory of earlier words fades (100% → 70% → 30% → 10%). Attention (direct access) can focus on any word; every word is accessible at once.

Key Takeaway

  • RNNs process words one-by-one — early words get "washed out"
  • Attention lets the model look at all words simultaneously
  • This is why Transformers (which use Attention) dominate NLP

Query, Key, Value

Attention works through three vectors for every word: a Query, a Key, and a Value. Together they answer one question: "How much should I pay attention to each of the other words?"

ELI5

You walk into a library with a question (Query). Every book has a title (Key). You compare your question to each title — the better the match, the more you read that book's contents (Value).

Library Analogy — Step by Step

Query = "I want to learn about cats" (your question)

Keys = ["Animal Behavior", "Quantum Physics", "Cat Care Guide"] (book titles)

Values = [actual content of each book]

Your query matches "Cat Care Guide" best → you read mostly that book's content.
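
The library analogy translates directly into a few lines of code. Below is a minimal sketch with made-up numbers: one query vector is compared against three key vectors by dot product, softmax turns the match scores into percentages, and those percentages blend the value vectors into a single output.

Python
import torch

# One query, three keys, three values (toy numbers, purely illustrative)
query = torch.tensor([1.0, 0.0, 1.0])              # "I want to learn about cats"
keys = torch.tensor([[1.0, 1.0, 0.0],              # "Animal Behavior"  (partial match)
                     [0.0, 1.0, 0.0],              # "Quantum Physics"  (poor match)
                     [1.0, 0.0, 1.0]])             # "Cat Care Guide"   (close match)
values = torch.tensor([[ 5.0,  5.0],               # the "contents" behind each title
                       [ 0.0, 10.0],
                       [10.0,  0.0]])

scores = keys @ query                   # how well does each key match the query?
weights = torch.softmax(scores, dim=0)  # turn scores into percentages that sum to 1
output = weights @ values               # blend the values by those percentages

print(weights)  # about [0.24, 0.09, 0.67]: the best-matching book dominates
print(output)   # the output is pulled toward that book's content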

How Q, K, V Work Together

Query (Q) "What am I looking for?" Key 1 Key 2 Key 3 Scores 0.1 0.7 0.2 softmax → Value 1 Value 2 Value 3 WeightedSum Q matches K → scores → weight V → output

Interactive: Click a word to see its attention

Click any word in the sentence to see which other words it attends to most.

The cat sat on the mat

Scaled Dot-Product Attention

Now let's see the actual math. The formula is surprisingly elegant:

Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V

ELI5

Multiply Q and Kᵀ to get "how similar are these?" scores. Divide by √dₖ so the numbers don't get too big. Run softmax to turn the scores into percentages. Multiply by V to get the final answer.

Temperature Analogy

The √dₖ division is like adjusting a thermostat. Without it, dot products get very large in high dimensions, making softmax "too confident" — it would pick one word and ignore everything else. Scaling keeps the temperature comfortable.
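
A quick way to see this effect is to compare softmax on raw versus scaled scores. The sketch below uses random vectors with an arbitrarily chosen dimension of 512: without the √dₖ division the largest score swallows nearly all of the probability, while the scaled version stays noticeably softer.

Python
import torch

torch.manual_seed(0)
d_k = 512                        # a large key dimension, as in real models
q = torch.randn(d_k)
keys = torch.randn(8, d_k)       # 8 candidate keys

scores = keys @ q                # raw dot products grow with the dimension
print(torch.softmax(scores, dim=0).numpy().round(3))              # nearly one-hot: "too confident"
print(torch.softmax(scores / d_k ** 0.5, dim=0).numpy().round(3)) # scaled: visibly softer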

Numerical Example (3×3)

Step 1 — Compute Q · Kᵀ
Q = [[1,0,1],[0,1,1],[1,1,0]], K = [[1,1,0],[0,1,1],[1,0,1]]
Scores = [[1,1,2],[1,2,1],[2,1,1]]
Step 2 — Scale by √dₖ (dₖ = 3, √3 ≈ 1.73)
Scaled = [[0.58, 0.58, 1.15],[0.58, 1.15, 0.58],[1.15, 0.58, 0.58]]
Step 3 — Softmax (row-wise)
Row 1: [0.26, 0.26, 0.47]   Row 2: [0.26, 0.47, 0.26]   Row 3: [0.47, 0.26, 0.26] (each row sums to ~1.0 ✓)
Step 4 — Multiply by V
Output = softmax_weights × V → a weighted combination of the value vectors
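
You can verify these numbers yourself. The sketch below runs the same Q and K through the four steps; the example doesn't specify V, so an identity matrix stands in to keep the output easy to read.

Python
import torch

Q = torch.tensor([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])
K = torch.tensor([[1., 1., 0.], [0., 1., 1.], [1., 0., 1.]])
V = torch.eye(3)                         # placeholder values (not given in the example)

scores = Q @ K.T                         # Step 1: [[1,1,2],[1,2,1],[2,1,1]]
scaled = scores / 3 ** 0.5               # Step 2: divide by sqrt(d_k), d_k = 3
weights = torch.softmax(scaled, dim=-1)  # Step 3: row-wise softmax
output = weights @ V                     # Step 4: weighted combination of value rows

print(weights.numpy().round(2))          # matches the rows above; each row sums to ~1
print(output)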

Matrix Multiplication as Colored Grids

Diagram: the grid shows Q × Kᵀ = Scores for the example above, then ÷ √3, softmax, and × V. Darker cells (higher scores) mean stronger attention.

Interactive: Why √dₖ Matters

See how scaling affects the softmax distribution. Raw scores: [2, 1, 2]


Multi-Head Attention

One attention head finds one type of pattern. But language has many patterns simultaneously — syntax, semantics, coreference. The solution? Run multiple attention heads in parallel.

ELI5

Instead of sending one detective to investigate a crime scene, you send 8 detectives — each looking for something different. One checks fingerprints, another interviews witnesses, another studies the floor. Then they all share their findings.

8 Detectives Analogy

Head 1 might learn "which words are grammatically related." Head 2 might learn "which words refer to the same entity." Head 3 might learn "which words are nearby." Each head has its own Q, K, V projections — its own way of asking questions.

Multi-Head: Parallel Attention Heads → Merge

Diagram: the input feeds 8 parallel heads (Head 1: syntax, Head 2: coreference, Head 3: local context, Head 4: semantics, ..., Head 8); all heads are concatenated and passed through a linear layer W_O to produce the output.

PyTorch Implementation

Python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V):
        B, L, _ = Q.shape
        # Project & reshape to (B, heads, L, d_k)
        q = self.W_q(Q).view(B, L, self.n_heads, self.d_k).transpose(1,2)
        k = self.W_k(K).view(B, L, self.n_heads, self.d_k).transpose(1,2)
        v = self.W_v(V).view(B, L, self.n_heads, self.d_k).transpose(1,2)
        # Scaled dot-product attention
        scores = (q @ k.transpose(-2,-1)) / self.d_k**0.5
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1,2).contiguous().view(B, L, -1)
        return self.W_o(out)
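
As a quick sanity check, here's a minimal usage snippet meant to be run after the class above: passing the same random tensor as Q, K, and V performs self-attention, and the output keeps the input's shape.

Python
# Shape check on random data (batch=2, seq_len=10, d_model=512)
x = torch.randn(2, 10, 512)
mha = MultiHeadAttention(d_model=512, n_heads=8)
out = mha(x, x, x)          # self-attention: Q, K, V are the same sequence
print(out.shape)            # torch.Size([2, 10, 512])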

Self-Attention vs Cross-Attention

ELI5

Self-attention = talking to yourself, "which of my own words relate to each other?"
Cross-attention = asking someone else, "which of YOUR words help ME understand?"

Conversation Analogy

Self-attention is like re-reading your own essay to find connections between paragraphs. Cross-attention is like reading someone else's notes while writing your essay — Q comes from you, but K and V come from them.

Self-Attention

Q, K, V all come from the same sequence

Used in: encoder, decoder (masked)

Cross-Attention

Q from one sequence, K & V from another

Used in: decoder attending to encoder

When Is Each Used?

  • GPT (decoder-only): uses masked self-attention — each token can only attend to itself and earlier tokens
  • BERT (encoder-only): uses bidirectional self-attention — every token sees all tokens
  • T5 / translation models: use both — self-attention in the encoder & decoder, plus cross-attention between them (see the sketch below)
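
To make the distinction concrete, here is a minimal sketch with random tensors and a made-up embedding size. It shows where Q, K, and V come from in each case, plus the causal mask a GPT-style decoder applies.

Python
import torch

d = 8                                    # toy embedding size, purely illustrative
decoder_seq = torch.randn(5, d)          # "your essay" (5 tokens)
encoder_seq = torch.randn(7, d)          # "someone else's notes" (7 tokens)

def attend(Q, K, V, mask=None):
    scores = Q @ K.T / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

# Self-attention: Q, K, V all come from the same sequence
self_out = attend(decoder_seq, decoder_seq, decoder_seq)

# Masked (causal) self-attention, GPT-style: token i attends only to tokens 0..i
causal_mask = torch.tril(torch.ones(5, 5))
masked_out = attend(decoder_seq, decoder_seq, decoder_seq, mask=causal_mask)

# Cross-attention: Q from one sequence, K and V from the other
cross_out = attend(decoder_seq, encoder_seq, encoder_seq)

print(self_out.shape, masked_out.shape, cross_out.shape)  # each is (5, 8)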

Complete Code

Here's a full working implementation of scaled dot-product attention wrapped in a multi-head module, with a test you can run.

ELI5

We're building the attention mechanism from scratch — first the basic function, then wrapping it in a class, then running it on fake data to see it actually work.

Assembly Analogy

Think of it like building a car engine: first we build one piston (single-head attention), then put 8 pistons together (multi-head), then start the engine (test with sample data).

Python — Full Implementation
import torch, torch.nn as nn, math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, L, D = x.shape
        qkv = self.qkv(x).view(B, L, 3, self.n_heads, self.d_k)
        qkv = qkv.permute(2,0,3,1,4)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        attn_out, weights = scaled_dot_product_attention(Q, K, V, mask)
        out = attn_out.transpose(1,2).contiguous().view(B, L, D)
        return self.out(out), weights

# --- Test it! ---
x = torch.randn(1, 6, 64)  # batch=1, seq_len=6, d_model=64
mha = MultiHeadAttention(d_model=64, n_heads=4)
output, attn_weights = mha(x)
print(f"Input:   {x.shape}")
print(f"Output:  {output.shape}")
print(f"Weights: {attn_weights.shape}")
print(f"Attn weights (head 0, first token):")
print(attn_weights[0,0,0].data.numpy().round(3))
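
If everything is wired correctly, the printed shapes are torch.Size([1, 6, 64]) for both input and output and torch.Size([1, 4, 6, 6]) for the weights (batch, heads, query positions, key positions), and the printed attention row should sum to roughly 1.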

Interactive: Explore Attention Patterns

Select a sentence pattern to see how attention weights differ.

What You've Learned

  • Why attention — RNNs forget, attention remembers everything
  • Q, K, V — the query/key/value framework for computing relevance
  • Scaled dot-product — the math: softmax(QKᵀ/√dₖ)V
  • Multi-head — parallel heads capture different patterns
  • Self vs cross — same-sequence vs cross-sequence attention

Quiz — Test Your Understanding

Question 1: What fundamental problem with RNNs did the Attention mechanism solve?

Question 2: In the attention mechanism, what does the Query (Q) represent?

Question 3: In the Query-Key-Value framework, what role does the Key play?

Question 4: What is the correct formula for scaled dot-product attention?

Question 5: Why do we divide by √dₖ in the attention formula?

Question 6: Why does Multi-Head Attention use multiple heads instead of just one?

Question 7: What is the key difference between self-attention and cross-attention?