The single most important idea behind modern LLMs. Learn how models decide what to focus on — with analogies, math, and code.
Before Attention was invented, sequence models (RNNs) had a serious flaw: they forgot things. The longer the sentence, the worse they performed.
Imagine reading a whole book, but you can only remember the last sentence you read. That's an RNN. Now imagine you can flip back to any page whenever you need to — that's Attention!
An RNN is like looking at a whole room through a tiny keyhole: you only see what's directly in front of you. Attention gives you a wide-open door: you can see the entire room at once and choose where to look.
Attention works through three vectors for every word: a Query, a Key, and a Value. Together they answer one question: "How much should I pay attention to every other word?"
You walk into a library with a question (Query). Every book has a title (Key). You compare your question to each title — the better the match, the more you read that book's contents (Value).
Query = "I want to learn about cats" (your question)
Keys = ["Animal Behavior", "Quantum Physics", "Cat Care Guide"] (book titles)
Values = [actual content of each book]
Your query matches "Cat Care Guide" best → you read mostly that book's content.
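To make the analogy concrete, here's a toy sketch in PyTorch. The vectors are invented purely for illustration; the point is that the query's dot product with each key measures how well your question matches each title, and softmax turns those matches into "how much of each book to read":

```python
import torch

query = torch.tensor([0.9, 0.1, 0.0, 0.8])   # "I want to learn about cats"
keys = torch.tensor([
    [0.6, 0.2, 0.1, 0.3],                    # "Animal Behavior"
    [0.0, 0.9, 0.8, 0.0],                    # "Quantum Physics"
    [0.9, 0.0, 0.1, 0.9],                    # "Cat Care Guide"
])
values = torch.eye(3)                        # one-hot stand-ins for each book's content

scores = keys @ query                        # how well the question matches each title
weights = torch.softmax(scores, dim=0)       # turn matches into reading proportions
print(weights)                               # highest weight lands on "Cat Care Guide"
print(weights @ values)                      # the blended "content" you walk away with
```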
Now let's see the actual math. The formula is surprisingly elegant:
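Attention(Q, K, V) = softmax(QKᵀ / √dk) · V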
Multiply Q by Kᵀ to get "how similar are these two words?" scores. Divide by √dk so the numbers don't get too big. Run softmax to turn the scores into percentages. Multiply by V to get the final answer.
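Those four steps map one-to-one onto code. Here's a minimal single-head sketch (the tensor shapes are assumed, e.g. (seq_len, d_k)):

```python
import torch, math

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1)         # 1. how similar is each pair of words?
    scores = scores / math.sqrt(d_k)         # 2. scale so the numbers stay comfortable
    weights = torch.softmax(scores, dim=-1)  # 3. turn scores into percentages
    return weights @ V                       # 4. blend the values by those percentages
```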
The √dk division is like adjusting a thermostat. Without it, dot products get very large in high dimensions, making softmax "too confident" — it would pick one word and ignore everything else. Scaling keeps the temperature comfortable.
For example, take raw scores of [2, 1, 2]. Softmax turns them into roughly [0.42, 0.15, 0.42], a soft mix of all three words. Multiply those same scores by 10 (which is what unscaled dot products look like in high dimensions) and softmax collapses to roughly [0.50, 0.00, 0.50]: the middle word is effectively ignored.
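A quick sketch of that effect; multiplying the raw scores by 10 stands in for the oversized dot products you get in high dimensions:

```python
import torch

raw = torch.tensor([2.0, 1.0, 2.0])
print(torch.softmax(raw, dim=0))       # ~[0.42, 0.15, 0.42]: a soft mix of all three
print(torch.softmax(raw * 10, dim=0))  # ~[0.50, 0.00, 0.50]: the middle word vanishes
```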
One attention head finds one type of pattern. But language has many patterns simultaneously — syntax, semantics, coreference. The solution? Run multiple attention heads in parallel.
Instead of sending one detective to investigate a crime scene, you send 8 detectives — each looking for something different. One checks fingerprints, another interviews witnesses, another studies the floor. Then they all share their findings.
Head 1 might learn "which words are grammatically related." Head 2 might learn "which words refer to the same entity." Head 3 might learn "which words are nearby." Each head has its own Q, K, V projections — its own way of asking questions.
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V):
        B, L_q, _ = Q.shape          # query length
        L_kv = K.shape[1]            # key/value length (differs in cross-attention)
        # Project & reshape to (B, heads, L, d_k)
        q = self.W_q(Q).view(B, L_q, self.n_heads, self.d_k).transpose(1, 2)
        k = self.W_k(K).view(B, L_kv, self.n_heads, self.d_k).transpose(1, 2)
        v = self.W_v(V).view(B, L_kv, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, L_q, -1)
        return self.W_o(out)
```
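A quick sanity check of the block above, with random data (self-attention, so the same tensor is passed as Q, K, and V):

```python
x = torch.randn(2, 10, 512)           # batch=2, seq_len=10, d_model=512
mha = MultiHeadAttention(d_model=512, n_heads=8)
print(mha(x, x, x).shape)             # torch.Size([2, 10, 512]): same shape in, same shape out
```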
Self-attention = talking to yourself: "which of my own words relate to each other?"
Cross-attention = asking someone else: "which of YOUR words help ME understand?"
Self-attention is like re-reading your own essay to find connections between paragraphs. Cross-attention is like reading someone else's notes while writing your essay — Q comes from you, but K and V come from them.
Self-attention: Q, K, V all come from the same sequence. Used in: the encoder, and the decoder (masked).
Cross-attention: Q from one sequence, K & V from another. Used in: the decoder attending to the encoder.
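Using the MultiHeadAttention class defined above, the only difference between the two is which tensors you pass in. A sketch with made-up encoder and decoder activations (random tensors standing in for real hidden states):

```python
mha = MultiHeadAttention(d_model=512, n_heads=8)

encoder_out = torch.randn(1, 12, 512)   # stand-in for an encoded 12-token source sentence
decoder_x = torch.randn(1, 7, 512)      # stand-in for the 7 target tokens generated so far

self_out = mha(decoder_x, decoder_x, decoder_x)        # self-attention: Q, K, V from one sequence
cross_out = mha(decoder_x, encoder_out, encoder_out)   # cross-attention: Q from decoder, K & V from encoder
print(self_out.shape, cross_out.shape)                 # both torch.Size([1, 7, 512])
```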
Here's a full working implementation of scaled dot-product attention, wrapped in a multi-head module, with a test you can run.
We're building the attention mechanism from scratch — first the basic function, then wrapping it in a class, then running it on fake data to see it actually work.
Think of it like building a car engine: first we build one piston (single-head attention), then put 8 pistons together (multi-head), then start the engine (test with sample data).
```python
import torch, torch.nn as nn, math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, L, D = x.shape
        qkv = self.qkv(x).view(B, L, 3, self.n_heads, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        attn_out, weights = scaled_dot_product_attention(Q, K, V, mask)
        out = attn_out.transpose(1, 2).contiguous().view(B, L, D)
        return self.out(out), weights

# --- Test it! ---
x = torch.randn(1, 6, 64)   # batch=1, seq_len=6, d_model=64
mha = MultiHeadAttention(d_model=64, n_heads=4)
output, attn_weights = mha(x)
print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
print(f"Weights: {attn_weights.shape}")
print("Attn weights (head 0, first token):")
print(attn_weights[0, 0, 0].data.numpy().round(3))
```
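The mask argument never gets exercised in that test. As a small follow-up, reusing mha and x from above: a lower-triangular (causal) mask stops each token from attending to anything that comes after it, which is how a decoder behaves during generation.

```python
causal_mask = torch.tril(torch.ones(6, 6))       # (L, L): 1 = may attend, 0 = blocked
output, attn_weights = mha(x, mask=causal_mask)
print(attn_weights[0, 0].data.numpy().round(3))  # head 0: entries above the diagonal are ~0
```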