Build the complete architecture block by block: the engine that powers every modern LLM from GPT to Llama.
The original 2017 "Attention Is All You Need" paper introduced the Transformer with two halves: an encoder (understands input) and a decoder (generates output). But modern LLMs pick and choose which half they need.
Think of the encoder as a reader: it reads the whole sentence and understands it. The decoder is a writer: it creates new text one word at a time. Some models only read (BERT), some only write (GPT), and some do both (T5).
Encoder = the audience. They watch the entire play and form an understanding of the whole story. Decoder = an improv actor. They can only react to what has happened so far (no peeking at the script!). Encoder-Decoder = a translator at the UN. They listen to the full speech (encoder), then translate it sentence by sentence (decoder).
A Transformer is built by stacking identical blocks on top of each other. GPT-3 has 96 blocks. GPT-2 Small has 12. But every single block has the same internal structure:
Think of a factory assembly line with 4 stations. Every piece of text goes through:
Station 1 (Attention): "Look around: what context matters?"
Station 2 (Add & Norm): "Stabilize and remember the original input."
Station 3 (Feed-Forward): "Think deeply โ process and transform."
Station 4 (Add & Norm): "Stabilize again."
Then the output goes into the next identical factory. GPT-3 stacks 96 of these factories!
Each Transformer block is like a floor in an office building. On every floor, you first have a meeting room (attention: everyone shares info), then a quiet desk (feed-forward: individual deep thinking). The elevator (residual connections) lets you carry information from earlier floors. GPT-3 is a 96-story building!
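Put as code, one block in the classic post-norm layout from the original paper looks roughly like the sketch below, with a residual add plus LayerNorm at stations 2 and 4. This is only a conceptual sketch: the attention and feed_forward arguments are placeholders for the modules we build later in this module, and the full GPT at the end uses the pre-norm variant instead.

import torch.nn as nn

class PostNormBlock(nn.Module):
    """Sketch of one Transformer block: the four stations in order."""
    def __init__(self, d_model, attention, feed_forward):
        super().__init__()
        self.attention = attention            # Station 1: gather context
        self.norm1 = nn.LayerNorm(d_model)    # Station 2: stabilize
        self.feed_forward = feed_forward      # Station 3: think deeply
        self.norm2 = nn.LayerNorm(d_model)    # Station 4: stabilize again

    def forward(self, x):
        x = self.norm1(x + self.attention(x))      # attention, add the input back, normalize
        x = self.norm2(x + self.feed_forward(x))   # FFN, add the input back, normalize
        return x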
These two techniques seem simple, but they're absolutely critical for training deep networks. Without them, stacking 96 layers would be impossible: values would explode or vanish to zero.
LayerNorm takes the outputs of a layer and re-centers them to have mean = 0 and standard deviation = 1. Then it applies learnable scale (γ) and shift (β) parameters.
Imagine 96 rooms in a building. Without a thermostat, room 1 might be 70°F, room 50 might be 500°F, and room 96 might be 10,000°F: things keep getting hotter as you go deeper. LayerNorm is the thermostat that resets each room to a comfortable range. It keeps values stable no matter how deep you go.
import torch
import torch.nn as nn

# LayerNorm in 3 lines
x = torch.randn(2, 5)     # batch of 2, dim 5
norm = nn.LayerNorm(5)    # normalize over last dim
out = norm(x)             # mean ≈ 0, std ≈ 1 per sample

print("Before:", x[0])
print("After: ", out[0])
print("Mean:", out[0].mean().item())                # ≈ 0.0
print("Std: ", out[0].std(unbiased=False).item())   # ≈ 1.0 (LayerNorm uses the biased std)
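To see the learnable scale (γ) and shift (β) at work, here is a minimal hand-rolled check that continues the snippet above. It assumes nn.LayerNorm's defaults (eps = 1e-5, γ initialized to ones, β to zeros), so the manual result should match the built-in one.

# Manual LayerNorm: normalize, then apply learnable scale (gamma) and shift (beta)
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance, like nn.LayerNorm
x_hat = (x - mean) / torch.sqrt(var + eps)          # re-centered and re-scaled
manual = norm.weight * x_hat + norm.bias            # gamma * x_hat + beta
print(torch.allclose(manual, out, atol=1e-5))       # True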
A residual connection simply adds the input of a layer back to its output: output = layer(x) + x. That's it. One line of code, but it's revolutionary.
Imagine you're learning to draw a cat. A residual connection is like having a photocopy of your original drawing at every step. If step 5 accidentally makes the drawing worse, you still have the original to fall back on. The network learns "What should I add to improve this?" rather than "What should the whole answer be?", a much easier question!
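Here's a minimal sketch of that one line in PyTorch. The Residual wrapper and the Linear layer inside it are just illustrative stand-ins for any sublayer.

import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wrap any sublayer so its output becomes sublayer(x) + x."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return self.sublayer(x) + x   # the sublayer only has to learn what to *add*

block = Residual(nn.Linear(768, 768))
x = torch.randn(4, 768)
print(block(x).shape)   # torch.Size([4, 768]): the original input is carried straight through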
After attention has gathered context from other tokens, each token passes through a feed-forward network (FFN) independently. This is where the "thinking" and "knowledge storage" happen.
Attention is the group meeting where everyone shares information. The FFN is the quiet desk work afterward where each person processes what they heard and forms their own conclusions. Every token does this step completely alone: no looking at other tokens.
The FFN is just two linear layers with an activation in between. The key trick: the inner dimension is 4× bigger than the model dimension. It's like taking a deep breath: inhale (expand), process (activate), exhale (compress back).
Inhale (Linear 1): expand from 768 dims to 3,072 dims, creating room to think in a bigger space.
Hold (GELU activation): apply non-linearity, deciding which neurons fire.
Exhale (Linear 2): compress back from 3,072 to 768 dims, distilling the essential information.
This expand-compress pattern lets the network temporarily work in a higher-dimensional space where patterns are easier to separate, then project the insights back down.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or d_model * 4
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: 768 → 3072
            nn.GELU(),                  # activate
            nn.Linear(d_ff, d_model),   # compress: 3072 → 768
        )

    def forward(self, x):
        return self.net(x)

ffn = FeedForward(768)
x = torch.randn(1, 10, 768)   # (batch, seq_len, d_model)
out = ffn(x)                  # same shape: (1, 10, 768)
print(out.shape)              # torch.Size([1, 10, 768])
In a decoder-only model like GPT, there's one critical rule: a token can only attend to tokens that came before it (and itself). It cannot peek at future tokens. This is enforced by a causal mask.
Imagine reading a mystery novel. You can re-read earlier pages to find clues, but you're not allowed to flip ahead. Causal masking is like putting a physical blocker on the book that only lets you see pages you've already read. Each word can only look at words to its left, never to its right.
During training, GPT sees the entire sentence at once (for efficiency). But to learn to predict the next word, it must pretend it hasn't seen the future. The mask fills future positions with -∞ before softmax, which converts to attention weight = 0.
import torch

def create_causal_mask(seq_len):
    """Additive causal mask: 0 = attend, -inf = block."""
    mask = torch.tril(torch.ones(seq_len, seq_len))   # 1s on and below the diagonal
    # Convert 0s to -inf for attention scores
    mask = mask.masked_fill(mask == 0, float('-inf'))
    mask = mask.masked_fill(mask == 1, 0.0)
    return mask

mask = create_causal_mask(5)
print(mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         [0.,   0.,   0., -inf, -inf],
#         [0.,   0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.,   0.]])
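To confirm the -∞ entries really become attention weight 0 after softmax, here's a tiny continuation of the snippet above; the scores tensor is random, purely for illustration.

scores = torch.randn(5, 5)                  # pretend attention scores (query x key)
weights = (scores + mask).softmax(dim=-1)   # add the causal mask, then softmax each row
print(weights)                              # upper-triangular entries are exactly 0.0
print(weights.sum(dim=-1))                  # each row still sums to 1 over visible positions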
Time to put it all together. Below is a complete, working GPT-style decoder-only Transformer in PyTorch. Every component from this module (attention, FFN, LayerNorm, residuals, masking) assembled into one model.
We're assembling all the LEGO pieces from this course into the full robot. Input text → token embeddings + position encoding → N transformer blocks (each: masked attention → norm → FFN → norm) → project to vocabulary → output probability for next word. This is the architecture behind ChatGPT.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257   # GPT-2 vocabulary size
    max_seq_len: int = 1024   # max context window
    d_model: int = 768        # embedding dimension
    n_heads: int = 12         # attention heads
    n_layers: int = 12        # transformer blocks
    d_ff: int = 3072          # FFN inner dimension (4 × 768)
    dropout: float = 0.1
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.qkv = nn.Linear(cfg.d_model, 3 * cfg.d_model)
        self.proj = nn.Linear(cfg.d_model, cfg.d_model)
        self.dropout = nn.Dropout(cfg.dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)           # (3, B, n_heads, T, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask[:T, :T]         # apply the causal mask
        attn = scores.softmax(dim=-1)
        attn = self.dropout(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, cfg.d_ff),
            nn.GELU(),
            nn.Linear(cfg.d_ff, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.attn = MultiHeadAttention(cfg)
        self.ffn = FeedForward(cfg)
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.ln2 = nn.LayerNorm(cfg.d_model)

    def forward(self, x, mask=None):
        # Pre-norm variant (used by GPT-2 and later)
        x = x + self.attn(self.ln1(x), mask)   # residual + attention
        x = x + self.ffn(self.ln2(x))          # residual + FFN
        return x
class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        # Token + positional embeddings
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        self.drop = nn.Dropout(cfg.dropout)
        # Stack of N transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(cfg) for _ in range(cfg.n_layers)
        ])
        # Final layer norm + projection to vocab
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        # Weight tying (as in GPT-2): the output projection shares the token
        # embedding matrix, saving ~38M parameters
        self.head.weight = self.tok_emb.weight
        # Causal mask (registered as buffer, not a parameter)
        mask = torch.triu(
            torch.full((cfg.max_seq_len, cfg.max_seq_len), float('-inf')),
            diagonal=1
        )
        self.register_buffer('mask', mask)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)                                  # (B, T, d_model)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # (T, d_model)
        x = self.drop(tok + pos)                                 # combine embeddings
        for block in self.blocks:
            x = block(x, self.mask)                              # pass through each block
        x = self.ln_f(x)                                         # final layer norm
        logits = self.head(x)                                    # project to vocab size
        return logits                                            # (B, T, vocab_size)
cfg = GPTConfig()
model = GPT(cfg)

n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params:,}")
# → Model parameters: 124,439,808 (~124M: this is GPT-2 Small!)

# Quick test: feed random token IDs
idx = torch.randint(0, cfg.vocab_size, (2, 64))   # batch=2, seq_len=64
logits = model(idx)
print(f"Output shape: {logits.shape}")
# → Output shape: torch.Size([2, 64, 50257])
# For each token position: a score (logit) over 50,257 possible next tokens
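As a quick end-to-end sanity check, here's a minimal greedy decoding sketch (not part of the model above; generate_greedy is just an illustrative helper): feed the sequence in, take the highest-scoring next token, append it, and repeat. With untrained random weights the output IDs are meaningless, which is exactly the point of the next module.

@torch.no_grad()
def generate_greedy(model, idx, max_new_tokens=10):
    """Append the argmax next token, one step at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.cfg.max_seq_len:]                # crop to the context window
        logits = model(idx_cond)                                  # (B, T, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        idx = torch.cat([idx, next_id], dim=1)                    # append and continue
    return idx

prompt = torch.randint(0, cfg.vocab_size, (1, 5))   # a pretend prompt of 5 token IDs
print(generate_greedy(model, prompt))               # 15 token IDs; gibberish until trained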
Adjust the hyperparameters and see how the parameter count changes!
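For example, here's a small sketch that builds a few configurations and prints their sizes; the "wider" and "deeper" variants below are made up purely for illustration.

def count_params(cfg):
    """Build a model from the given config and count its parameters."""
    return sum(p.numel() for p in GPT(cfg).parameters())

for name, c in [
    ("default (GPT-2 Small)", GPTConfig()),
    ("wider",  GPTConfig(d_model=1024, n_heads=16, d_ff=4096)),
    ("deeper", GPTConfig(n_layers=24)),
]:
    print(f"{name:>22}: {count_params(c):,} parameters")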
We have the architecture, but a randomly initialized GPT produces gibberish. In the next module, we'll cover training: how to feed it billions of tokens, compute the loss (cross-entropy), and iteratively update all 124M+ parameters until the model can write coherent text. The training loop is the same 5 steps from Module 3, just scaled to hundreds of GPUs!