Module 5 - Core Architecture

🏗️ The Transformer Architecture

Build the complete architecture block by block - the engine that powers every modern LLM from GPT to Llama.

Part 1: Encoder vs Decoder

The original 2017 "Attention Is All You Need" paper introduced the Transformer with two halves: an encoder (understands input) and a decoder (generates output). But modern LLMs pick and choose which half they need.

👶 Like You're 5

Think of the encoder as a reader - it reads the whole sentence and understands it. The decoder is a writer - it creates new text one word at a time. Some models only read (BERT), some only write (GPT), and some do both (T5).

💡 The Three Families

  • Encoder-Only (BERT) - reads the entire input at once. Great for classification, sentiment analysis, search. Sees all words simultaneously.
  • Decoder-Only (GPT, Llama) - generates text left-to-right, one token at a time. This is what ChatGPT uses. Can only look at past tokens.
  • Encoder-Decoder (T5, BART) - encoder reads input, decoder generates output. Great for translation, summarization.
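
If you want to feel the difference in practice, here is a minimal sketch using the Hugging Face transformers library (not part of this module's own code; the task names are real pipeline tasks, the model names are common defaults and will download weights on first run):

Python / The Three Families in Practice
from transformers import pipeline

# Encoder-only: read the whole input and classify it
classifier = pipeline("sentiment-analysis")          # defaults to a small BERT-style encoder
print(classifier("I loved this movie!"))

# Decoder-only: generate text left-to-right
generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=20))

# Encoder-decoder: read the input, then generate the output
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The cat sat on the mat."))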

🔀 Encoder vs Decoder - Side by Side

ENCODER: Self-Attention (bidirectional) → Add & Norm → Feed-Forward → Add & Norm, × N layers. 👀 Sees ALL tokens. Models: BERT • RoBERTa • ELECTRA.
DECODER: Masked Self-Attention (causal - left only) → Add & Norm → Feed-Forward → Add & Norm, × N layers. ➡️ Sees only PAST tokens. Models: GPT • Llama • Mistral.

🎭 The Theater Analogy

Encoder = the audience. They watch the entire play and form an understanding of the whole story. Decoder = an improv actor. They can only react to what has happened so far - no peeking at the script! Encoder-Decoder = a translator at the UN. They listen to the full speech (encoder), then translate it sentence by sentence (decoder).

Part 2: Inside a Transformer Block

A Transformer is built by stacking identical blocks on top of each other. GPT-3 has 96 blocks. GPT-2 Small has 12. But every single block has the same internal structure:

👶 Like You're 5

Think of a factory assembly line with 4 stations. Every piece of text goes through:
Station 1 (Attention): "Look around - what context matters?"
Station 2 (Add & Norm): "Stabilize and remember the original input."
Station 3 (Feed-Forward): "Think deeply โ€” process and transform."
Station 4 (Add & Norm): "Stabilize again."
Then the output goes into the next identical factory. GPT stacks 96 of these factories!

💡 The Four Components (in order)

  • Multi-Head Attention - each token looks at every other token to gather context
  • Add & LayerNorm - add the input back (residual) and normalize values
  • Feed-Forward Network - two linear layers that do the "thinking"
  • Add & LayerNorm - another residual connection and normalization

🎬 Data Flowing Through a Single Transformer Block

Input Embeddings → 🔍 Multi-Head Attention → ➕ Add & LayerNorm (residual) → 🧠 Feed-Forward Network → ➕ Add & LayerNorm (residual) → Next Block (or Output)

๐Ÿข The Office Building

Each Transformer block is like a floor in an office building. On every floor, you first have a meeting room (attention โ€” everyone shares info), then a quiet desk (feed-forward โ€” individual deep thinking). The elevator (residual connections) lets you carry information from earlier floors. GPT-3 is a 96-story building!

Part 3: Layer Normalization & Residuals

These two techniques seem simple, but they're absolutely critical for training deep networks. Without them, stacking 96 layers would be impossible - values would explode or vanish to zero.

Layer Normalization

LayerNorm takes the outputs of a layer and re-centers them to have mean = 0 and standard deviation = 1. Then it applies learnable scale (γ) and shift (β) parameters.

๐ŸŒก๏ธ The Thermostat

Imagine 96 rooms in a building. Without a thermostat, room 1 might be 70ยฐF, room 50 might be 500ยฐF, and room 96 might be 10,000ยฐF โ€” things keep getting hotter as you go deeper. LayerNorm is the thermostat that resets each room to a comfortable range. It keeps values stable no matter how deep you go.

Python / LayerNorm
import torch
import torch.nn as nn

# LayerNorm in 3 lines
x = torch.randn(2, 5)      # batch of 2, dim 5
norm = nn.LayerNorm(5)      # normalize over last dim
out = norm(x)                # mean≈0, std≈1 per sample

print("Before:", x[0])
print("After: ", out[0])
print("Mean:  ", out[0].mean().item())   # ≈ 0.0
print("Std:   ", out[0].std().item())    # ≈ 1.0

Residual Connections (Skip Connections)

A residual connection simply adds the input of a layer back to its output: output = layer(x) + x. That's it. One line of code, but it's revolutionary.

👶 Like You're 5

Imagine you're learning to draw a cat. A residual connection is like having a photocopy of your original drawing at every step. If step 5 accidentally makes the drawing worse, you still have the original to fall back on. The network learns: "What should I add to improve this?" rather than "What should the whole answer be?" - a much easier question!

โ†—๏ธ The Skip Connection โ€” Bypassing a Layer

x (input) Layer (Attn or FFN) SKIP (+ x) + Layer(x) + x ๐Ÿ’ก Safety net! If the layer learns nothing useful, the original x passes through

💡 Why These Matter

  • LayerNorm prevents values from exploding/vanishing across 96+ layers
  • Residual connections let gradients flow directly backward, making deep training possible
  • Together they allow you to stack blocks essentially infinitely - the network depth becomes a choice, not a limitation
  • Without residuals, training a 96-layer network would be like whispering a message through 96 people - it gets garbled. With residuals, each person also gets the original message (see the small experiment below).
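
You can check the gradient-flow claim with a small experiment - a sketch comparing the gradient that reaches the input of a 50-layer stack with and without skip connections (the exact numbers depend on initialization, but the gap is typically many orders of magnitude):

Python / Residuals and Gradient Flow
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

def run(x, use_residual):
    for layer in layers:
        out = torch.tanh(layer(x))
        x = x + out if use_residual else out   # residual: keep adding the input back
    return x

for use_residual in (False, True):
    x = torch.randn(8, dim, requires_grad=True)
    run(x, use_residual).sum().backward()
    print(f"residual={use_residual}: gradient norm at input = {x.grad.norm().item():.2e}")

# Without residuals the gradient reaching the input has all but vanished;
# with residuals it stays many orders of magnitude larger.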

Part 4: The Feed-Forward Network

After attention has gathered context from other tokens, each token passes through a feed-forward network (FFN) independently. This is where the "thinking" and "knowledge storage" happens.

👶 Like You're 5

Attention is the group meeting where everyone shares information. The FFN is the quiet desk work afterward where each person processes what they heard and forms their own conclusions. Every token does this step completely alone - no looking at other tokens.

The Architecture: Expand → Activate → Compress

The FFN is just two linear layers with an activation in between. The key trick: the inner dimension is 4× bigger than the model dimension. It's like taking a deep breath - inhale (expand), process (activate), exhale (compress back).

๐Ÿซ The Breathing Analogy

Inhale (Linear 1): expand from 768 dims to 3,072 dims โ€” create room to think in a bigger space.
Hold (GELU activation): apply non-linearity โ€” decide which neurons fire.
Exhale (Linear 2): compress back from 3,072 to 768 dims โ€” distill the essential information.

This expand-compress pattern lets the network temporarily work in a higher-dimensional space where patterns are easier to separate, then project the insights back down.

๐Ÿซ Expand โ†’ Activate โ†’ Compress

768 Input Linear 3,072 Expanded GELU โšก non-linearity Linear 768 Output ๐Ÿซ Inhale ๐Ÿ’จ Hold ๐Ÿซ Exhale inner dim = 4 ร— d_model 3072 = 4 ร— 768
Python / FFN
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or d_model * 4
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: 768 → 3072
            nn.GELU(),                   # activate
            nn.Linear(d_ff, d_model),   # compress: 3072 → 768
        )

    def forward(self, x):
        return self.net(x)

ffn = FeedForward(768)
x = torch.randn(1, 10, 768)   # (batch, seq_len, d_model)
out = ffn(x)                     # same shape: (1, 10, 768)
print(out.shape)                 # torch.Size([1, 10, 768])
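
A quick check of the "every token works alone" claim: change one position in the sequence and every other position's FFN output stays exactly the same (this reuses the ffn defined just above):

Python / FFN Token Independence
a = torch.randn(1, 10, 768)
b = a.clone()
b[0, 5] = torch.randn(768)                        # change only token 5
out_a, out_b = ffn(a), ffn(b)
print(torch.allclose(out_a[0, 0], out_b[0, 0]))   # True  - token 0 is unaffected
print(torch.allclose(out_a[0, 5], out_b[0, 5]))   # False - only the changed token differs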

💡 Where Knowledge Lives

  • Research shows that factual knowledge (Paris is the capital of France) is stored primarily in FFN weights
  • Attention decides what to focus on; FFN decides what to do with it
  • The 4× expansion ratio is a design choice - some models use ~8/3× with gated variants (SwiGLU); see the sketch after this list
  • FFN parameters make up ~⅔ of a Transformer's total parameter count
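
For reference, here is a minimal sketch of a gated SwiGLU-style FFN of the kind used by Llama-family models (the 2,048 inner dimension is simply 8/3 × 768 for illustration; real models pick their own sizes):

Python / Gated FFN (SwiGLU)
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up   = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SiLU(gate(x)) acts as a learned gate on up(x), then project back down
        return self.down(F.silu(self.gate(x)) * self.up(x))

gated_ffn = SwiGLUFeedForward(768, 2048)          # three matrices, so the inner dim is smaller than 4x
print(gated_ffn(torch.randn(1, 10, 768)).shape)   # torch.Size([1, 10, 768])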

Part 5: Causal Masking

In a decoder-only model like GPT, there's one critical rule: a token can only attend to tokens that came before it (and itself). It cannot peek at future tokens. This is enforced by a causal mask.

👶 Like You're 5

Imagine reading a mystery novel. You can re-read earlier pages to find clues, but you're not allowed to flip ahead. Causal masking is like putting a physical blocker on the book that only lets you see pages you've already read. Each word can only look at words to its left - never to its right.

Why Masking is Essential

During training, GPT sees the entire sentence at once (for efficiency). But to learn to predict the next word, it must pretend it hasn't seen the future. The mask fills future positions with -∞ before softmax, which softmax then turns into attention weights of exactly 0.

🔲 The Causal Mask - Lower Triangular Matrix

Rows = Queries (who's looking), Columns = Keys (what's being looked at)

         The    cat    sat    on     the
The       ✓     -∞     -∞     -∞     -∞
cat       ✓      ✓     -∞     -∞     -∞
sat       ✓      ✓      ✓     -∞     -∞
on        ✓      ✓      ✓      ✓     -∞
the       ✓      ✓      ✓      ✓      ✓

✓ = can attend (score computed)   -∞ = masked (becomes 0 after softmax)
Python / Causal Mask
import torch

def create_causal_mask(seq_len):
    """Lower-triangular mask: 1 = attend, 0 = block"""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    # Convert 0s to -inf for attention scores
    mask = mask.masked_fill(mask == 0, float('-inf'))
    mask = mask.masked_fill(mask == 1, 0.0)
    return mask

mask = create_causal_mask(5)
print(mask)
# tensor([[  0., -inf, -inf, -inf, -inf],
#         [  0.,   0., -inf, -inf, -inf],
#         [  0.,   0.,   0., -inf, -inf],
#         [  0.,   0.,   0.,   0., -inf],
#         [  0.,   0.,   0.,   0.,   0.]])
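
To see the mask doing its job end to end, a short sketch (the scores are random, purely for illustration) adds it to a 5×5 score matrix and runs softmax:

Python / Mask + Softmax
scores = torch.randn(5, 5)                               # raw attention scores for 5 tokens
weights = (scores + create_causal_mask(5)).softmax(dim=-1)
print(weights)
# Every entry above the diagonal is exactly 0, and each row still sums to 1 -
# each token's attention is spread only over itself and earlier tokens.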

💡 Masking Takeaways

  • Causal mask = lower triangular matrix of 0s with -∞ above the diagonal
  • After adding the mask to attention scores, softmax turns -∞ → 0 attention weight
  • This forces autoregressive behavior: predict next token using only previous tokens
  • BERT doesn't use causal masking (it sees everything) - that's why BERT can't generate text

Part 6: Build a Full GPT-Style Transformer

Time to put it all together. Below is a complete, working GPT-style decoder-only Transformer in PyTorch. Every component from this module - attention, FFN, LayerNorm, residuals, masking - assembled into one model.

👶 What We're Building

We're assembling all the LEGO pieces from this course into the full robot. Input text → token embeddings + position encoding → N transformer blocks (each: masked attention → norm → FFN → norm) → project to vocabulary → output probability for next word. This is the architecture behind ChatGPT.

Step 1: Configuration

Python / Config
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size:  int = 50257   # GPT-2 vocabulary size
    max_seq_len: int = 1024    # max context window
    d_model:     int = 768     # embedding dimension
    n_heads:     int = 12      # attention heads
    n_layers:    int = 12      # transformer blocks
    d_ff:        int = 3072    # FFN inner dimension (4 × 768)
    dropout:     float = 0.1
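
Scaling up is mostly a matter of changing these numbers. As a rough sketch (sizes taken from the GPT-2 family; the parameter estimate assumes the tied input/output embeddings used in Step 3):

Python / Scaling the Config
# Roughly GPT-2 Medium-sized: ~355M parameters with tied embeddings
medium_cfg = GPTConfig(d_model=1024, n_heads=16, n_layers=24, d_ff=4096)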

Step 2: The Transformer Block

Python / Transformer Block
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.qkv = nn.Linear(cfg.d_model, 3 * cfg.d_model)
        self.proj = nn.Linear(cfg.d_model, cfg.d_model)
        self.dropout = nn.Dropout(cfg.dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask[:T, :T]
        attn = scores.softmax(dim=-1)
        attn = self.dropout(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, cfg.d_ff),
            nn.GELU(),
            nn.Linear(cfg.d_ff, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )
    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.attn = MultiHeadAttention(cfg)
        self.ffn  = FeedForward(cfg)
        self.ln1  = nn.LayerNorm(cfg.d_model)
        self.ln2  = nn.LayerNorm(cfg.d_model)

    def forward(self, x, mask=None):
        # Pre-norm variant (used by GPT-2 and later)
        x = x + self.attn(self.ln1(x), mask)   # residual + attention
        x = x + self.ffn(self.ln2(x))          # residual + FFN
        return x
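
A quick sanity check with untrained weights: a block maps a (batch, seq_len, d_model) tensor to a tensor of exactly the same shape, which is what makes blocks stackable.

Python / Block Shape Check
cfg = GPTConfig()
block = TransformerBlock(cfg)
x = torch.randn(2, 16, cfg.d_model)
print(block(x).shape)   # torch.Size([2, 16, 768]) - same shape out as in, ready for the next block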

Step 3: The Full GPT Model

Python / Full GPT
class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg

        # Token + positional embeddings
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        self.drop = nn.Dropout(cfg.dropout)

        # Stack of N transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(cfg) for _ in range(cfg.n_layers)
        ])

        # Final layer norm + projection to vocab
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight   # weight tying: share input/output embeddings, as GPT-2 does

        # Causal mask (registered as buffer - not a parameter)
        mask = torch.triu(
            torch.full((cfg.max_seq_len, cfg.max_seq_len), float('-inf')),
            diagonal=1
        )
        self.register_buffer('mask', mask)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)                              # (B, T, d_model)
        pos = self.pos_emb(torch.arange(T, device=idx.device)) # (T, d_model)
        x = self.drop(tok + pos)                             # combine embeddings

        for block in self.blocks:
            x = block(x, self.mask)                          # pass through each block

        x = self.ln_f(x)                                      # final layer norm
        logits = self.head(x)                                 # project to vocab size
        return logits                                          # (B, T, vocab_size)

Step 4: Instantiate and Count Parameters

Python / Parameter Count
cfg = GPTConfig()
model = GPT(cfg)

n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params:,}")
# → Model parameters: 124,439,808  (~124M - this is GPT-2 Small!)

# Quick test: feed random token IDs
idx = torch.randint(0, cfg.vocab_size, (2, 64))  # batch=2, seq_len=64
logits = model(idx)
print(f"Output shape: {logits.shape}")
# → Output shape: torch.Size([2, 64, 50257])
# For each token position → a logit (unnormalized score) for each of the 50,257 possible next tokens
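
The model is still untrained, so it emits essentially random tokens, but a minimal greedy decoding loop (a sketch - real decoders add sampling, temperature, and a KV cache) shows how next-token prediction turns into generation:

Python / Greedy Generation (sketch)
@torch.no_grad()
def generate(model, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = model(idx[:, -model.cfg.max_seq_len:])            # crop to the context window
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedy: pick the top logit
        idx = torch.cat([idx, next_id], dim=1)                     # append and continue
    return idx

prompt = torch.randint(0, cfg.vocab_size, (1, 5))          # stand-in for a tokenized prompt
print(generate(model, prompt, max_new_tokens=10).shape)    # torch.Size([1, 15])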

๐Ÿ—๏ธ The Full GPT Architecture โ€” Top to Bottom

Input Token IDs [3, 412, 87, ...] Token Embedding + Positional Encoding Transformer Block 1 Masked Attn โ†’ Add&Norm โ†’ FFN โ†’ Add&Norm (with residual connections) Transformer Block 2 โ‹ฎ ร— N layers Transformer Block N Final LayerNorm Linear โ†’ Softmax โ†’ Next Token Probs P("the") = 0.02, P("cat") = 0.35, P("dog") = 0.12 ...

🎓 What You Just Built

  • A complete GPT-style decoder-only Transformer - the same architecture behind ChatGPT
  • Token + positional embeddings → N transformer blocks → final projection to vocabulary
  • Each block: masked multi-head attention → add & norm → FFN → add & norm
  • With default config: ~124M parameters (GPT-2 Small). Scale to 175B parameters by increasing d_model, n_layers, and n_heads.
  • The only missing piece: training data and compute. The architecture itself is surprisingly compact!

🔗 What's Next?

We have the architecture - but a randomly initialized GPT produces gibberish. In the next module, we'll cover training: how to feed it billions of tokens, compute the loss (cross-entropy), and iteratively update all 124M+ parameters until the model can write coherent text. The training loop is the same 5 steps from Module 3 - just scaled to hundreds of GPUs!
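
As a tiny preview of that loop, here is a minimal sketch of one next-token training step on random token IDs (real training uses a tokenized corpus, an optimizer such as AdamW, and millions of such steps):

Python / One Training Step (sketch)
import torch.nn.functional as F

batch = torch.randint(0, cfg.vocab_size, (2, 65))
inputs, targets = batch[:, :-1], batch[:, 1:]       # predict token t+1 from tokens up to t
logits = model(inputs)                              # (2, 64, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, cfg.vocab_size), targets.reshape(-1))
loss.backward()                                     # gradients for every parameter
print(loss.item())   # ≈ ln(50257) ≈ 10.8 for an untrained model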

Quiz - Test Your Knowledge

Question 1: Which architecture family does GPT (the model behind ChatGPT) belong to?

Question 2: What are the four components inside a single Transformer block, in correct order?

Question 3: What does Layer Normalization (LayerNorm) do?

Question 4: What is the purpose of residual (skip) connections in a Transformer?

Question 5: What role does the Feed-Forward Network (FFN) play inside a Transformer block?

Question 6: Why does GPT use causal masking (filling future positions with -∞)?