Build the complete architecture block by block: the engine that powers every modern LLM from GPT to Llama.
The original 2017 "Attention Is All You Need" paper introduced the Transformer with two halves: an encoder (understands input) and a decoder (generates output). But modern LLMs pick and choose which half they need.
Think of the encoder as a reader: it reads the whole sentence and understands it. The decoder is a writer: it creates new text one word at a time. Some models only read (BERT), some only write (GPT), and some do both (T5).
Encoder = the audience. They watch the entire play and form an understanding of the whole story. Decoder = an improv actor. They can only react to what has happened so far (no peeking at the script!). Encoder-Decoder = a translator at the UN. They listen to the full speech (encoder), then translate it sentence by sentence (decoder).
A Transformer is built by stacking identical blocks on top of each other. GPT-3 has 96 blocks. GPT-2 Small has 12. But every single block has the same internal structure:
Think of a factory assembly line with 4 stations. Every piece of text goes through:
Station 1 (Attention): "Look around: what context matters?"
Station 2 (Add & Norm): "Stabilize and remember the original input."
Station 3 (Feed-Forward): "Think deeply โ process and transform."
Station 4 (Add & Norm): "Stabilize again."
Then the output goes into the next identical factory. GPT-3 stacks 96 of these factories!
Each Transformer block is like a floor in an office building. On every floor, you first have a meeting room (attention: everyone shares info), then a quiet desk (feed-forward: individual deep thinking). The elevator (residual connections) lets you carry information from earlier floors. GPT-3 is a 96-story building!
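Put as code, one block in the classic post-norm layout from the original paper looks roughly like the sketch below, with a residual add plus LayerNorm at stations 2 and 4. This is only a conceptual sketch: the attention and feed_forward arguments are placeholders for the modules we build later in this module, and the full GPT at the end uses the pre-norm variant instead.

import torch.nn as nn

class PostNormBlock(nn.Module):
    """Sketch of one Transformer block: the four stations in order."""
    def __init__(self, d_model, attention, feed_forward):
        super().__init__()
        self.attention = attention            # Station 1: gather context
        self.norm1 = nn.LayerNorm(d_model)    # Station 2: stabilize
        self.feed_forward = feed_forward      # Station 3: think deeply
        self.norm2 = nn.LayerNorm(d_model)    # Station 4: stabilize again

    def forward(self, x):
        x = self.norm1(x + self.attention(x))      # attention, add the input back, normalize
        x = self.norm2(x + self.feed_forward(x))   # FFN, add the input back, normalize
        return x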
These two techniques seem simple, but they're absolutely critical for training deep networks. Without them, stacking 96 layers would be impossible: values would explode or vanish to zero.
LayerNorm takes the outputs of a layer and re-centers them to have mean = 0 and standard deviation = 1. Then it applies learnable scale (γ) and shift (β) parameters.
Imagine 96 rooms in a building. Without a thermostat, room 1 might be 70°F, room 50 might be 500°F, and room 96 might be 10,000°F: things keep getting hotter as you go deeper. LayerNorm is the thermostat that resets each room to a comfortable range. It keeps values stable no matter how deep you go.
import torch
import torch.nn as nn

# LayerNorm in 3 lines
x = torch.randn(2, 5)     # batch of 2, dim 5
norm = nn.LayerNorm(5)    # normalize over last dim
out = norm(x)             # mean ≈ 0, std ≈ 1 per sample

print("Before:", x[0])
print("After: ", out[0])
print("Mean:", out[0].mean().item())                # ≈ 0.0
print("Std: ", out[0].std(unbiased=False).item())   # ≈ 1.0 (LayerNorm uses the biased std)
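To see the learnable scale (γ) and shift (β) at work, here is a minimal hand-rolled check that continues the snippet above. It assumes nn.LayerNorm's defaults (eps = 1e-5, γ initialized to ones, β to zeros), so the manual result should match the built-in one.

# Manual LayerNorm: normalize, then apply learnable scale (gamma) and shift (beta)
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance, like nn.LayerNorm
x_hat = (x - mean) / torch.sqrt(var + eps)          # re-centered and re-scaled
manual = norm.weight * x_hat + norm.bias            # gamma * x_hat + beta
print(torch.allclose(manual, out, atol=1e-5))       # True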
A residual connection simply adds the input of a layer back to its output: output = layer(x) + x. That's it. One line of code, but it's revolutionary.
Imagine you're learning to draw a cat. A residual connection is like having a photocopy of your original drawing at every step. If step 5 accidentally makes the drawing worse, you still have the original to fall back on. The network learns "What should I add to improve this?" rather than "What should the whole answer be?", a much easier question!
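Here's a minimal sketch of that one line in PyTorch. The Residual wrapper and the Linear layer inside it are just illustrative stand-ins for any sublayer.

import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wrap any sublayer so its output becomes sublayer(x) + x."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return self.sublayer(x) + x   # the sublayer only has to learn what to *add*

block = Residual(nn.Linear(768, 768))
x = torch.randn(4, 768)
print(block(x).shape)   # torch.Size([4, 768]): the original input is carried straight through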
After attention has gathered context from other tokens, each token passes through a feed-forward network (FFN) independently. This is where the "thinking" and "knowledge storage" happen.
Attention is the group meeting where everyone shares information. The FFN is the quiet desk work afterward where each person processes what they heard and forms their own conclusions. Every token does this step completely alone: no looking at other tokens.
The FFN is just two linear layers with an activation in between. The key trick: the inner dimension is 4× bigger than the model dimension. It's like taking a deep breath: inhale (expand), process (activate), exhale (compress back).
Inhale (Linear 1): expand from 768 dims to 3,072 dims, creating room to think in a bigger space.
Hold (GELU activation): apply non-linearity, deciding which neurons fire.
Exhale (Linear 2): compress back from 3,072 to 768 dims, distilling the essential information.
This expand-compress pattern lets the network temporarily work in a higher-dimensional space where patterns are easier to separate, then project the insights back down.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or d_model * 4
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: 768 → 3072
            nn.GELU(),                  # activate
            nn.Linear(d_ff, d_model),   # compress: 3072 → 768
        )

    def forward(self, x):
        return self.net(x)

ffn = FeedForward(768)
x = torch.randn(1, 10, 768)   # (batch, seq_len, d_model)
out = ffn(x)                  # same shape: (1, 10, 768)
print(out.shape)              # torch.Size([1, 10, 768])
In a decoder-only model like GPT, there's one critical rule: a token can only attend to tokens that came before it (and itself). It cannot peek at future tokens. This is enforced by a causal mask.
Imagine reading a mystery novel. You can re-read earlier pages to find clues, but you're not allowed to flip ahead. Causal masking is like putting a physical blocker on the book that only lets you see pages you've already read. Each word can only look at words to its left, never to its right.
During training, GPT sees the entire sentence at once (for efficiency). But to learn to predict the next word, it must pretend it hasn't seen the future. The mask fills future positions with -∞ before softmax, which converts to attention weight = 0.
import torch

def create_causal_mask(seq_len):
    """Additive causal mask: 0 = attend, -inf = block."""
    mask = torch.tril(torch.ones(seq_len, seq_len))   # 1s on and below the diagonal
    # Convert 0s to -inf for attention scores
    mask = mask.masked_fill(mask == 0, float('-inf'))
    mask = mask.masked_fill(mask == 1, 0.0)
    return mask

mask = create_causal_mask(5)
print(mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         [0.,   0.,   0., -inf, -inf],
#         [0.,   0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.,   0.]])
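To confirm the -∞ entries really become attention weight 0 after softmax, here's a tiny continuation of the snippet above; the scores tensor is random, purely for illustration.

scores = torch.randn(5, 5)                  # pretend attention scores (query x key)
weights = (scores + mask).softmax(dim=-1)   # add the causal mask, then softmax each row
print(weights)                              # upper-triangular entries are exactly 0.0
print(weights.sum(dim=-1))                  # each row still sums to 1 over visible positions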
Time to put it all together. Below is a complete, working GPT-style decoder-only Transformer in PyTorch. Every component from this module (attention, FFN, LayerNorm, residuals, masking) assembled into one model.
We're assembling all the LEGO pieces from this course into the full robot. Input text → token embeddings + position encoding → N transformer blocks (each: masked attention → norm → FFN → norm) → project to vocabulary → output probability for next word. This is the architecture behind ChatGPT.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257   # GPT-2 vocabulary size
    max_seq_len: int = 1024   # max context window
    d_model: int = 768        # embedding dimension
    n_heads: int = 12         # attention heads
    n_layers: int = 12        # transformer blocks
    d_ff: int = 3072          # FFN inner dimension (4 × 768)
    dropout: float = 0.1
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.qkv = nn.Linear(cfg.d_model, 3 * cfg.d_model)
        self.proj = nn.Linear(cfg.d_model, cfg.d_model)
        self.dropout = nn.Dropout(cfg.dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)           # (3, B, n_heads, T, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask[:T, :T]         # apply the causal mask
        attn = scores.softmax(dim=-1)
        attn = self.dropout(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, cfg.d_ff),
            nn.GELU(),
            nn.Linear(cfg.d_ff, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.attn = MultiHeadAttention(cfg)
        self.ffn = FeedForward(cfg)
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.ln2 = nn.LayerNorm(cfg.d_model)

    def forward(self, x, mask=None):
        # Pre-norm variant (used by GPT-2 and later)
        x = x + self.attn(self.ln1(x), mask)   # residual + attention
        x = x + self.ffn(self.ln2(x))          # residual + FFN
        return x
class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        # Token + positional embeddings
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        self.drop = nn.Dropout(cfg.dropout)
        # Stack of N transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(cfg) for _ in range(cfg.n_layers)
        ])
        # Final layer norm + projection to vocab
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        # Weight tying (as in GPT-2): the output projection shares the token
        # embedding matrix, saving ~38M parameters
        self.head.weight = self.tok_emb.weight
        # Causal mask (registered as buffer, not a parameter)
        mask = torch.triu(
            torch.full((cfg.max_seq_len, cfg.max_seq_len), float('-inf')),
            diagonal=1
        )
        self.register_buffer('mask', mask)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)                                  # (B, T, d_model)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # (T, d_model)
        x = self.drop(tok + pos)                                 # combine embeddings
        for block in self.blocks:
            x = block(x, self.mask)                              # pass through each block
        x = self.ln_f(x)                                         # final layer norm
        logits = self.head(x)                                    # project to vocab size
        return logits                                            # (B, T, vocab_size)
cfg = GPTConfig()
model = GPT(cfg)

n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params:,}")
# → Model parameters: 124,439,808 (~124M: this is GPT-2 Small!)

# Quick test: feed random token IDs
idx = torch.randint(0, cfg.vocab_size, (2, 64))   # batch=2, seq_len=64
logits = model(idx)
print(f"Output shape: {logits.shape}")
# → Output shape: torch.Size([2, 64, 50257])
# For each token position: a score (logit) over 50,257 possible next tokens
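As a quick end-to-end sanity check, here's a minimal greedy decoding sketch (not part of the model above; generate_greedy is just an illustrative helper): feed the sequence in, take the highest-scoring next token, append it, and repeat. With untrained random weights the output IDs are meaningless, which is exactly the point of the next module.

@torch.no_grad()
def generate_greedy(model, idx, max_new_tokens=10):
    """Append the argmax next token, one step at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.cfg.max_seq_len:]                # crop to the context window
        logits = model(idx_cond)                                  # (B, T, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        idx = torch.cat([idx, next_id], dim=1)                    # append and continue
    return idx

prompt = torch.randint(0, cfg.vocab_size, (1, 5))   # a pretend prompt of 5 token IDs
print(generate_greedy(model, prompt))               # 15 token IDs; gibberish until trained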
Adjust the hyperparameters and see how the parameter count changes!
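For example, here's a small sketch that builds a few configurations and prints their sizes; the "wider" and "deeper" variants below are made up purely for illustration.

def count_params(cfg):
    """Build a model from the given config and count its parameters."""
    return sum(p.numel() for p in GPT(cfg).parameters())

for name, c in [
    ("default (GPT-2 Small)", GPTConfig()),
    ("wider",  GPTConfig(d_model=1024, n_heads=16, d_ff=4096)),
    ("deeper", GPTConfig(n_layers=24)),
]:
    print(f"{name:>22}: {count_params(c):,} parameters")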
We have the architecture, but a randomly initialized GPT produces gibberish. In the next module, we'll cover training: how to feed it billions of tokens, compute the loss (cross-entropy), and iteratively update all 124M+ parameters until the model can write coherent text. The training loop is the same 5 steps from Module 3, just scaled to hundreds of GPUs!