Module 6 — Training

🏋️ Training LLMs

From random noise to Shakespeare — how a model learns to write by predicting one token at a time, billions of times.

Part 1: How LLMs Learn — Next-Token Prediction

The entire secret behind GPT, Llama, and every modern LLM is shockingly simple: given some words, predict the next one. That's the only training objective. No human labeling, no special rules — just predict the next token, trillions of times.

👶 Like You're 5

It's a fill-in-the-blank game. Someone says "The cat sat on the ___" and you guess "mat." The model plays this game on every sentence in the entire internet, billions of times. After enough practice, it gets really good at guessing — and that "guessing" is what we call intelligence.

🎯 Next-Token Prediction — The Core Objective

Diagram: input tokens ("The cat sat on the ???") flow through the 🧠 Transformer model, which predicts the next token: "mat" (P = 0.73) ✓

📖 The Autocomplete Analogy

Your phone keyboard's autocomplete is a tiny version of this. It predicts the next word based on what you've typed. An LLM is the same idea, but trained on the entire internet with a 124M+ parameter brain instead of a simple lookup table.

💡 Why This Works So Well

  • Self-supervised — no human labels needed, just raw text
  • To predict well, the model must learn grammar, facts, reasoning, even code
  • The internet has trillions of tokens — essentially unlimited free training data
  • Cross-entropy loss measures how surprised the model is by the real next token (see the sketch below)
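
Here is a minimal sketch of that "surprise" measure, using PyTorch's cross-entropy on a made-up 5-token vocabulary (the logits and target index are illustrative, not from the module's model):

Python / Cross-Entropy Sketch
import torch
import torch.nn.functional as F

# Toy vocabulary of 5 tokens; pretend index 2 is "mat", the true next token.
target = torch.tensor([2])

confident = torch.tensor([[1.0, 0.5, 3.0, 0.2, -1.0]])   # model leans toward "mat"
clueless  = torch.zeros(1, 5)                             # uniform guess over 5 tokens

print(F.cross_entropy(confident, target).item())  # ≈ 0.26: low loss, not very surprised
print(F.cross_entropy(clueless, target).item())   # ln(5) ≈ 1.61: maximally surprised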

Part 2: Data Preparation

Before training, we need to turn raw text into batches of input-target pairs. For our mini project we'll use TinyShakespeare (~1MB of Shakespeare plays). Real LLMs train on Common Crawl (petabytes of web text).

👶 Like You're 5

Imagine cutting a really long book into flashcards. Each flashcard shows a chunk of text and the answer is the next character/word. We slide a window across the entire book, creating millions of flashcards. Then we shuffle them into batches and feed them to the model.

Sliding Window Over Text

We pick a context_length (say 64 tokens), then slide across the text: tokens [0:64] → predict [1:65], tokens [1:65] → predict [2:66], and so on. Each window gives us 64 training examples at once thanks to causal masking.

📏 Sliding Window — Creating Training Pairs

Full text: "First Citizen: Before we proceed any further hear me speak..."
With context_length = 8:
  Input:  [F, i, r, s, t, _, C, i]
  Target: [i, r, s, t, _, C, i, t]
The target is just the input shifted by 1 position → millions of pairs from one book!
Python / Data Prep
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, text, tokenizer, ctx_len):
        self.tokens = tokenizer.encode(text)
        self.ctx_len = ctx_len

    def __len__(self):
        return len(self.tokens) - self.ctx_len

    def __getitem__(self, i):
        chunk = self.tokens[i : i + self.ctx_len + 1]
        x = torch.tensor(chunk[:-1])   # input
        y = torch.tensor(chunk[1:])    # target (shifted by 1)
        return x, y

# Load TinyShakespeare
with open("tiny_shakespeare.txt") as f:
    text = f.read()
print(f"Dataset: {len(text):,} characters")
# → Dataset: 1,115,394 characters

🍕 The Pizza Slice Analogy

Your text is one giant pizza. The sliding window is like cutting overlapping slices — each slice (batch) is a manageable piece, but every part of the pizza gets eaten. The DataLoader is the waiter who brings slices to the model in random order so it doesn't get bored eating the same corner repeatedly.
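
Here is a minimal sketch of that "waiter", wrapping the TextDataset above in the DataLoader we imported (the CharTokenizer is a hypothetical stand-in; any tokenizer with an encode() method works):

Python / Batching
class CharTokenizer:
    """Toy character-level tokenizer: one code point per character."""
    def encode(self, s):
        return [ord(ch) for ch in s]

dataset = TextDataset(text, CharTokenizer(), ctx_len=64)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)  # shuffle = slices in random order

x, y = next(iter(dataloader))
print(x.shape, y.shape)  # → torch.Size([32, 64]) torch.Size([32, 64])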

Part 3: Model Configuration

Before training, you set the model's hyperparameters — these define the size and capacity of your Transformer. Bigger values = more parameters = smarter but more expensive.

👶 Like You're 5

It's like customizing a robot before building it. How many brain layers? (n_layers) How many eyes to look around? (n_heads) How wide is each thought? (d_model) How many words does it know? (vocab_size) How far back can it remember? (context_length). Bigger numbers = smarter robot but takes longer to build.

Interactive: Configure Your Model

Drag the sliders to set hyperparameters and watch the parameter count update live!

Python / Config
from dataclasses import dataclass

@dataclass
class MiniGPTConfig:
    vocab_size:     int = 256     # character-level (ASCII)
    context_length: int = 64      # how far the model can "see"
    d_model:        int = 128     # embedding dimension
    n_heads:        int = 4       # attention heads
    n_layers:       int = 4       # transformer blocks
    dropout:        float = 0.1
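
To see why the parameter count moves as you drag the sliders, here is a rough back-of-the-envelope estimate for this config (a sketch that counts embeddings, attention projections, feed-forward layers, and the output head, ignoring biases and LayerNorms):

Python / Parameter Estimate
def estimate_params(cfg: MiniGPTConfig) -> int:
    d, V, T = cfg.d_model, cfg.vocab_size, cfg.context_length
    embeddings = V * d + T * d            # token + positional embeddings
    attention  = 4 * d * d                # Q, K, V and output projections per block
    ffn        = 2 * (d * 4 * d)          # two linear layers with 4x expansion
    lm_head    = d * V                    # final projection back to the vocabulary
    return embeddings + cfg.n_layers * (attention + ffn) + lm_head

print(f"{estimate_params(MiniGPTConfig()):,}")  # → 860,160 (≈ 0.9M) for the defaults above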

🎛️ The Mixing Board Analogy

Configuring a model is like a sound engineer adjusting a mixing board before recording. Each slider (d_model, n_layers, n_heads) changes the richness, depth, and detail of the output. Turn everything to max and you get a symphony — but you need a massive studio (GPU) to run it.

Part 4: The Training Loop

Training is a loop of 5 steps repeated thousands of times. Each iteration, the model sees a batch, makes predictions, checks how wrong it was, and adjusts its weights to be less wrong next time.

👶 Like You're 5

Imagine practicing spelling quizzes. Each round: (1) Teacher reads a word, (2) You write your guess, (3) Teacher marks it right or wrong, (4) You look at what you got wrong and try to remember, (5) You write down your score. After 1,000 quizzes, you're a great speller! That's training.

🔄 The 5-Step Training Loop

  ① Forward pass: feed the batch through the model
  ② Compute loss: cross-entropy between predictions and targets
  ③ Backward pass: compute the gradients
  ④ Optimizer step: update the weights (AdamW)
  ⑤ Log metrics: print the loss
  🔁 Repeat for thousands of steps

Interactive: Train for 1 Epoch

Click the button to simulate training — watch the loss decrease!

Python / Training Loop
import torch.nn as nn

# `config` is the MiniGPTConfig from Part 3; `dataloader` is the DataLoader
# wrapping the TextDataset from Part 2; MiniGPT itself is assembled in Part 6.
model = MiniGPT(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for batch_idx, (x, y) in enumerate(dataloader):
        # ① Forward pass
        logits = model(x)                  # (B, T, vocab_size)

        # ② Compute loss
        loss = criterion(
            logits.view(-1, config.vocab_size),
            y.view(-1)
        )

        # ③ Backward pass
        optimizer.zero_grad()
        loss.backward()

        # ④ Optimizer step
        optimizer.step()

        # ⑤ Log metrics
        if batch_idx % 100 == 0:
            print(f"Epoch {epoch} | Step {batch_idx} | Loss: {loss.item():.4f}")

🏔️ The Mountain Descent

Training is like being blindfolded on a mountain and trying to reach the valley (lowest loss). Each step: you feel the slope (gradients), take a step downhill (optimizer), and check your altitude (loss). After thousands of steps, you reach the bottom — your model has learned!

Part 5: Text Generation

Once trained, the model generates text by repeatedly predicting the next token and appending it. But how we pick from the probability distribution matters a lot — that's where temperature and top-k sampling come in.

👶 Like You're 5

Temperature is like a creativity dial. Turn it low (0.1) and the model always picks the safest, most obvious word — boring but correct. Turn it to 1.0 and it gets creative. Crank it to 2.0 and it goes crazy, picking random weird words. Top-K means "only consider the K best options" — like only letting yourself choose from the top 10 flavors at an ice cream shop instead of all 500.

🌡️ Temperature Effect on Probabilities

  • T = 0.1 (Safe): almost always "mat"
  • T = 1.0 (Balanced): usually "mat", sometimes "rug" or "floor"
  • T = 2.0 (Wild): "mat", "rug", and "floor" all become plausible, so it could be anything!
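
A small numerical sketch of that effect, applying temperature-scaled softmax to made-up logits for "mat", "rug", and "floor":

Python / Temperature Demo
import torch

logits = torch.tensor([2.0, 0.5, -0.5])   # toy scores for "mat", "rug", "floor"
for T in (0.1, 1.0, 2.0):
    mat, rug, floor = (logits / T).softmax(dim=-1).tolist()
    print(f"T={T}: mat={mat:.2f}  rug={rug:.2f}  floor={floor:.2f}")
# T=0.1 → mat≈1.00 | T=1.0 → mat≈0.77 | T=2.0 → mat≈0.57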

Interactive: Temperature Slider

Drag to see how temperature changes the output style:

Python / Generate
@torch.no_grad()
def generate(model, idx, max_new, temperature=1.0, top_k=None):
    for _ in range(max_new):
        context = idx[:, -config.context_length:]
        logits = model(context)[:, -1, :]   # last position
        logits = logits / temperature            # scale by temperature
        if top_k:
            v, _ = logits.topk(top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
        probs = logits.softmax(dim=-1)
        next_tok = torch.multinomial(probs, 1)
        idx = torch.cat([idx, next_tok], dim=-1)
    return idx
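
A usage example for the function above, assuming the character-level setup from this module (ord/chr as the tokenizer and a model that has already been trained):

Python / Generation Usage
prompt = torch.tensor([[ord(c) for c in "ROMEO:"]])           # encode the prompt as character codes
out = generate(model, prompt, max_new=200, temperature=0.8, top_k=40)
print("".join(chr(t) for t in out[0].tolist()))               # decode back to text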

💡 Generation Strategies

  • Greedy (T→0): always pick highest probability — deterministic but repetitive
  • Sampling (T=1.0): sample from the full distribution — natural variety
  • Top-K: only sample from the K most likely tokens — avoids nonsense
  • Top-P (Nucleus): sample from the smallest set of tokens whose cumulative probability reaches P, an adaptive alternative to top-k (sketched below)
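
Top-P isn't implemented in generate() above; here is a minimal sketch of the filtering step it would add (the name top_p_filter is illustrative, not part of the module's code):

Python / Top-P Sketch
import torch

def top_p_filter(logits, p=0.9):
    """Set every logit outside the smallest set with cumulative probability ≥ p to -inf."""
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
    cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    remove = cum_probs > p                        # tokens past the threshold...
    remove[..., 1:] = remove[..., :-1].clone()    # ...shifted so the crossing token is kept
    remove[..., 0] = False                        # always keep the most likely token
    remove = remove.scatter(-1, sorted_idx, remove)   # map back to original token order
    return logits.masked_fill(remove, float('-inf'))

# Inside generate(), this would go right after the temperature scaling:
# logits = top_p_filter(logits, p=0.9)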

Part 6: PROJECT — Build & Train Mini-GPT

Time to put everything together. Below is a complete, working script that combines all modules: Config → Tokenizer → Model → DataLoader → Training → Generation. Run this and watch your model go from gibberish to Shakespeare-ish in minutes.

👶 Like You're 5

We're assembling the whole robot and turning it on. It starts babbling random letters. After training on Shakespeare for a few minutes, it starts writing things that look like Shakespeare — not perfect, but recognizably English with "thee" and "thou" and dramatic speeches. That's learning!

Python / Complete Mini-GPT (~60 lines)
import torch, torch.nn as nn
from dataclasses import dataclass

@dataclass
class Cfg:
    V: int = 256; T: int = 64; D: int = 128; H: int = 4; L: int = 4; drop: float = 0.1

class Block(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(c.D), nn.LayerNorm(c.D)
        self.attn = nn.MultiheadAttention(c.D, c.H, dropout=c.drop, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(c.D, c.D*4), nn.GELU(), nn.Linear(c.D*4, c.D))
    def forward(self, x, mask):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, is_causal=False)[0]
        return x + self.ffn(self.ln2(x))

class MiniGPT(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.c = c
        self.tok = nn.Embedding(c.V, c.D)
        self.pos = nn.Embedding(c.T, c.D)
        self.blocks = nn.ModuleList([Block(c) for _ in range(c.L)])
        self.ln = nn.LayerNorm(c.D)
        self.head = nn.Linear(c.D, c.V, bias=False)
        mask = torch.triu(torch.full((c.T, c.T), float('-inf')), 1)
        self.register_buffer('mask', mask)
    def forward(self, x):
        B, T = x.shape
        x = self.tok(x) + self.pos(torch.arange(T, device=x.device))
        for b in self.blocks: x = b(x, self.mask[:T,:T])
        return self.head(self.ln(x))

# --- Load data & train ---
text = open("tiny_shakespeare.txt").read()
data = torch.tensor([ord(c) for c in text], dtype=torch.long)
cfg = Cfg()
model = MiniGPT(cfg)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(3000):
    ix = torch.randint(len(data)-cfg.T-1, (32,))
    x = torch.stack([data[i:i+cfg.T] for i in ix])
    y = torch.stack([data[i+1:i+cfg.T+1] for i in ix])
    loss = nn.functional.cross_entropy(model(x).view(-1,cfg.V), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0: print(f"Step {step}: loss={loss.item():.3f}")

# --- Generate ---
prompt = torch.tensor([[ord(c) for c in "ROMEO:"]])
with torch.no_grad():
    for _ in range(200):
        logits = model(prompt[:, -cfg.T:])[:, -1] / 0.8
        prompt = torch.cat([prompt, torch.multinomial(logits.softmax(-1),1)], 1)
print("".join(chr(t) for t in prompt[0]))

📊 Expected Output After Training

  • Step 0 — Loss: ~5.5 (random guessing across 256 chars = ln(256) ≈ 5.5)
  • Step 1000 — Loss: ~2.0 (learning common letters and spaces)
  • Step 3000 — Loss: ~1.4 (forming words and basic structure)
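
That starting value is just the uniform-guessing baseline, which you can verify directly:

Python / Baseline Loss Check
import math
print(math.log(256))  # ≈ 5.545, the loss of a model that guesses uniformly over 256 characters
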
Sample Generated Output
ROMEO:
What is the matter with the world, that thou
Art so bestow'd upon thy gentle heart?
I prithee, tell me, what dost thou depart
From all the grace of heaven's sweet light?

🎭 The Shakespeare Parrot

Your Mini-GPT is like a parrot that listened to every Shakespeare play on loop. At first it just squawks random letters. After training, it speaks in iambic pentameter with "thee" and "thou" — it sounds like Shakespeare even though it doesn't truly understand the meaning. That's the power (and limitation) of next-token prediction.

🎓 What You've Accomplished

You built a language model from scratch. The same architecture (just bigger) powers ChatGPT. You understand: tokenization, embeddings, attention, transformers, training loops, and generation. The next step? Fine-tuning — teaching a pre-trained model to follow instructions.

Quiz — Test Your Knowledge

Question 1: What is the core training objective of models like GPT?

Question 2: What does cross-entropy loss measure during LLM training?

Question 3: What happens if the learning rate is set too high during training?

Question 4: What is a key tradeoff of using very large batch sizes?

Question 5: What does a low temperature (e.g. 0.1) do during text generation?

Question 6: What is the purpose of top-k sampling?
