From random noise to Shakespeare — how a model learns to write by predicting one token at a time, billions of times.
The entire secret behind GPT, Llama, and every modern LLM is shockingly simple: given some words, predict the next one. That's the only training objective. No human labeling, no special rules — just predict the next token, trillions of times.
It's a fill-in-the-blank game. Someone says "The cat sat on the ___" and you guess "mat." The model plays this game on every sentence in the entire internet, billions of times. After enough practice, it gets really good at guessing — and that "guessing" is what we call intelligence.
Your phone keyboard's autocomplete is a tiny version of this. It predicts the next word based on what you've typed. An LLM is the same idea, but trained on the entire internet with a 124M+ parameter brain instead of a simple lookup table.
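To make the lookup-table idea concrete, here is a toy next-word predictor built from word-pair counts (a sketch with a made-up corpus; an LLM replaces this table with a learned neural network):

from collections import Counter, defaultdict

# Toy "lookup table" autocomplete: count which word follows which.
corpus = "the cat sat on the mat the cat ran".split()
following = defaultdict(Counter)                 # word -> counts of what follows it
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

print(following["the"].most_common(1))           # [('cat', 2)] -- "cat" usually follows "the"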
Before training, we need to turn raw text into batches of input-target pairs. For our mini project we'll use TinyShakespeare (~1MB of Shakespeare plays). Real LLMs train on Common Crawl (petabytes of web text).
Imagine cutting a really long book into flashcards. Each flashcard shows a chunk of text and the answer is the next character/word. We slide a window across the entire book, creating millions of flashcards. Then we shuffle them into batches and feed them to the model.
We pick a context_length (say 64 tokens), then slide across the text: tokens [0:64] → predict [1:65], tokens [1:65] → predict [2:66], and so on. Each window gives us 64 training examples at once thanks to causal masking.
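One detail the dataset code below glosses over: it expects a tokenizer object with an encode method. Since this project works at the character level, a minimal one will do (a sketch; token IDs are just character code points):

class CharTokenizer:
    """Each character's code point is its token ID (TinyShakespeare is plain ASCII)."""
    def encode(self, text):
        return [ord(c) for c in text]
    def decode(self, ids):
        return "".join(chr(i) for i in ids)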
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, text, tokenizer, ctx_len):
        self.tokens = tokenizer.encode(text)
        self.ctx_len = ctx_len

    def __len__(self):
        return len(self.tokens) - self.ctx_len

    def __getitem__(self, i):
        chunk = self.tokens[i : i + self.ctx_len + 1]
        x = torch.tensor(chunk[:-1])  # input
        y = torch.tensor(chunk[1:])   # target (shifted by 1)
        return x, y

# Load TinyShakespeare
with open("tiny_shakespeare.txt") as f:
    text = f.read()

print(f"Dataset: {len(text):,} characters")
# → Dataset: 1,115,394 characters
Your text is one giant pizza. The sliding window is like cutting overlapping slices — each slice (batch) is a manageable piece, but every part of the pizza gets eaten. The DataLoader is the waiter who brings slices to the model in random order so it doesn't get bored eating the same corner repeatedly.
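Putting those pieces together (a usage sketch, assuming the CharTokenizer defined above):

dataset = TextDataset(text, CharTokenizer(), ctx_len=64)
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # random slices, never the same corner twice

x, y = next(iter(loader))
print(x.shape, y.shape)   # torch.Size([32, 64]) torch.Size([32, 64])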
Before training, you set the model's hyperparameters — these define the size and capacity of your Transformer. Bigger values = more parameters = smarter but more expensive.
It's like customizing a robot before building it. How many brain layers? (n_layers) How many eyes to look around? (n_heads) How wide is each thought? (d_model) How many words does it know? (vocab_size) How far back can it remember? (context_length). Bigger numbers = smarter robot but takes longer to build.
from dataclasses import dataclass

@dataclass
class MiniGPTConfig:
    vocab_size: int = 256      # character-level (ASCII)
    context_length: int = 64   # how far the model can "see"
    d_model: int = 128         # embedding dimension
    n_heads: int = 4           # attention heads
    n_layers: int = 4          # transformer blocks
    dropout: float = 0.1
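How big is the default configuration? A rough back-of-the-envelope count (a sketch; count_params is a hypothetical helper, and biases and LayerNorm weights are ignored):

def count_params(cfg: MiniGPTConfig) -> int:
    D, V, T = cfg.d_model, cfg.vocab_size, cfg.context_length
    emb = V * D + T * D          # token + positional embedding tables
    attn = 4 * D * D             # Q, K, V, and output projections per block
    ffn = 2 * D * (4 * D)        # two linear layers with 4x expansion per block
    head = D * V                 # untied output projection
    return emb + cfg.n_layers * (attn + ffn) + head

print(f"~{count_params(MiniGPTConfig()):,} params")  # ~860,160

Scaling d_model is the quickest way to grow the model: the per-block cost is quadratic in it.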
Configuring a model is like a sound engineer adjusting a mixing board before recording. Each slider (d_model, n_layers, n_heads) changes the richness, depth, and detail of the output. Turn everything to max and you get a symphony — but you need a massive studio (GPU) to run it.
Training is a loop of 5 steps repeated thousands of times. Each iteration, the model sees a batch, makes predictions, checks how wrong it was, and adjusts its weights to be less wrong next time.
Imagine practicing spelling quizzes. Each round: (1) Teacher reads a word, (2) You write your guess, (3) Teacher marks it right or wrong, (4) You look at what you got wrong and try to remember, (5) You write down your score. After 1,000 quizzes, you're a great speller! That's training.
import torch.nn as nn

model = MiniGPT(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for batch_idx, (x, y) in enumerate(dataloader):
        # ① Forward pass
        logits = model(x)  # (B, T, vocab_size)

        # ② Compute loss
        loss = criterion(
            logits.view(-1, config.vocab_size),
            y.view(-1)
        )

        # ③ Backward pass
        optimizer.zero_grad()
        loss.backward()

        # ④ Optimizer step
        optimizer.step()

        # ⑤ Log metrics
        if batch_idx % 100 == 0:
            print(f"Epoch {epoch} | Step {batch_idx} | Loss: {loss.item():.4f}")
Training is like being blindfolded on a mountain and trying to reach the valley (lowest loss). Each step: you feel the slope (gradients), take a step downhill (optimizer), and check your altitude (loss). After thousands of steps, you reach the bottom — your model has learned!
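The same feel-the-slope loop works on a one-dimensional "mountain". Here is a toy sketch (unrelated to the model itself) that minimizes f(w) = (w - 3)² with the same rhythm:

import torch

w = torch.tensor(0.0, requires_grad=True)   # start somewhere on the mountain
opt = torch.optim.SGD([w], lr=0.1)
for step in range(50):
    loss = (w - 3) ** 2                     # check your altitude
    opt.zero_grad()
    loss.backward()                         # feel the slope (gradient)
    opt.step()                              # take a step downhill
print(round(w.item(), 3))                   # ≈ 3.0, the valley floor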
Once trained, the model generates text by repeatedly predicting the next token and appending it. But how we pick from the probability distribution matters a lot — that's where temperature and top-k sampling come in.
Temperature is like a creativity dial. Turn it low (0.1) and the model always picks the safest, most obvious word — boring but correct. Turn it to 1.0 and it gets creative. Crank it to 2.0 and it goes crazy, picking random weird words. Top-K means "only consider the K best options" — like only letting yourself choose from the top 10 flavors at an ice cream shop instead of all 500.
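You can see the dial's effect directly. Below, a made-up four-token logit vector is sampled at three temperatures (a sketch; the numbers are illustrative, not from a real model):

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # raw scores for 4 candidate tokens
for temp in (0.1, 1.0, 2.0):
    probs = (logits / temp).softmax(dim=-1)
    print(f"T={temp}: {[round(p, 2) for p in probs.tolist()]}")

# T=0.1 → [1.0, 0.0, 0.0, 0.0]    (always the safest token)
# T=1.0 → [0.61, 0.22, 0.14, 0.03]
# T=2.0 → [0.43, 0.26, 0.21, 0.1] (much flatter, riskier picks)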
@torch.no_grad()
def generate(model, idx, max_new, temperature=1.0, top_k=None):
    for _ in range(max_new):
        context = idx[:, -config.context_length:]
        logits = model(context)[:, -1, :]   # last position
        logits = logits / temperature       # scale by temperature
        if top_k:
            v, _ = logits.topk(top_k)
            logits[logits < v[:, [-1]]] = float('-inf')
        probs = logits.softmax(dim=-1)
        next_tok = torch.multinomial(probs, 1)
        idx = torch.cat([idx, next_tok], dim=-1)
    return idx
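A usage sketch, assuming the trained model and the character-level tokens from earlier (the "JULIET:" prompt is just an example):

prompt = torch.tensor([[ord(c) for c in "JULIET:"]])
out = generate(model, prompt, max_new=100, temperature=0.8, top_k=40)
print("".join(chr(t) for t in out[0]))

Lower temperature plus a moderate top_k keeps the output coherent; raise the temperature if you want riskier phrasing.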
Time to put everything together. Below is a complete, working script that combines all modules: Config → Tokenizer → Model → DataLoader → Training → Generation. Run this and watch your model go from gibberish to Shakespeare-ish in minutes.
We're assembling the whole robot and turning it on. It starts babbling random letters. After training on Shakespeare for a few minutes, it starts writing things that look like Shakespeare — not perfect, but recognizably English with "thee" and "thou" and dramatic speeches. That's learning!
import torch, torch.nn as nn
from dataclasses import dataclass

@dataclass
class Cfg:
    V = 256; T = 64; D = 128; H = 4; L = 4; drop = 0.1

class Block(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(c.D), nn.LayerNorm(c.D)
        self.attn = nn.MultiheadAttention(c.D, c.H, dropout=c.drop, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(c.D, c.D * 4), nn.GELU(), nn.Linear(c.D * 4, c.D))

    def forward(self, x, mask):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, is_causal=False)[0]  # pre-norm attention + residual
        return x + self.ffn(self.ln2(x))                                # pre-norm FFN + residual

class MiniGPT(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.c = c
        self.tok = nn.Embedding(c.V, c.D)   # token embeddings
        self.pos = nn.Embedding(c.T, c.D)   # learned positional embeddings
        self.blocks = nn.ModuleList([Block(c) for _ in range(c.L)])
        self.ln = nn.LayerNorm(c.D)
        self.head = nn.Linear(c.D, c.V, bias=False)
        # Additive causal mask: -inf above the diagonal blocks attention to future tokens
        mask = torch.triu(torch.full((c.T, c.T), float('-inf')), 1)
        self.register_buffer('mask', mask)

    def forward(self, x):
        B, T = x.shape
        x = self.tok(x) + self.pos(torch.arange(T, device=x.device))
        for b in self.blocks:
            x = b(x, self.mask[:T, :T])
        return self.head(self.ln(x))

# --- Load data & train ---
text = open("tiny_shakespeare.txt").read()
data = torch.tensor([ord(c) for c in text], dtype=torch.long)  # char-level: code point = token ID
cfg = Cfg()
model = MiniGPT(cfg)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(3000):
    ix = torch.randint(len(data) - cfg.T - 1, (32,))            # 32 random window starts
    x = torch.stack([data[i : i + cfg.T] for i in ix])
    y = torch.stack([data[i + 1 : i + cfg.T + 1] for i in ix])  # targets shifted by one
    loss = nn.functional.cross_entropy(model(x).view(-1, cfg.V), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"Step {step}: loss={loss.item():.3f}")

# --- Generate ---
model.eval()  # disable dropout while sampling
prompt = torch.tensor([[ord(c) for c in "ROMEO:"]])
with torch.no_grad():
    for _ in range(200):
        logits = model(prompt[:, -cfg.T:])[:, -1] / 0.8         # temperature 0.8
        prompt = torch.cat([prompt, torch.multinomial(logits.softmax(-1), 1)], 1)
print("".join(chr(t) for t in prompt[0]))
ROMEO:
What is the matter with the world, that thou
Art so bestow'd upon thy gentle heart?
I prithee, tell me, what dost thou depart
From all the grace of heaven's sweet light?
Your Mini-GPT is like a parrot that listened to every Shakespeare play on loop. At first it just squawks random letters. After training, it speaks in iambic pentameter with "thee" and "thou" — it sounds like Shakespeare even though it doesn't truly understand the meaning. That's the power (and limitation) of next-token prediction.
You built a language model from scratch. The same architecture (just bigger) powers ChatGPT. You understand: tokenization, embeddings, attention, transformers, training loops, and generation. The next step? Fine-tuning — teaching a pre-trained model to follow instructions.