Everything you've learned comes together. You'll build a character-level GPT, train it on Shakespeare, and serve it with a web UI, all from scratch.
Here's what we're building: a character-level GPT trained on Shakespeare that generates new text one character at a time. It's a miniature version of how ChatGPT works: same core architecture, same training objective, same generation process.
Imagine you read every Shakespeare play a thousand times. Eventually you'd be able to write something that sounds like Shakespeare, right? That's what our model does: it reads Shakespeare so many times that it learns the patterns and can write new stuff in the same style.
We'll use TinyShakespeare, about 1 MB of Shakespeare's plays. It's small enough to train on a laptop but large enough to produce surprisingly good results.
A tokenizer is like giving every letter its own secret number. "A" = 0, "B" = 1, "C" = 2, and so on. The computer only understands numbers, so we convert every character in Shakespeare into a list of numbers, train on those numbers, and convert the output numbers back to letters.
import torch
import urllib.request

# Download Tiny Shakespeare (~1 MB of text)
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "shakespeare.txt")

# Load and inspect
text = open("shakespeare.txt").read()
print(f"Total characters: {len(text):,}")  # ~1,115,394

# Build character-level vocabulary
chars = sorted(set(text))
vocab_size = len(chars)  # 65 unique characters
print(f"Vocab: {''.join(chars)}")

# Encode / decode functions
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)

# Convert entire text to a tensor and split 90/10 into train / validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
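As a quick check that the mapping works, encoding a string and then decoding it should reproduce the original exactly (the specific integer IDs depend on where each character lands in the sorted vocabulary):

# Round-trip sanity check
sample = "To be, or not to be"
ids = encode(sample)
print(ids[:5])                 # a few small integers; exact values depend on the vocab
print(decode(ids) == sample)   # True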
Now we assemble everything from the course: token embeddings + positional encoding + N transformer blocks + final linear head. This is essentially the same architecture as GPT-2, just much smaller.
Each module we learned is like a LEGO brick. Embeddings snap onto positional encoding. Transformer blocks (attention + feed-forward) stack on top of each other. The output head snaps on at the end. Put them all together and you get a GPT model!
# Reasonable defaults for a laptop-trainable model
config = {
    "vocab_size": 65,   # characters in Shakespeare
    "n_embd": 384,      # embedding dimension
    "n_head": 6,        # attention heads
    "n_layer": 6,       # transformer blocks
    "block_size": 256,  # context window
    "dropout": 0.2,     # regularization
}
# Total params: ~10.7 million (GPT-4 is estimated to have over a trillion)
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        w = q @ k.transpose(-2, -1) * k.size(-1)**-0.5            # scale by sqrt(head_size)
        w = w.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask: no peeking ahead
        w = self.dropout(F.softmax(w, dim=-1))
        return w @ self.value(x)

class MultiHead(nn.Module):
    """Several attention heads in parallel, concatenated and projected."""
    def __init__(self, n_head, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(head_size, n_embd, block_size, dropout) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.proj(torch.cat([h(x) for h in self.heads], dim=-1)))

class FeedForward(nn.Module):
    """Position-wise MLP with a 4x expansion."""
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: attention + feed-forward, each with a pre-norm residual."""
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        self.sa = MultiHead(n_head, n_embd // n_head, n_embd, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    """Token embeddings + positional embeddings + N blocks + final linear head."""
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.tok_emb(idx)                                 # (B, T, n_embd)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # (T, n_embd), broadcast over batch
        x = self.ln_f(self.blocks(tok + pos))
        logits = self.head(x)                                   # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
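Before training, it's worth a quick sanity check that the pieces wire together and that the size matches the estimate in the config comment. A minimal sketch (the exact count depends on the config above):

# Instantiate the model and count its parameters
m = MiniGPT(**config)
n_params = sum(p.numel() for p in m.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # roughly 10.7M with the defaults above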
Training is the magic moment: we feed Shakespeare in, the model predicts the next character, checks if it was right, adjusts its weights, and repeats thousands of times until it "gets" Shakespeare.
It's like practicing spelling. At first you get almost every letter wrong (loss = 4.0, basically random). But after thousands of tries, you start getting most letters right (loss = 1.5). The "loss" number tells you how confused the model still is: lower = smarter!
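Where does that starting loss of about 4 come from? With 65 possible characters, a model guessing uniformly at random gives each one probability 1/65, and the cross-entropy of that guess is ln(65):

import math

# Cross-entropy of a uniform guess over 65 characters
print(math.log(65))  # ≈ 4.17, which is why training starts near 4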
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MiniGPT(**config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch(split, block_size=256, batch_size=64):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix]).to(device)
    y = torch.stack([d[i+1:i+block_size+1] for i in ix]).to(device)  # targets: same sequence shifted by one
    return x, y

# Training loop: ~5000 steps, roughly 15 minutes on a GPU
for step in range(5000):
    xb, yb = get_batch("train")
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"Step {step:>5d} | Loss: {loss.item():.4f}")

# Save the trained model
torch.save(model.state_dict(), "mini_gpt.pt")
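The loop above only prints the training loss. To check the model isn't just memorizing, you can also measure loss on the held-out split from time to time. Here's a minimal sketch that reuses get_batch (the eval_iters value of 50 is an arbitrary choice):

@torch.no_grad()
def estimate_val_loss(model, eval_iters=50):
    model.eval()
    losses = []
    for _ in range(eval_iters):
        xb, yb = get_batch("val")
        _, loss = model(xb, yb)
        losses.append(loss.item())
    model.train()
    return sum(losses) / len(losses)

print(f"Validation loss: {estimate_val_loss(model):.4f}")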
Now the fun part! We give the model a starting prompt and let it predict one character at a time, feeding each prediction back in as input. The temperature parameter controls how creative vs. safe the output is.
Temperature is like a "creativity dial." Turn it low (0.3) and the model strongly favors the safest, most obvious next letter (boring but correct). Turn it high (1.5) and it takes wild guesses (creative but sometimes nonsense). The sweet spot is usually around 0.8.
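To see the dial in action, here's a toy example (illustrative numbers only) of how dividing the logits by the temperature reshapes the probabilities before sampling:

import torch
import torch.nn.functional as F

# Three made-up logits standing in for three candidate next letters
logits = torch.tensor([2.0, 1.0, 0.1])
for temp in (0.3, 0.8, 1.5):
    print(temp, F.softmax(logits / temp, dim=-1).tolist())
# Low temperature sharpens the distribution (the safest pick dominates);
# high temperature flattens it (more randomness in the choices).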
@torch.no_grad()
def generate(model, start="\n", max_tokens=500, temperature=0.8):
    idx = torch.tensor([encode(start)], device=device)
    for _ in range(max_tokens):
        context = idx[:, -model.block_size:]               # crop to the context window
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature            # last position only, scaled by temperature
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next character
        idx = torch.cat([idx, next_id], dim=1)             # append and feed back in
    return decode(idx[0].tolist())
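With the trained model on hand, you can compare temperatures side by side. For example (the "ROMEO:" prompt is just an illustration):

# Generate from the same prompt at three different temperatures
for temp in (0.3, 0.8, 1.5):
    print(f"--- temperature {temp} ---")
    print(generate(model, start="ROMEO:", max_tokens=200, temperature=temp))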
Let's wrap our model in a FastAPI backend and build a tiny HTML chat interface. Type a prompt, hit Send, and your Mini-GPT responds in Shakespearean English!
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from pydantic import BaseModel

app = FastAPI()

# Load model at startup (assumes mini_gpt.pt exists)
model = MiniGPT(**config)
model.load_state_dict(torch.load("mini_gpt.pt", map_location="cpu"))
model.eval()

class Prompt(BaseModel):
    text: str
    temperature: float = 0.8

@app.post("/generate")
def gen(p: Prompt):
    output = generate(model, start=p.text, max_tokens=200, temperature=p.temperature)
    return {"response": output}

@app.get("/", response_class=HTMLResponse)
def home():
    return open("chat.html").read()

# Run: uvicorn server:app --reload
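Once the server is running, you can hit the endpoint without the web page at all. A minimal client sketch using only the standard library (assumes uvicorn's default address, localhost:8000):

import json
import urllib.request

# POST a prompt to the /generate endpoint and print the model's reply
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps({"text": "ROMEO:", "temperature": 0.8}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])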
Type anything and get a Shakespearean response!
<!DOCTYPE html>
<html>
<body>
  <div id="chat"></div>
  <input id="inp" placeholder="Type here...">
  <button onclick="send()">Send</button>
  <script>
    // Minimal helper (assumed implementation): append a message to the chat log
    function addMsg(text, who) {
      const div = document.createElement("div");
      div.className = who;
      div.textContent = text;
      document.getElementById("chat").appendChild(div);
    }

    async function send() {
      const text = document.getElementById("inp").value;
      addMsg(text, "user");
      const res = await fetch("/generate", {
        method: "POST",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify({text, temperature: 0.8})
      });
      const data = await res.json();
      addMsg(data.response, "bot");
    }
  </script>
</body>
</html>
You went from "What's an LLM?" to building a working text generator from scratch.
You now understand the entire LLM pipeline, the same one behind ChatGPT, Claude, Gemini, and Llama. Scale up the data, scale up the parameters, add RLHF, and you're building frontier AI. The biggest difference between your Mini-GPT and GPT-4 is scale. You've got the fundamentals. Now go build something amazing!
You've built a Mini-GPT from scratch! Let's make sure the key concepts stuck. Answer all 5 questions.