Module 9 β€” Capstone

πŸ† Capstone: Build Your Own Mini-ChatGPT

Everything you've learned comes together. You'll build a character-level GPT, train it on Shakespeare, and serve it with a web UI β€” from scratch.

Part 1: Project Plan & Architecture

Here's what we're building: a character-level GPT trained on Shakespeare that generates new text one character at a time. It's a miniature version of exactly how ChatGPT works β€” same architecture, same training objective, same generation process.

πŸ‘Ά Like You're 5

Imagine you read every Shakespeare play a thousand times. Eventually you'd be able to write something that sounds like Shakespeare, right? That's what our model does β€” it reads Shakespeare so many times that it learns the patterns and can write new stuff in the same style.

πŸ—ΊοΈ Full Pipeline β€” From Raw Text to Web Chat

End-to-end Mini-GPT pipeline: πŸ“„ raw text (Shakespeare, ~1 MB) β†’ πŸ”€ tokenizer (char β†’ integer) β†’ 🧠 GPT model (6 layers, 6 heads, ~10M params) β†’ πŸ‹οΈ training (loss 4.0 β†’ 1.5) β†’ ✨ generation (temperature control) β†’ 🌐 web chat UI (FastAPI + HTML/JS). Tech stack: Python, PyTorch, FastAPI, HTML/JS.

πŸ† What You'll Walk Away With

  • A working text generator trained on Shakespeare
  • A complete GPT implementation β€” the same architecture behind ChatGPT
  • A web chat interface where you can talk to your model
  • Deep understanding of every piece of the LLM pipeline

Part 2: Step 1 β€” Dataset & Tokenizer

We'll use TinyShakespeare β€” about 1 MB of Shakespeare's plays concatenated into a single text file. It's small enough to train on a laptop but large enough to produce surprisingly good results.

πŸ‘Ά Like You're 5

A tokenizer is like giving every letter its own secret number. "A" = 0, "B" = 1, "C" = 2 … The computer only understands numbers, so we convert every character in Shakespeare into a list of numbers, train on those numbers, and convert the output numbers back to letters.

Download & Build the Vocab

Python
import torch

# Download Tiny Shakespeare (~1MB of text)
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "shakespeare.txt")

# Load and inspect
text = open("shakespeare.txt").read()
print(f"Total characters: {len(text):,}")  # ~1,115,394

# Build character-level vocabulary
chars = sorted(set(text))
vocab_size = len(chars)  # 65 unique characters
print(f"Vocab: {''.join(chars)}")

# Encode / decode functions
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)

# Convert entire text to tensor
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

πŸ”€ Character Tokenization in Action

encode("Hello") β†’ [20, 43, 50, 50, 53], and decode([20, 43, 50, 50, 53]) β†’ "Hello". Each character maps to exactly one integer (H β†’ 20, e β†’ 43, l β†’ 50, l β†’ 50, o β†’ 53), so the round trip is fully reversible: decode(encode(x)) == x.
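
Training pairs come straight from this tensor: the input is a chunk of characters and the target is the same chunk shifted one position to the right, so every position learns to predict the next character. A minimal sketch using the train_data tensor built above (the tiny block_size here is just for illustration):

Python
# Illustrative only: one (input, target) pair for next-character prediction
block_size = 8                       # tiny context, just for this demo
x = train_data[:block_size]          # first 8 characters of the text ("First Ci")
y = train_data[1:block_size + 1]     # same chunk shifted by one ("irst Cit")

for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"{decode(context.tolist())!r} -> {decode([target.item()])!r}")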

Part 3: Step 2 β€” Build the GPT Model

Now we assemble everything from the course: token embeddings + positional encoding + N transformer blocks + final linear head. This is essentially the GPT-2 architecture β€” just smaller, and with a character-level vocabulary instead of BPE tokens.

🧱 The LEGO Analogy

Each module we learned is like a LEGO brick. Embeddings snap onto positional encoding. Transformer blocks (attention + feed-forward) stack on top of each other. The output head snaps on at the end. Put them all together and you get a GPT model!

Model Configuration

Python
# Reasonable defaults for a laptop-trainable model
config = {
    "vocab_size": 65,        # characters in Shakespeare
    "n_embd": 384,            # embedding dimension
    "n_head": 6,              # attention heads
    "n_layer": 6,             # transformer blocks
    "block_size": 256,        # context window
    "dropout": 0.2,           # regularization
}
# Total params: ~10.8 million (GPT-4 is rumored to have over a trillion!)

Complete GPT Class

Python
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        w = q @ k.transpose(-2, -1) * k.size(-1)**-0.5  # scale by sqrt(head_size), not n_embd
        w = w.masked_fill(self.tril[:T,:T] == 0, float("-inf"))
        w = self.dropout(F.softmax(w, dim=-1))
        return w @ self.value(x)

class MultiHead(nn.Module):
    def __init__(self, n_head, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size, n_embd, block_size, dropout) for _ in range(n_head)])
        self.proj  = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        return self.dropout(self.proj(torch.cat([h(x) for h in self.heads], dim=-1)))

class FeedForward(nn.Module):
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, 4*n_embd), nn.ReLU(), nn.Linear(4*n_embd, n_embd), nn.Dropout(dropout))
    def forward(self, x): return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        self.sa   = MultiHead(n_head, n_embd // n_head, n_embd, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1  = nn.LayerNorm(n_embd)
        self.ln2  = nn.LayerNorm(n_embd)
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks  = nn.Sequential(*[Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)])
        self.ln_f    = nn.LayerNorm(n_embd)
        self.head    = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.ln_f(self.blocks(tok + pos))
        logits = self.head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)) if targets is not None else None
        return logits, loss

πŸ’‘ Architecture Breakdown

  • Head β€” one attention head (Query, Key, Value + causal mask)
  • MultiHead β€” 6 heads in parallel, then project back
  • FeedForward β€” expand 4Γ—, ReLU, project back (the "thinking" step)
  • Block β€” LayerNorm β†’ Attention β†’ LayerNorm β†’ FFN (with residual connections)
  • MiniGPT β€” Token embed + position embed β†’ 6 Blocks β†’ output head

Part 4: Step 3 β€” Train the Model

Training is the magic moment: we feed Shakespeare in, the model predicts the next character, checks if it was right, adjusts its weights, and repeats β€” thousands of times until it "gets" Shakespeare.

πŸ‘Ά Like You're 5

It's like practicing spelling. At first you get almost every letter wrong (loss = 4.0, basically random). But after thousands of tries, you start getting most letters right (loss = 1.5). The "loss" number tells you how confused the model still is β€” lower = smarter!
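
The starting loss of ~4.17 isn't arbitrary: with 65 characters in the vocabulary, a model that guesses uniformly at random has a cross-entropy of ln(65) β‰ˆ 4.17. Anything below that means the model has learned something. A quick check:

Python
import math

# Cross-entropy of a uniform guess over the 65-character vocabulary
print(math.log(65))   # β‰ˆ 4.174 β€” matches the loss at step 0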

Training Script

Python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MiniGPT(**config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch(split, block_size=256, batch_size=64):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix]).to(device)
    y = torch.stack([d[i+1:i+block_size+1] for i in ix]).to(device)
    return x, y

# Training loop β€” ~5000 steps, ~15min on GPU
for step in range(5000):
    xb, yb = get_batch("train")
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"Step {step:>5d} | Loss: {loss.item():.4f}")

# Save the trained model
torch.save(model.state_dict(), "mini_gpt.pt")
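
Training loss alone can be misleading (the model might just be memorizing), so it's worth checking the loss on the held-out val_data as well. A minimal sketch using a helper (estimate_val_loss, not part of the script above) that averages over a few random validation batches:

Python
@torch.no_grad()
def estimate_val_loss(model, eval_iters=50):
    """Average loss over a few random batches from val_data."""
    model.eval()
    losses = []
    for _ in range(eval_iters):
        xb, yb = get_batch("val")
        _, loss = model(xb, yb)
        losses.append(loss.item())
    model.train()
    return sum(losses) / len(losses)

print(f"Val loss: {estimate_val_loss(model):.4f}")   # should track the training loss closely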

Watch It Learn

πŸ‹οΈ Interactive: Simulated Training Progress

Click "Start Training" to watch the model learn (simulated).

Step 0 / 5000 Loss: 4.17
Model output will appear here as training progresses…

πŸ“‰ Expected Loss Curve

Training loss over 5000 steps: it starts at ~4.17, drops steeply in the early steps (rapid learning), then declines gradually to ~1.48 by step 5000 (fine-tuning).

Part 5: Step 4 β€” Generate Text

Now the fun part! We give the model a starting prompt and let it predict one character at a time, feeding each prediction back in as input. The temperature parameter controls how creative vs. safe the output is.

πŸ‘Ά Like You're 5

Temperature is like a "creativity dial." Turn it low (0.3) = the model always picks the safest, most obvious next letter (boring but correct). Turn it high (1.5) = it takes wild guesses (creative but sometimes nonsense). The sweet spot is around 0.8.

Generation Function

Python
@torch.no_grad()
def generate(model, start="\n", max_tokens=500, temperature=0.8):
    idx = torch.tensor([encode(start)], device=device)
    for _ in range(max_tokens):
        context = idx[:, -model.block_size:]
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())
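
With the model trained and the function defined, generation is a one-liner. The prompt below is just an example; any string made of characters from the vocabulary works:

Python
model.eval()   # disable dropout for generation
print(generate(model, start="ROMEO:", max_tokens=300, temperature=0.8))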

Temperature Comparison

T = 0.3 β€” Safe & Predictable
KING RICHARD II:
I am the king of the world, and the world
is the world of the king, and the king
shall be the king of the world.
T = 0.8 β€” Balanced βœ“
KING RICHARD II:
What say'st thou? Dost thou not perceive
that I am faint with weeping? Let me rest,
for heavenly comfort hath thy words beguiled.
T = 1.5 β€” Wild & Creative
KING RICHXRD IZ:
Fhath, bey'rt ouncelly grabeβ€”'twungon
mine efflarg! Swooden thee crothwick
dapperjankt fulvion! Spraze!
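
To see why the samples differ, look at what dividing the logits by the temperature does to the probability distribution before sampling. A toy example with made-up logits for three candidate characters:

Python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])   # made-up scores for three candidate characters
for t in (0.3, 0.8, 1.5):
    probs = F.softmax(logits / t, dim=-1)
    print(f"T={t}: {[round(p, 3) for p in probs.tolist()]}")
# Low T sharpens the distribution (the top choice dominates);
# high T flattens it (unlikely characters get sampled more often).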

Part 6: Step 5 β€” Simple Chat Web Interface

Let's wrap our model in a FastAPI backend and build a tiny HTML chat interface. Type a prompt, hit enter, and your Mini-GPT responds in Shakespearean English!

FastAPI Backend

Python β€” server.py
import torch
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from pydantic import BaseModel

# MiniGPT, config, generate(), encode/decode, and device come from the
# training code above (import them from wherever you defined them)

app = FastAPI()
# Load model at startup (assumes mini_gpt.pt exists)
model = MiniGPT(**config)
model.load_state_dict(torch.load("mini_gpt.pt", map_location="cpu"))
model.eval()

class Prompt(BaseModel):
    text: str
    temperature: float = 0.8

@app.post("/generate")
def gen(p: Prompt):
    output = generate(model, start=p.text, max_tokens=200, temperature=p.temperature)
    return {"response": output}

@app.get("/", response_class=HTMLResponse)
def home():
    return open("chat.html").read()

# Run: uvicorn server:app --reload
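
Once the server is running (uvicorn defaults to http://127.0.0.1:8000), you can test the endpoint before wiring up the frontend. A quick check with the requests library, assuming the default host and port:

Python
import requests

# Quick test of the /generate endpoint
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"text": "HAMLET:", "temperature": 0.8},
)
print(resp.json()["response"])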

Try It β€” Simulated Chat Demo

πŸ’¬ Chat with Mini-GPT: type anything and get a Shakespearean response. The bot opens with a greeting like:

Mini-GPT: "Hark! I am thy Mini-GPT, trained upon the works of the Bard. Speak, and I shall respond in kind."

HTML Frontend

HTML β€” chat.html
<!DOCTYPE html>
<html><body>
  <div id="chat"></div>
  <input id="inp" placeholder="Type here...">
  <button onclick="send()">Send</button>
  <script>
  // Minimal helper: append a message to the chat log
  function addMsg(text, who) {
    const div = document.createElement("div");
    div.className = who;
    div.textContent = (who === "user" ? "You: " : "Mini-GPT: ") + text;
    document.getElementById("chat").appendChild(div);
  }

  async function send() {
    const inp = document.getElementById("inp");
    const text = inp.value;
    inp.value = "";
    addMsg(text, "user");
    const res = await fetch("/generate", {
      method: "POST",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify({text, temperature: 0.8})
    });
    const data = await res.json();
    addMsg(data.response, "bot");
  }
  </script>
</body></html>

What You've Accomplished

πŸŽ‰πŸ†πŸŽ‰

You went from "What's an LLM?" to building a working text generator from scratch.

πŸ—ΊοΈ Your Complete Journey

  • Module 1 β€” Learned what LLMs are and why they matter
  • Module 2 β€” Turned text into numbers (tokenization & embeddings)
  • Module 3 β€” Refreshed neural network fundamentals
  • Module 4 β€” Understood the attention mechanism β€” the core innovation
  • Module 5 β€” Built the full Transformer architecture
  • Module 6 β€” Trained models with next-token prediction
  • Module 7 β€” Fine-tuned models for specific tasks
  • Module 8 β€” Deployed models to production
  • Module 9 (Here!) β€” Put it ALL together into a working Mini-GPT πŸ†

πŸš€ Where to Go Next

You now understand the entire LLM pipeline β€” the same one behind ChatGPT, Claude, Gemini, and Llama. Scale up the data, scale up the parameters, add RLHF, and you're building frontier AI. The biggest difference between your Mini-GPT and the frontier models is scale. You've got the fundamentals. Now go build something amazing!

Final Quiz β€” Capstone Check

You've built a Mini-GPT from scratch! Let's make sure the key concepts stuck. Answer all 5 questions.

Q1: What is the correct order of steps in the Mini-GPT pipeline built in this capstone?
Q2: What does the character-level tokenizer do in the Mini-GPT pipeline?
Q3: In the capstone project, how is the Shakespeare text prepared for training?
Q4: What does the temperature parameter control during text generation?
Q5: What does the training loss value tell you about the model?
