Everything you've learned comes together. You'll build a character-level GPT, train it on Shakespeare, and serve it with a web UI, all from scratch.
Here's what we're building: a character-level GPT trained on Shakespeare that generates new text one character at a time. It's a miniature version of how ChatGPT works: same core architecture, same training objective, same generation process.
Imagine you read every Shakespeare play a thousand times. Eventually you'd be able to write something that sounds like Shakespeare, right? That's what our model does: it reads Shakespeare so many times that it learns the patterns and can write new stuff in the same style.
We'll use TinyShakespeare, about 1 MB of Shakespeare's plays. It's small enough to train on a laptop but large enough to produce surprisingly good results.
A tokenizer is like giving every letter its own secret number. "A" = 0, "B" = 1, "C" = 2, and so on. The computer only understands numbers, so we convert every character in Shakespeare into a list of numbers, train on those numbers, and convert the output numbers back to letters.
import torch
import urllib.request

# Download Tiny Shakespeare (~1 MB of text)
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "shakespeare.txt")

# Load and inspect
text = open("shakespeare.txt").read()
print(f"Total characters: {len(text):,}")  # ~1,115,394

# Build character-level vocabulary
chars = sorted(set(text))
vocab_size = len(chars)  # 65 unique characters
print(f"Vocab: {''.join(chars)}")

# Encode / decode functions
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)

# Convert entire text to a tensor and split 90/10 into train / validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
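As a quick check that the mapping works, encoding a string and then decoding it should reproduce the original exactly (the specific integer IDs depend on where each character lands in the sorted vocabulary):

# Round-trip sanity check
sample = "To be, or not to be"
ids = encode(sample)
print(ids[:5])                 # a few small integers; exact values depend on the vocab
print(decode(ids) == sample)   # True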
Now we assemble everything from the course: token embeddings + positional encoding + N transformer blocks + final linear head. This is essentially the same architecture as GPT-2, just much smaller.
Each module we learned is like a LEGO brick. Embeddings snap onto positional encoding. Transformer blocks (attention + feed-forward) stack on top of each other. The output head snaps on at the end. Put them all together and you get a GPT model!
# Reasonable defaults for a laptop-trainable model
config = {
    "vocab_size": 65,   # characters in Shakespeare
    "n_embd": 384,      # embedding dimension
    "n_head": 6,        # attention heads
    "n_layer": 6,       # transformer blocks
    "block_size": 256,  # context window
    "dropout": 0.2,     # regularization
}
# Total params: ~10.7 million (GPT-4 is estimated to have over a trillion)
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        w = q @ k.transpose(-2, -1) * k.size(-1)**-0.5            # scale by sqrt(head_size)
        w = w.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask: no peeking ahead
        w = self.dropout(F.softmax(w, dim=-1))
        return w @ self.value(x)

class MultiHead(nn.Module):
    """Several attention heads in parallel, concatenated and projected."""
    def __init__(self, n_head, head_size, n_embd, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(head_size, n_embd, block_size, dropout) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.proj(torch.cat([h(x) for h in self.heads], dim=-1)))

class FeedForward(nn.Module):
    """Position-wise MLP with a 4x expansion."""
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: attention + feed-forward, each with a pre-norm residual."""
    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        self.sa = MultiHead(n_head, n_embd // n_head, n_embd, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    """Token embeddings + positional embeddings + N blocks + final linear head."""
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.tok_emb(idx)                                 # (B, T, n_embd)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # (T, n_embd), broadcast over batch
        x = self.ln_f(self.blocks(tok + pos))
        logits = self.head(x)                                   # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
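Before training, it's worth a quick sanity check that the pieces wire together and that the size matches the estimate in the config comment. A minimal sketch (the exact count depends on the config above):

# Instantiate the model and count its parameters
m = MiniGPT(**config)
n_params = sum(p.numel() for p in m.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # roughly 10.7M with the defaults above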
Training is the magic moment: we feed Shakespeare in, the model predicts the next character, checks if it was right, adjusts its weights, and repeats thousands of times until it "gets" Shakespeare.
It's like practicing spelling. At first you get almost every letter wrong (loss = 4.0, basically random). But after thousands of tries, you start getting most letters right (loss = 1.5). The "loss" number tells you how confused the model still is: lower = smarter!
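Where does that starting loss of about 4 come from? With 65 possible characters, a model guessing uniformly at random gives each one probability 1/65, and the cross-entropy of that guess is ln(65):

import math

# Cross-entropy of a uniform guess over 65 characters
print(math.log(65))  # ≈ 4.17, which is why training starts near 4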
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MiniGPT(**config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch(split, block_size=256, batch_size=64):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix]).to(device)
    y = torch.stack([d[i+1:i+block_size+1] for i in ix]).to(device)  # targets: same sequence shifted by one
    return x, y

# Training loop: ~5000 steps, roughly 15 minutes on a GPU
for step in range(5000):
    xb, yb = get_batch("train")
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"Step {step:>5d} | Loss: {loss.item():.4f}")

# Save the trained model
torch.save(model.state_dict(), "mini_gpt.pt")
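The loop above only prints the training loss. To check the model isn't just memorizing, you can also measure loss on the held-out split from time to time. Here's a minimal sketch that reuses get_batch (the eval_iters value of 50 is an arbitrary choice):

@torch.no_grad()
def estimate_val_loss(model, eval_iters=50):
    model.eval()
    losses = []
    for _ in range(eval_iters):
        xb, yb = get_batch("val")
        _, loss = model(xb, yb)
        losses.append(loss.item())
    model.train()
    return sum(losses) / len(losses)

print(f"Validation loss: {estimate_val_loss(model):.4f}")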
Now the fun part! We give the model a starting prompt and let it predict one character at a time, feeding each prediction back in as input. The temperature parameter controls how creative vs. safe the output is.
Temperature is like a "creativity dial." Turn it low (0.3) and the model strongly favors the safest, most obvious next letter (boring but correct). Turn it high (1.5) and it takes wild guesses (creative but sometimes nonsense). The sweet spot is usually around 0.8.
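To see the dial in action, here's a toy example (illustrative numbers only) of how dividing the logits by the temperature reshapes the probabilities before sampling:

import torch
import torch.nn.functional as F

# Three made-up logits standing in for three candidate next letters
logits = torch.tensor([2.0, 1.0, 0.1])
for temp in (0.3, 0.8, 1.5):
    print(temp, F.softmax(logits / temp, dim=-1).tolist())
# Low temperature sharpens the distribution (the safest pick dominates);
# high temperature flattens it (more randomness in the choices).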
@torch.no_grad()
def generate(model, start="\n", max_tokens=500, temperature=0.8):
    idx = torch.tensor([encode(start)], device=device)
    for _ in range(max_tokens):
        context = idx[:, -model.block_size:]               # crop to the context window
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature            # last position only, scaled by temperature
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next character
        idx = torch.cat([idx, next_id], dim=1)             # append and feed back in
    return decode(idx[0].tolist())
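With the trained model on hand, you can compare temperatures side by side. For example (the "ROMEO:" prompt is just an illustration):

# Generate from the same prompt at three different temperatures
for temp in (0.3, 0.8, 1.5):
    print(f"--- temperature {temp} ---")
    print(generate(model, start="ROMEO:", max_tokens=200, temperature=temp))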
Let's wrap our model in a FastAPI backend and build a tiny HTML chat interface. Type a prompt, hit Send, and your Mini-GPT responds in Shakespearean English!
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from pydantic import BaseModel

app = FastAPI()

# Load model at startup (assumes mini_gpt.pt exists)
model = MiniGPT(**config)
model.load_state_dict(torch.load("mini_gpt.pt", map_location="cpu"))
model.eval()

class Prompt(BaseModel):
    text: str
    temperature: float = 0.8

@app.post("/generate")
def gen(p: Prompt):
    output = generate(model, start=p.text, max_tokens=200, temperature=p.temperature)
    return {"response": output}

@app.get("/", response_class=HTMLResponse)
def home():
    return open("chat.html").read()

# Run: uvicorn server:app --reload
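Once the server is running, you can hit the endpoint without the web page at all. A minimal client sketch using only the standard library (assumes uvicorn's default address, localhost:8000):

import json
import urllib.request

# POST a prompt to the /generate endpoint and print the model's reply
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps({"text": "ROMEO:", "temperature": 0.8}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])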
Type anything and get a Shakespearean response!
<!DOCTYPE html>
<html>
<body>
  <div id="chat"></div>
  <input id="inp" placeholder="Type here...">
  <button onclick="send()">Send</button>
  <script>
    // Minimal helper (assumed implementation): append a message to the chat log
    function addMsg(text, who) {
      const div = document.createElement("div");
      div.className = who;
      div.textContent = text;
      document.getElementById("chat").appendChild(div);
    }

    async function send() {
      const text = document.getElementById("inp").value;
      addMsg(text, "user");
      const res = await fetch("/generate", {
        method: "POST",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify({text, temperature: 0.8})
      });
      const data = await res.json();
      addMsg(data.response, "bot");
    }
  </script>
</body>
</html>
You went from "What's an LLM?" to building a working text generator from scratch.
You now understand the entire LLM pipeline, the same one behind ChatGPT, Claude, Gemini, and Llama. Scale up the data, scale up the parameters, add RLHF, and you're building frontier AI. The biggest difference between your Mini-GPT and GPT-4 is scale. You've got the fundamentals. Now go build something amazing!
You've built a Mini-GPT from scratch! Let's make sure the key concepts stuck. Answer all 5 questions.