LLM from Scratch
Build a tiny language model with PyTorch: tokenization, embeddings, transformer blocks, next-token prediction. Understand how GPT-style models work under the hood.
The Scenario
Imagine this…
ChatGPT, Claude, Gemini: they all feel like magic. You type a question and a human-sounding answer appears. But what's actually happening? At its core, every LLM does one thing: predict the next word given the previous words. Repeat that thousands of times and you get paragraphs. In this project, you strip away the scale and build a tiny language model from scratch. It won't write essays, but you'll see exactly how attention, embeddings, and layers combine to model language. Perfect for interviews, and for saying "I understand how transformers work."
What You'll Build
- Tokenizer: Character-level encoder/decoder (text → numbers → text).
- Dataset: A PyTorch Dataset that creates context-window + target pairs.
- Model: Embedding → N transformer blocks (self-attention + FFN + LayerNorm) → linear head.
- Training: Cross-entropy loss, AdamW optimizer, loss logging.
- Generation: Feed a prompt, sample next token, repeat.
Prerequisites
| Prerequisite | What you should know | Course |
| --- | --- | --- |
| Python | Classes, functions, list comprehensions. Comfortable writing 50+ line scripts. | Python Course |
| PyTorch basics | Tensors, autograd, nn.Module, DataLoader. You should know how to define a simple neural network. | Deep Learning |
| Transformer concept | What attention is, why positional encoding exists. We'll build it step by step, but having read about it helps. | Gen AI Intro |
Step-by-Step Plan
1. Get text data. Download a small corpus (e.g. Tiny Shakespeare, ~1 MB).
2. Build a tokenizer. Character-level: map each unique character to an integer and back.
3. Create a Dataset. Sliding window: input = context_length chars, target = next char for each position.
4. Embedding + positional encoding. Turn token IDs into vectors and add position information.
5. Build self-attention. Scaled dot-product attention with a causal mask (can't see the future).
6. Build a transformer block. Combine attention + feed-forward + layer norm + residual connections.
7. Full model + training loop. Stack blocks, add a linear head, train with cross-entropy and AdamW.
8. Generate text. Given a prompt, autoregressively sample tokens to produce new text.
Get Text Data
Download a small corpus to train on
What and why
A language model learns from text. We need a file with enough text to learn patterns (word order, punctuation, style) but small enough to train on a laptop. Tiny Shakespeare (~1 MB, 40,000 lines) is the classic choice: a concatenation of Shakespeare's plays in one plain-text file.
import urllib.request
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
urllib.request.urlretrieve(url, 'input.txt')
with open('input.txt', 'r') as f:
    text = f.read()
print(f"Total characters: {len(text):,}")
print(f"First 200 chars:\n{text[:200]}")
What just happened
We downloaded the Tiny Shakespeare dataset (~1.1 million characters) and loaded it into a Python string. We printed the length and a preview. This text is what the model will learn from: it will try to predict the next character given previous characters.
Step 1 complete!
- Downloaded ~1.1M characters of Shakespeare text
- Loaded into a Python string variable
Build a Character-Level Tokenizer
Convert characters to numbers and back
Why tokenize?
Neural networks work with numbers, not letters. We need to convert every character (a, b, c, space, newline, etc.) to a unique integer. With this corpus's vocabulary, "Hello" encodes to [20, 43, 50, 50, 53]. Character-level is the simplest tokenizer; real LLMs use subword tokenizers (BPE), but the concept is identical.
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")
# Character to integer and back
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
def encode(s):
    return [char_to_idx[c] for c in s]

def decode(ids):
    return ''.join(idx_to_char[i] for i in ids)
# Test it
print(encode("Hello"))
print(decode(encode("Hello")))
Line by line
sorted(set(text)): Get every unique character in the text, sorted. That's our vocabulary (~65 characters: letters, digits, punctuation, spaces, newlines).
char_to_idx: Dictionary mapping each character to a number. E.g. 'a' → 0, 'b' → 1, etc.
idx_to_char: Reverse dictionary: number → character.
encode/decode: Convert a string to a list of ints and vice versa. The model works in integer-land; we decode back to text for humans.
Step 2 complete!
- Vocabulary of ~65 unique characters
- encode() and decode() functions working
Create a PyTorch Dataset
Sliding window of context β next token pairs
What's a "context window"?
The model looks at the last N characters (the context) and predicts the next one. If context_length=8 and the text is "Hello World", one training example is: input = "Hello Wo", target = "ello Wor". For every position in the input, the target is the next character. We slide this window across the entire text to create thousands of training examples.
import torch
from torch.utils.data import Dataset, DataLoader
CONTEXT_LENGTH = 64
class CharDataset(Dataset):
    def __init__(self, text, context_length):
        self.data = torch.tensor(encode(text), dtype=torch.long)
        self.context_length = context_length

    def __len__(self):
        return len(self.data) - self.context_length

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.context_length + 1]
        x = chunk[:-1]  # input: first N tokens
        y = chunk[1:]   # target: shifted by 1
        return x, y
# Split: 90% train, 10% val
split = int(0.9 * len(text))
train_ds = CharDataset(text[:split], CONTEXT_LENGTH)
val_ds = CharDataset(text[split:], CONTEXT_LENGTH)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False)
print(f"Train examples: {len(train_ds):,}")
print(f"Val examples: {len(val_ds):,}")
What each part does
CharDataset: Stores the entire text as a tensor of integers. Each __getitem__ returns a window of 64 characters (input) and the same window shifted by 1 (target). So for every position, the target is "what comes next".
DataLoader: Feeds batches of 64 examples at a time to the model. shuffle=True randomizes the order each epoch so the model doesn't memorize the sequence.
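To see concretely what `__getitem__` returns, here is a standalone toy version of the sliding window (made-up string and a context of 4, not the real corpus or hyperparameters):

```python
import torch

# Toy illustration: how one chunk of text becomes an (input, target) pair.
text = "hello world"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = torch.tensor([stoi[c] for c in text])

context_length = 4
chunk = data[0 : context_length + 1]  # 5 tokens
x, y = chunk[:-1], chunk[1:]          # y is x shifted left by one
print(x.tolist(), y.tolist())
# For every position i, y[i] is the token that follows x[i] in the text.
```

One chunk of N+1 tokens yields N supervised predictions at once, which is why training is so sample-efficient.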
Step 3 complete!
- Dataset class with sliding context windows
- 90/10 train/val split
- DataLoaders ready (batch_size=64)
Embedding + Positional Encoding
Turn token IDs into vectors and add position info
Why embeddings?
The number 42 doesn't tell the model anything about the character it represents. An embedding maps each token ID to a learned vector (e.g. 128 numbers). Similar characters end up with similar vectors after training. Positional encoding tells the model where each token is in the sequence; without it, "cat sat" and "sat cat" look the same.
import torch.nn as nn
D_MODEL = 128 # embedding dimension
class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x shape: (batch, seq_len)
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        tok = self.token_emb(x)          # (B, T, d_model)
        pos = self.pos_emb(positions)    # (T, d_model)
        return tok + pos                 # add them together
What this does
token_emb: Lookup table: token ID → 128-dimensional vector. Learned during training.
pos_emb: Lookup table: position (0, 1, 2, …) → 128-dimensional vector. Also learned.
tok + pos: We add them element-wise. Now each token vector encodes both "what character am I" and "where am I in the sequence".
Step 4 complete!
- Token embedding (vocab_size → d_model)
- Positional embedding (max_len → d_model)
- Combined by addition
Build Self-Attention
The core mechanism that lets tokens "look at" each other
What is attention?
Each token asks: "Which other tokens in the sequence should I pay attention to?" For example, in "The cat sat on the mat", when processing "sat", the model might attend strongly to "cat" (who sat?) and "on" (where?). Attention computes a weighted average of all token vectors, where the weights are learned. Causal mask means a token can only look at tokens before it, not the future. That's how GPT works: left to right.
import math
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x)  # (B, T, 3*C)
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape for multi-head: (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # Causal mask: prevent attending to future tokens
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v  # (B, n_heads, T, head_dim)
        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)
Breaking it down
Q, K, V: Every token produces three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I give?"). We compute all three at once with one linear layer.
Multi-head: Instead of one big attention, we split into 4 smaller "heads" (128/4=32 each). Each head learns different patterns (one might learn grammar, another semantics).
scores = Q @ K^T / sqrt(d): The dot product measures how similar a Query and a Key are. Dividing by sqrt(head_dim) keeps the scores from growing with the dimension, which would otherwise push the softmax into near-one-hot territory and kill gradients.
Causal mask: We fill everything above the diagonal of the score matrix with -infinity. After softmax, those entries become 0 weight, so future tokens contribute nothing.
weights @ V: Weighted average of Values. Each token gets a mix of information from tokens it attended to.
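To see the mask in action, here is a tiny standalone demo (T=4, with all raw scores set equal so the surviving weights are easy to read):

```python
import torch

# Demonstration of the causal mask: after softmax, each row attends
# uniformly over positions 0..i and gives exactly 0 to future positions.
T = 4
scores = torch.zeros(T, T)  # pretend all raw attention scores are equal
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights)
# row 0 -> [1, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0], row 3 -> [0.25]*4
```

Note that every row still sums to 1: masking removes the future from the average without breaking the probability interpretation.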
Step 5 complete!
- Multi-head self-attention with causal mask
- Q, K, V projections + output projection
- Scaled dot-product with softmax
Build a Transformer Block
Attention + Feed-Forward + LayerNorm + Residuals
What's a "block"?
A transformer block is one "layer" of processing. It has two sub-layers: (1) self-attention (mix information across positions) and (2) feed-forward network (process each position independently). We add residual connections (shortcuts) and layer normalization to help training. Then we stack multiple blocks; more blocks = more capacity to learn patterns.
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # attention + residual
        x = x + self.ff(self.ln2(x))    # feed-forward + residual
        return x
Line by line
ln1, ln2 (LayerNorm): Normalize the values so they don't explode or vanish. Applied before each sub-layer (pre-norm style, like GPT-2).
attn: Our causal self-attention from Step 5.
ff: Two linear layers with GELU activation in between. Expands to 4x the dimension then compresses back. This adds non-linearity and capacity.
x = x + ...: The "+" is the residual connection. The original input flows through unchanged and the block's output is added on top. This helps gradients flow during training (prevents vanishing gradients).
Step 6 complete!
- TransformerBlock: attention + FFN
- Pre-norm LayerNorm
- Residual connections on both sub-layers
Full Model + Training Loop
Stack blocks, add head, train with cross-entropy
N_LAYERS = 4
N_HEADS = 4
class MiniGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = TokenAndPositionEmbedding(vocab_size, D_MODEL, CONTEXT_LENGTH)
        self.blocks = nn.Sequential(*[
            TransformerBlock(D_MODEL, N_HEADS) for _ in range(N_LAYERS)
        ])
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, vocab_size)

    def forward(self, x, targets=None):
        x = self.embed(x)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, vocab_size),
                targets.view(-1)
            )
        return logits, loss
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MiniGPT().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
What's happening
MiniGPT: Our full model. Embedding → 4 transformer blocks → final LayerNorm → linear head that predicts a score for every possible next character (vocab_size scores).
cross_entropy: The loss function. It measures how far our predictions are from the actual next character. Lower = better.
AdamW: The optimizer that updates model weights to reduce the loss. lr=3e-4 is a common learning rate for small transformers.
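A quick sanity check you can run before training (toy logits, not the real model): an untrained model's predictions are roughly uniform over the vocabulary, so cross-entropy should start near ln(vocab_size).

```python
import math
import torch
import torch.nn.functional as F

# With uniform (all-zero) logits, cross-entropy equals ln(vocab_size)
# no matter what the targets are. For ~65 characters that's ~4.17,
# which is where the training loss should start.
vocab_size = 65
logits = torch.zeros(1000, vocab_size)           # uniform predictions
targets = torch.randint(0, vocab_size, (1000,))  # random "next chars"
loss = F.cross_entropy(logits, targets)
print(loss.item(), math.log(vocab_size))  # both ≈ 4.174
```

If your epoch-1 loss is far above this, something is wrong with the data pipeline rather than the model.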
Training loop
EPOCHS = 5
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        logits, loss = model(xb, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            _, loss = model(xb, yb)
            val_loss += loss.item()
    val_loss /= len(val_loader)

    print(f"Epoch {epoch+1}/{EPOCHS} | Train loss: {avg_loss:.4f} | Val loss: {val_loss:.4f}")
Expected behavior
Loss starts around 4.2 (ln 65, i.e. random guessing among ~65 characters) and should drop toward 1.5–2.0. On a CPU a full run over this dataset can take a long time (potentially hours, depending on hardware); on a GPU it's much faster. If you're CPU-bound, train on a slice of the text or fewer epochs. If loss doesn't decrease at all, double-check the learning rate and data loading.
Step 7 complete!
- MiniGPT model with ~800K parameters
- Training loop with train + validation loss
- Loss decreasing from ~4.2 toward ~1.5–2.0
Generate Text!
Feed a prompt and watch your model write
@torch.no_grad()
def generate(model, prompt, max_new_tokens=200, temperature=0.8):
    model.eval()
    tokens = torch.tensor(encode(prompt), dtype=torch.long, device=device).unsqueeze(0)
    for _ in range(max_new_tokens):
        # Crop to context length
        context = tokens[:, -CONTEXT_LENGTH:]
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return decode(tokens[0].tolist())
# Try it!
print(generate(model, "ROMEO:\n", max_new_tokens=300))
How generation works
Autoregressive: Feed the prompt tokens to the model. Take the last position's logits (predictions for what comes next). Divide by temperature (lower = more confident/repetitive, higher = more random/creative). Convert to probabilities with softmax. Sample one token. Append it. Repeat.
The output won't be perfect Shakespeare, but you'll see it learned basic patterns: character names, line breaks, iambic-ish phrasing. That's a language model working!
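To see what the temperature division actually does, here is a standalone sketch with made-up logits:

```python
import torch

# Temperature reshapes the sampling distribution: dividing logits by a
# small T sharpens it (greedy-ish), a large T flattens it (more random).
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # arbitrary example scores
for temp in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temp, dim=-1)
    print(f"T={temp}: {[round(p, 3) for p in probs.tolist()]}")
# At T=0.5 most probability mass sits on the top token; at T=2.0 it
# spreads across all tokens, so sampling gets more adventurous.
```

Try regenerating with temperature=0.5 and temperature=1.2 to feel the difference in the Shakespeare output.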
Project complete!
You built a GPT-style language model from scratch. Here's what you can put on your resume:
- Implemented a character-level transformer language model in PyTorch
- Built custom tokenizer, dataset pipeline, multi-head causal attention, and transformer blocks
- Trained on Shakespeare corpus; achieved text generation with learned linguistic patterns
- Deep understanding of attention mechanism, positional encoding, and autoregressive decoding
What's next?
- Try a subword tokenizer (BPE) instead of character-level
- Increase model size (more layers, bigger d_model) and train longer
- Add dropout for regularization
- Fine-tune on your own text (emails, code, song lyrics)
- Read the original "Attention Is All You Need" paper