LLM from Scratch
Build a tiny language model with PyTorch: tokenization, embeddings, transformer blocks, next-token prediction. Understand how GPT-style models work under the hood.
The Scenario
Imagine this…
ChatGPT, Claude, Gemini: they all feel like magic. You type a question and a human-sounding answer appears. But what's actually happening? At its core, every LLM does one thing: predict the next word given the previous words. Repeat that thousands of times and you get paragraphs. In this project, you strip away the scale and build a tiny language model from scratch. It won't write essays, but you'll see exactly how attention, embeddings, and layers combine to model language. Perfect for interviews, and for saying "I understand how transformers work."
What You'll Build
- Tokenizer: Character-level encoder/decoder (text → numbers → text).
- Dataset: A PyTorch Dataset that creates context-window + target pairs.
- Model: Embedding → N transformer blocks (self-attention + FFN + LayerNorm) → linear head.
- Training: Cross-entropy loss, AdamW optimizer, loss logging.
- Generation: Feed a prompt, sample next token, repeat.
Prerequisites
| Prerequisite | What you should know | Course |
| --- | --- | --- |
| Python | Classes, functions, list comprehensions. Comfortable writing 50+ line scripts. | Python Course |
| PyTorch basics | Tensors, autograd, nn.Module, DataLoader. You should know how to define a simple neural network. | Deep Learning |
| Transformer concept | What attention is, why positional encoding exists. We'll build it step by step, but having read about it helps. | Gen AI Intro |
Step-by-Step Plan
1. Get text data. Download a small corpus (e.g. Tiny Shakespeare, ~1 MB).
2. Build a tokenizer. Character-level: map each unique character to an integer and back.
3. Create a Dataset. Sliding window: input = context_length chars, target = next char for each position.
4. Embedding + positional encoding. Turn token IDs into vectors and add position information.
5. Build self-attention. Scaled dot-product attention with a causal mask (can't see the future).
6. Build a transformer block. Combine attention + feed-forward + layer norm + residual connections.
7. Full model + training loop. Stack blocks, add a linear head, train with cross-entropy and AdamW.
8. Generate text. Given a prompt, autoregressively sample tokens to produce new text.
Get Text Data
Download a small corpus to train on
What and why
A language model learns from text. We need a file with enough text to learn patterns (word order, punctuation, style) but small enough to train on a laptop. Tiny Shakespeare (~1 MB, 40,000 lines) is the classic choice: a concatenation of Shakespeare's plays in one plain-text file.
import urllib.request
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
urllib.request.urlretrieve(url, 'input.txt')
with open('input.txt', 'r') as f:
    text = f.read()
print(f"Total characters: {len(text):,}")
print(f"First 200 chars:\n{text[:200]}")
What just happened
We downloaded the Tiny Shakespeare dataset (~1.1 million characters) and loaded it into a Python string. We printed the length and a preview. This text is what the model will learn from: it will try to predict the next character given previous characters.
Step 1 complete!
- Downloaded ~1.1M characters of Shakespeare text
- Loaded into a Python string variable
Build a Character-Level Tokenizer
Convert characters to numbers and back
Why tokenize?
Neural networks work with numbers, not letters. We need to convert every character (a, b, c, space, newline, etc.) to a unique integer. With this corpus's vocabulary, "Hello" encodes to [20, 43, 50, 50, 53]. Character-level is the simplest tokenizer; real LLMs use subword tokenizers (BPE), but the concept is identical.
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")
# Character to integer and back
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
def encode(s):
    return [char_to_idx[c] for c in s]

def decode(ids):
    return ''.join(idx_to_char[i] for i in ids)
# Test it
print(encode("Hello"))
print(decode(encode("Hello")))
Line by line
sorted(set(text)): Get every unique character in the text, sorted. That's our vocabulary (~65 characters: letters, digits, punctuation, spaces, newlines).
char_to_idx: Dictionary mapping each character to a number. E.g. 'a' → 0, 'b' → 1, etc.
idx_to_char: Reverse dictionary: number → character.
encode/decode: Convert a string to a list of ints and vice versa. The model works in integer-land; we decode back to text for humans.
Step 2 complete!
- Vocabulary of ~65 unique characters
- encode() and decode() functions working
Create a PyTorch Dataset
Sliding window of context β next token pairs
What's a "context window"?
The model looks at the last N characters (the context) and predicts the next one. If context_length=8 and the text is "Hello World", one training example is: input = "Hello Wo", target = "ello Wor". For every position in the input, the target is the next character. We slide this window across the entire text to create thousands of training examples.
import torch
from torch.utils.data import Dataset, DataLoader
CONTEXT_LENGTH = 64
class CharDataset(Dataset):
    def __init__(self, text, context_length):
        self.data = torch.tensor(encode(text), dtype=torch.long)
        self.context_length = context_length

    def __len__(self):
        return len(self.data) - self.context_length

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.context_length + 1]
        x = chunk[:-1]  # input: first N tokens
        y = chunk[1:]   # target: shifted by 1
        return x, y
# Split: 90% train, 10% val
split = int(0.9 * len(text))
train_ds = CharDataset(text[:split], CONTEXT_LENGTH)
val_ds = CharDataset(text[split:], CONTEXT_LENGTH)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False)
print(f"Train examples: {len(train_ds):,}")
print(f"Val examples: {len(val_ds):,}")
What each part does
CharDataset: Stores the entire text as a tensor of integers. Each __getitem__ returns a window of 64 characters (input) and the same window shifted by 1 (target). So for every position, the target is "what comes next".
DataLoader: Feeds batches of 64 examples at a time to the model. shuffle=True randomizes the order each epoch so the model doesn't memorize the sequence.
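To see concretely what `__getitem__` returns, here is a standalone toy version of the sliding window (made-up string and a context of 4, not the real corpus or hyperparameters):

```python
import torch

# Toy illustration: how one chunk of text becomes an (input, target) pair.
text = "hello world"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = torch.tensor([stoi[c] for c in text])

context_length = 4
chunk = data[0 : context_length + 1]  # 5 tokens
x, y = chunk[:-1], chunk[1:]          # y is x shifted left by one
print(x.tolist(), y.tolist())
# For every position i, y[i] is the token that follows x[i] in the text.
```

One chunk of N+1 tokens yields N supervised predictions at once, which is why training is so sample-efficient.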
Step 3 complete!
- Dataset class with sliding context windows
- 90/10 train/val split
- DataLoaders ready (batch_size=64)
Embedding + Positional Encoding
Turn token IDs into vectors and add position info
Why embeddings?
The number 42 doesn't tell the model anything about the character it represents. An embedding maps each token ID to a learned vector (e.g. 128 numbers). Similar characters end up with similar vectors after training. Positional encoding tells the model where each token is in the sequence; without it, "cat sat" and "sat cat" look the same.
import torch.nn as nn
D_MODEL = 128 # embedding dimension
class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x shape: (batch, seq_len)
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        tok = self.token_emb(x)          # (B, T, d_model)
        pos = self.pos_emb(positions)    # (T, d_model)
        return tok + pos                 # add them together
What this does
token_emb: Lookup table: token ID → 128-dimensional vector. Learned during training.
pos_emb: Lookup table: position (0, 1, 2, …) → 128-dimensional vector. Also learned.
tok + pos: We add them element-wise. Now each token vector encodes both "what character am I" and "where am I in the sequence".
Step 4 complete!
- Token embedding (vocab_size → d_model)
- Positional embedding (max_len → d_model)
- Combined by addition
Build Self-Attention
The core mechanism that lets tokens "look at" each other
What is attention?
Each token asks: "Which other tokens in the sequence should I pay attention to?" For example, in "The cat sat on the mat", when processing "sat", the model might attend strongly to "cat" (who sat?) and "on" (where?). Attention computes a weighted average of all token vectors, where the weights are learned. Causal mask means a token can only look at tokens before it, not the future. That's how GPT works: left to right.
import math
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x)  # (B, T, 3*C)
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape for multi-head: (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # Causal mask: prevent attending to future tokens
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v  # (B, n_heads, T, head_dim)
        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)
Breaking it down
Q, K, V: Every token produces three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I give?"). We compute all three at once with one linear layer.
Multi-head: Instead of one big attention, we split into 4 smaller "heads" (128/4=32 each). Each head learns different patterns (one might learn grammar, another semantics).
scores = Q @ K^T / sqrt(d): The dot product measures how similar a Query and a Key are. Dividing by sqrt(head_dim) keeps the scores from growing with the dimension, which would otherwise push the softmax into near-one-hot territory and kill gradients.
Causal mask: We fill everything above the diagonal of the score matrix with -infinity. After softmax, those entries become 0 weight, so future tokens contribute nothing.
weights @ V: Weighted average of Values. Each token gets a mix of information from tokens it attended to.
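To see the mask in action, here is a tiny standalone demo (T=4, with all raw scores set equal so the surviving weights are easy to read):

```python
import torch

# Demonstration of the causal mask: after softmax, each row attends
# uniformly over positions 0..i and gives exactly 0 to future positions.
T = 4
scores = torch.zeros(T, T)  # pretend all raw attention scores are equal
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights)
# row 0 -> [1, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0], row 3 -> [0.25]*4
```

Note that every row still sums to 1: masking removes the future from the average without breaking the probability interpretation.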
Step 5 complete!
- Multi-head self-attention with causal mask
- Q, K, V projections + output projection
- Scaled dot-product with softmax
Build a Transformer Block
Attention + Feed-Forward + LayerNorm + Residuals
What's a "block"?
A transformer block is one "layer" of processing. It has two sub-layers: (1) self-attention (mix information across positions) and (2) feed-forward network (process each position independently). We add residual connections (shortcuts) and layer normalization to help training. Then we stack multiple blocks; more blocks = more capacity to learn patterns.
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # attention + residual
        x = x + self.ff(self.ln2(x))    # feed-forward + residual
        return x
Line by line
ln1, ln2 (LayerNorm): Normalize the values so they don't explode or vanish. Applied before each sub-layer (pre-norm style, like GPT-2).
attn: Our causal self-attention from Step 5.
ff: Two linear layers with GELU activation in between. Expands to 4x the dimension then compresses back. This adds non-linearity and capacity.
x = x + ...: The "+" is the residual connection. The original input flows through unchanged and the block's output is added on top. This helps gradients flow during training (prevents vanishing gradients).
Step 6 complete!
- TransformerBlock: attention + FFN
- Pre-norm LayerNorm
- Residual connections on both sub-layers
Full Model + Training Loop
Stack blocks, add head, train with cross-entropy
N_LAYERS = 4
N_HEADS = 4
class MiniGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = TokenAndPositionEmbedding(vocab_size, D_MODEL, CONTEXT_LENGTH)
        self.blocks = nn.Sequential(*[
            TransformerBlock(D_MODEL, N_HEADS) for _ in range(N_LAYERS)
        ])
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, vocab_size)

    def forward(self, x, targets=None):
        x = self.embed(x)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, vocab_size),
                targets.view(-1)
            )
        return logits, loss
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MiniGPT().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
What's happening
MiniGPT: Our full model. Embedding → 4 transformer blocks → final LayerNorm → linear head that predicts a score for every possible next character (vocab_size scores).
cross_entropy: The loss function. It measures how far our predictions are from the actual next character. Lower = better.
AdamW: The optimizer that updates model weights to reduce the loss. lr=3e-4 is a common learning rate for small transformers.
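A quick sanity check you can run before training (toy logits, not the real model): an untrained model's predictions are roughly uniform over the vocabulary, so cross-entropy should start near ln(vocab_size).

```python
import math
import torch
import torch.nn.functional as F

# With uniform (all-zero) logits, cross-entropy equals ln(vocab_size)
# no matter what the targets are. For ~65 characters that's ~4.17,
# which is where the training loss should start.
vocab_size = 65
logits = torch.zeros(1000, vocab_size)           # uniform predictions
targets = torch.randint(0, vocab_size, (1000,))  # random "next chars"
loss = F.cross_entropy(logits, targets)
print(loss.item(), math.log(vocab_size))  # both ≈ 4.174
```

If your epoch-1 loss is far above this, something is wrong with the data pipeline rather than the model.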
Training loop
EPOCHS = 5
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        logits, loss = model(xb, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            _, loss = model(xb, yb)
            val_loss += loss.item()
    val_loss /= len(val_loader)

    print(f"Epoch {epoch+1}/{EPOCHS} | Train loss: {avg_loss:.4f} | Val loss: {val_loss:.4f}")
Expected behavior
Loss starts around 4.2 (ln 65, i.e. random guessing among ~65 characters) and should drop toward 1.5–2.0. On a CPU a full run over this dataset can take a long time (potentially hours, depending on hardware); on a GPU it's much faster. If you're CPU-bound, train on a slice of the text or fewer epochs. If loss doesn't decrease at all, double-check the learning rate and data loading.
Step 7 complete!
- MiniGPT model with ~800K parameters
- Training loop with train + validation loss
- Loss decreasing from ~4.2 toward ~1.5–2.0
Generate Text!
Feed a prompt and watch your model write
@torch.no_grad()
def generate(model, prompt, max_new_tokens=200, temperature=0.8):
    model.eval()
    tokens = torch.tensor(encode(prompt), dtype=torch.long, device=device).unsqueeze(0)
    for _ in range(max_new_tokens):
        # Crop to context length
        context = tokens[:, -CONTEXT_LENGTH:]
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return decode(tokens[0].tolist())
# Try it!
print(generate(model, "ROMEO:\n", max_new_tokens=300))
How generation works
Autoregressive: Feed the prompt tokens to the model. Take the last position's logits (predictions for what comes next). Divide by temperature (lower = more confident/repetitive, higher = more random/creative). Convert to probabilities with softmax. Sample one token. Append it. Repeat.
The output won't be perfect Shakespeare, but you'll see it learned basic patterns: character names, line breaks, iambic-ish phrasing. That's a language model working!
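To see what the temperature division actually does, here is a standalone sketch with made-up logits:

```python
import torch

# Temperature reshapes the sampling distribution: dividing logits by a
# small T sharpens it (greedy-ish), a large T flattens it (more random).
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # arbitrary example scores
for temp in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temp, dim=-1)
    print(f"T={temp}: {[round(p, 3) for p in probs.tolist()]}")
# At T=0.5 most probability mass sits on the top token; at T=2.0 it
# spreads across all tokens, so sampling gets more adventurous.
```

Try regenerating with temperature=0.5 and temperature=1.2 to feel the difference in the Shakespeare output.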
Project complete!
You built a GPT-style language model from scratch. Here's what you can put on your resume:
- Implemented a character-level transformer language model in PyTorch
- Built custom tokenizer, dataset pipeline, multi-head causal attention, and transformer blocks
- Trained on Shakespeare corpus; achieved text generation with learned linguistic patterns
- Deep understanding of attention mechanism, positional encoding, and autoregressive decoding
What's next?
- Try a subword tokenizer (BPE) instead of character-level
- Increase model size (more layers, bigger d_model) and train longer
- Add dropout for regularization
- Fine-tune on your own text (emails, code, song lyrics)
- Read the original "Attention Is All You Need" paper