How do we teach a computer to "read"? Spoiler: we convert every word into numbers. Let's see exactly how that works.
Here's the fundamental problem: computers are gloriously dumb calculators. They can add, subtract, multiply, and compare numbers at lightning speed, but they have absolutely no idea what a "word" is.
Deep inside every computer, everything is stored as 0s and 1s (binary). Your photos? Numbers. Your music? Numbers. This web page? You guessed it... numbers.
Imagine your best friend only speaks Math. You want to tell them about your day, but you can only pass them numbers on a piece of paper. You'd need a codebook: "1 = happy", "2 = sad", "3 = hungry"... That's exactly the challenge we face with computers!
Imagine meeting someone from the Planet Computoria. They don't know any human language, but they're a math genius. To communicate, you create a dictionary: "apple" = 42, "eat" = 7, "I" = 1. Now you can say "I eat apple" as [1, 7, 42]. They understand perfectly! That's tokenization in a nutshell.
Before we can turn words into numbers, we first need to decide: what counts as one "piece" of text? This is called tokenization, and it's the very first step in every LLM.
A token is the smallest unit of text that the model works with. It could be a whole word, part of a word, or even a single character. Most modern LLMs use subword tokens — chunks somewhere between characters and full words.
Think of words as LEGO creations. Instead of having a unique brick for every creation (billions needed!), we have a smaller set of reusable bricks. "playing" = "play" brick + "ing" brick. "played" = "play" brick + "ed" brick. Same root, different endings — and the model learns what each brick means!
The most popular tokenization method is called Byte Pair Encoding (BPE). It starts with individual characters and repeatedly merges the most frequent pair of adjacent tokens into a new, longer token. Let's walk through it step by step:
Remember how you started texting "laughing out loud" then shortened it to "LOL"? BPE does the same thing automatically. It looks at tons of text, finds letter combos that show up together constantly (like "th", "ing", "tion"), and creates shortcuts for them. The most common combos become single tokens. Rare words stay broken into smaller pieces.
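Here's a tiny hand-worked trace on a made-up mini-corpus (not the output of any real tokenizer). Start with the text "low low lowest" split into individual characters: l, o, w, (space), l, o, w, (space), l, o, w, e, s, t. The most frequent adjacent pair is ("l", "o"), appearing 3 times, so we merge it into a new token "lo". Now the most frequent pair is ("lo", "w"), also appearing 3 times, so we merge again to get "low". After just two merges, the common word "low" is a single token, while the rarer ending "e", "s", "t" stays spelled out until later merges pick it up. That's the whole algorithm: count pairs, merge the winner, repeat.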
Type any text below and watch it get split into tokens in real-time!
This uses a simplified BPE-like algorithm for demonstration
# Install: pip install transformers
from transformers import AutoTokenizer

# Load GPT-2's tokenizer (same family as ChatGPT)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, I am learning about LLMs!"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# ['Hello', ',', ' I', ' am', ' learning', ' about', ' LL', 'Ms', '!']

# Convert tokens to their numeric IDs
token_ids = tokenizer.encode(text)
print("Token IDs:", token_ids)
# [15496, 11, 314, 716, 4673, 546, 27140, 10128, 0]

# Decode back to text
decoded = tokenizer.decode(token_ids)
print("Decoded:", decoded)
# "Hello, I am learning about LLMs!"
Token IDs (like 15496 for "Hello") are just labels — the number itself has no meaning. We need something smarter: a way to represent words so that similar words have similar numbers. Enter: embeddings.
An embedding is a list of numbers (a vector) that represents a word's meaning. Think of it as GPS coordinates — but instead of 2 dimensions (latitude, longitude), embeddings use hundreds of dimensions!
Imagine every word lives in a gigantic city. Words with similar meanings are neighbors. "King" and "Queen" live on the same street. "Cat" and "Dog" share an apartment building. "Happy" and "Joyful" are literally roommates. The embedding is each word's home address in this city — a set of coordinates that tells you exactly where it lives.
Each word gets a "personality profile" — a list of numbers describing it. Maybe number 1 measures "how royal" it is, number 2 measures "how feminine," number 3 measures "how alive," and so on. "King" might be [0.9 royal, 0.1 feminine, 0.8 alive]. "Queen" would be [0.9 royal, 0.9 feminine, 0.8 alive]. Similar profiles = similar meanings!
In practice, an embedding is a vector of hundreds of decimal numbers. Here's what the embedding for "cat" might look like (simplified to 8 dimensions):
Real embeddings have 768+ dimensions — we show 8 for clarity
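To make this concrete, here's a small sketch with made-up 8-dimensional vectors (the numbers are invented for illustration, not taken from any real model). It shows two ideas at once: an embedding is just a row of numbers that a token ID looks up in a table, and cosine similarity scores "cat" as much closer to "dog" than to "car":

import numpy as np

# A made-up embedding table: one 8-dimensional row per token ID.
# Real models learn these values; here they are invented for illustration.
vocab = {"cat": 0, "dog": 1, "car": 2}
embeddings = np.array([
    [0.8, 0.1, 0.9, -0.3, 0.2, 0.5, -0.1, 0.4],    # "cat"
    [0.7, 0.2, 0.8, -0.2, 0.3, 0.4, -0.2, 0.5],    # "dog" (a similar profile)
    [-0.5, 0.9, -0.4, 0.6, -0.7, 0.1, 0.8, -0.3],  # "car" (a very different profile)
])

def cosine_similarity(a, b):
    """1.0 = pointing the same way (similar meaning), 0 = unrelated, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = embeddings[vocab["cat"]]  # looking up an embedding is just row indexing
print("cat vs dog:", round(cosine_similarity(cat, embeddings[vocab["dog"]]), 3))
print("cat vs car:", round(cosine_similarity(cat, embeddings[vocab["car"]]), 3))
# "cat vs dog" scores far higher than "cat vs car": similar profiles = similar meanings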
Pick a word and see which words are closest in embedding space!
Click a word above to see its nearest neighbors!
# Install: pip install gensim
import gensim.downloader as api

# Load pre-trained word vectors (this downloads ~66MB)
model = api.load("glove-wiki-gigaword-50")

# See the embedding for a word (50 numbers!)
vector = model["cat"]
print(f"Shape: {vector.shape}")         # (50,)
print(f"First 5 values: {vector[:5]}")  # [0.22, -0.08, 0.48, ...]

# Find the most similar words
similar = model.most_similar("cat", topn=5)
print("Words closest to 'cat':")
for word, score in similar:
    print(f"  {word}: {score:.3f}")
# dog: 0.922, cats: 0.899, pet: 0.875, ...

# The famous analogy: king - man + woman ≈ queen
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=1
)
print(f"king - man + woman = {result[0][0]}")  # queen!
We've turned words into number vectors. But there's a sneaky problem: we lost word order! The vectors for "dog bites man" and "man bites dog" would be the same set of numbers. Clearly, order changes everything.
Imagine a recipe that says "add sugar, then add salt." If you shuffle those instructions to "add salt, then add sugar," the cake still works. But if a fire safety guide says "call 911, then leave the building" and you shuffle it to "leave the building, then call 911"... that's a very different situation! Word order carries meaning, and we need to preserve it.
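Here's a quick sketch of why order disappears if we simply pool the word vectors (these vectors are random stand-ins, not real embeddings): averaging the vectors for "dog bites man" and "man bites dog" gives exactly the same result, because addition doesn't care about order.

import numpy as np

rng = np.random.default_rng(0)
# Random stand-in vectors, one per word (real embeddings would be learned)
vecs = {w: rng.normal(size=8) for w in ["dog", "bites", "man"]}

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

pooled_a = np.mean([vecs[w] for w in sentence_a], axis=0)
pooled_b = np.mean([vecs[w] for w in sentence_b], axis=0)

print(np.allclose(pooled_a, pooled_b))  # True: order is invisible to a plain average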
Click Shuffle to rearrange the words and see how the meaning changes!
Imagine a concert where everyone has the same ticket that just says "admitted." Chaos! Nobody knows where to sit. Now imagine each ticket says "Row 3, Seat 7." That's positional encoding! We stamp each word-vector with a special "seat number" so the model knows the word's exact position in the sentence. Even if words get processed in parallel, the model knows that "dog" came first and "man" came last.
Transformers use a clever trick: they generate position information using sine and cosine waves at different frequencies. Each position gets a unique pattern, like a fingerprint made of waves.
import numpy as np

def positional_encoding(max_len, d_model):
    """Generate positional encodings using sin/cos waves."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len).reshape(-1, 1)
    div_term = np.exp(
        np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)
    )
    pe[:, 0::2] = np.sin(position * div_term)  # Even indices
    pe[:, 1::2] = np.cos(position * div_term)  # Odd indices
    return pe

# Generate encodings for 10 positions, 8 dimensions
pe = positional_encoding(10, 8)
print("Position 0:", np.round(pe[0], 3))
print("Position 1:", np.round(pe[1], 3))
# Each position has a unique wave fingerprint!
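How do these wave fingerprints actually get used? In the original Transformer they are simply added to the token embeddings, so each vector carries both "what the word is" and "where it sits." Here's a minimal sketch that continues from the positional_encoding() function above; the embedding table is random, purely to show the shapes (token IDs borrowed from our toy "I eat apple" dictionary):

import numpy as np

# Random embedding table, just to illustrate the shapes (real ones are learned)
vocab_size, d_model = 100, 8
embedding_table = np.random.randn(vocab_size, d_model)

token_ids = [1, 7, 42]                            # "I eat apple" from our toy dictionary
token_embeddings = embedding_table[token_ids]     # shape (3, 8)
pe = positional_encoding(len(token_ids), d_model) # shape (3, 8), from the function above

model_input = token_embeddings + pe  # same word at a different position gets a different vector
print(model_input.shape)  # (3, 8)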
Theory is great, but building it yourself is where the real learning happens. Let's code two tokenizers: a simple character-level one, then a real BPE tokenizer.
The simplest possible tokenizer: every character is a token. Like spelling out every word letter by letter.
class CharTokenizer:
    """A dead-simple character-level tokenizer."""

    def __init__(self):
        self.char_to_id = {}
        self.id_to_char = {}

    def train(self, text):
        """Build vocabulary from text."""
        chars = sorted(set(text))
        self.char_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_char = {i: ch for ch, i in self.char_to_id.items()}
        print(f"Vocabulary size: {len(chars)} characters")
        print(f"Vocab: {chars}")

    def encode(self, text):
        """Convert text to list of integer IDs."""
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        """Convert list of integer IDs back to text."""
        return "".join(self.id_to_char[i] for i in ids)

# Try it out!
tok = CharTokenizer()
tok.train("hello world")

encoded = tok.encode("hello")
print(f"'hello' → {encoded}")  # [3, 2, 4, 4, 5]

decoded = tok.decode(encoded)
print(f"{encoded} → '{decoded}'")  # 'hello'
Now the real deal. This is the same algorithm GPT uses (simplified). We start with characters and keep merging the most common pair.
from collections import Counter

class SimpleBPE:
    """A minimal BPE tokenizer built from scratch."""

    def __init__(self, num_merges=20):
        self.num_merges = num_merges
        self.merges = {}  # (pair) → merged token
        self.vocab = {}   # token → id

    def _get_pairs(self, tokens):
        """Count frequency of adjacent token pairs."""
        pairs = Counter()
        for i in range(len(tokens) - 1):
            pairs[(tokens[i], tokens[i + 1])] += 1
        return pairs

    def _merge_pair(self, tokens, pair, merged):
        """Replace all occurrences of pair with merged token."""
        new_tokens = []
        i = 0
        while i < len(tokens):
            if (i < len(tokens) - 1
                    and tokens[i] == pair[0]
                    and tokens[i + 1] == pair[1]):
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens

    def train(self, text):
        """Learn BPE merges from training text."""
        tokens = list(text)
        for step in range(self.num_merges):
            pairs = self._get_pairs(tokens)
            if not pairs:
                break
            best_pair = pairs.most_common(1)[0][0]
            merged = best_pair[0] + best_pair[1]
            self.merges[best_pair] = merged
            tokens = self._merge_pair(tokens, best_pair, merged)
            print(f"Step {step+1}: merge '{best_pair[0]}'+'{best_pair[1]}' → '{merged}'")

        # Build vocabulary from unique tokens
        unique = sorted(set(tokens))
        self.vocab = {tok: i for i, tok in enumerate(unique)}
        print(f"\nFinal vocab ({len(self.vocab)} tokens): {unique}")

    def encode(self, text):
        """Tokenize new text using learned merges."""
        tokens = list(text)
        for pair, merged in self.merges.items():
            tokens = self._merge_pair(tokens, pair, merged)
        return tokens, [self.vocab.get(t, -1) for t in tokens]

# Train on sample text
bpe = SimpleBPE(num_merges=10)
bpe.train("low low low lowest newest widest")

# Tokenize new text
tokens, ids = bpe.encode("lowest")
print(f"\n'lowest' → tokens: {tokens}, ids: {ids}")
We built a machine that reads a lot of text, notices which letter combos keep showing up together (like "l" and "o" always appear as "lo"), and creates shortcuts. Next time it sees new text, it uses those shortcuts to chop it into efficient pieces. That's the same core idea behind GPT's tokenizer — just with more training data and more merge steps!