Module 2

Text to Numbers

How do we teach a computer to "read"? Spoiler: we convert every word into numbers. Let's see exactly how that works.

Part 1: Why Computers Can't Read Words

Here's the fundamental problem: computers are gloriously dumb calculators. They can add, subtract, multiply, and compare numbers at lightning speed, but they have absolutely no idea what a "word" is.

Deep inside every computer, everything is stored as 0s and 1s (binary). Your photos? Numbers. Your music? Numbers. This web page? You guessed it... numbers.

ELI5: Why Numbers?

Imagine your best friend only speaks Math. You want to tell them about your day, but you can only pass them numbers on a piece of paper. You'd need a codebook: "1 = happy", "2 = sad", "3 = hungry"... That's exactly the challenge we face with computers!

From Words to Binary to Numbers

"Hello" 01001000 01100101 01101100 01101100 01101111 72 101 108 108 111 Token ID: 15339 Human text Binary (0s & 1s) ASCII codes Token ID

The Foreigner Who Only Speaks Math

Imagine meeting someone from the Planet Computoria. They don't know any human language, but they're a math genius. To communicate, you create a dictionary: "apple" = 42, "eat" = 7, "I" = 1. Now you can say "I eat apple" as [1, 7, 42]. They understand perfectly! That's tokenization in a nutshell.
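Here is that codebook idea as a tiny Python sketch; the word-to-number mapping is made up purely for illustration:

# A made-up codebook, just like the one for our friend from Computoria
codebook = {"I": 1, "eat": 7, "apple": 42}

sentence = "I eat apple"
numbers = [codebook[word] for word in sentence.split()]
print(numbers)  # [1, 7, 42]

# The reverse codebook lets us translate back again
reverse = {num: word for word, num in codebook.items()}
print(" ".join(reverse[n] for n in numbers))  # "I eat apple"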

The Big Idea

  • Computers process numbers, not letters or words
  • We need a systematic way to convert text → numbers
  • The conversion must be reversible (numbers → text too)
  • Similar words should ideally get similar number representations

Part 2: Tokenization — Chopping Text into Pieces

Before we can turn words into numbers, we first need to decide: what counts as one "piece" of text? This is called tokenization, and it's the very first step in every LLM.

What Are Tokens?

A token is the smallest unit of text that the model works with. It could be a whole word, part of a word, or even a single character. Most modern LLMs use subword tokens — chunks somewhere between characters and full words.

How "playing" Gets Tokenized

"playing" tokenize! "play" + "ing" Token 1 (root) Token 2 (suffix)

ELI5: Tokens Are Like LEGO Bricks

Think of words as LEGO creations. Instead of having a unique brick for every creation (billions needed!), we have a smaller set of reusable bricks. "playing" = "play" brick + "ing" brick. "played" = "play" brick + "ed" brick. Same root, different endings — and the model learns what each brick means!

Byte Pair Encoding (BPE) — How Tokens Are Learned

The most popular tokenization method is called Byte Pair Encoding (BPE). It starts with individual characters and repeatedly merges the most frequent adjacent pair of tokens into a single new token. Let's walk through it step by step:

BPE: Step-by-Step Merging

Corpus: "low low low lowest newest" Step 0: l o w _ l o w _ l o w e s t _ n e w e s t Start: characters Step 1: lo w _ lo w _ lo w e s t _ n e w e s t Merge: l+o → lo Step 2: low _ low _ low e s t _ n e w e s t Merge: lo+w → low Step 3: low _ low _ low es t _ n e w es t Merge: e+s → es Step 4: low _ low _ low est _ n e w est Merge: es+t → est

BPE Is Like Texting Shorthand

Remember how you started texting "laughing out loud" then shortened it to "LOL"? BPE does the same thing automatically. It looks at tons of text, finds letter combos that show up together constantly (like "th", "ing", "tion"), and creates shortcuts for them. The most common combos become single tokens. Rare words stay broken into smaller pieces.

Why Subword Tokenization Wins

  • Word-level: Needs a HUGE vocabulary. Can't handle new words. "ChatGPT" → unknown!
  • Character-level: Tiny vocabulary but very long sequences. Slow to train.
  • Subword (BPE): Sweet spot! Common words stay whole, rare words get split into known pieces.
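To make the tradeoff concrete, here is a rough comparison you can run yourself. It uses the same GPT-2 tokenizer as the next code block; the example sentence is arbitrary, and exact token counts depend on the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization handles unusual words like ChatGPT gracefully."

char_tokens = list(text)                   # character-level: tiny vocab, long sequence
word_tokens = text.split()                 # word-level: short sequence, huge vocab needed
subword_tokens = tokenizer.tokenize(text)  # subword: the middle ground

print(len(char_tokens), "characters")
print(len(word_tokens), "words")
print(len(subword_tokens), "subword tokens")  # somewhere in between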


Code: Using a Real Tokenizer

# Install: pip install transformers
from transformers import AutoTokenizer

# Load GPT-2's tokenizer (same family as ChatGPT)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, I am learning about LLMs!"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# ['Hello', ',', 'ĠI', 'Ġam', 'Ġlearning', 'Ġabout', 'ĠLL', 'Ms', '!']  ('Ġ' marks a leading space)

# Convert tokens to their numeric IDs
token_ids = tokenizer.encode(text)
print("Token IDs:", token_ids)
# [15496, 11, 314, 716, 4673, 546, 27140, 10128, 0]

# Decode back to text
decoded = tokenizer.decode(token_ids)
print("Decoded:", decoded)
# "Hello, I am learning about LLMs!"

Part 3: Word Embeddings — Words as Coordinates

Token IDs (like 15496 for "Hello") are just labels — the number itself has no meaning. We need something smarter: a way to represent words so that similar words have similar numbers. Enter: embeddings.

Words on a Map

An embedding is a list of numbers (a vector) that represents a word's meaning. Think of it as GPS coordinates — but instead of 2 dimensions (latitude, longitude), embeddings use hundreds of dimensions!

The Word City Analogy

Imagine every word lives in a gigantic city. Words with similar meanings are neighbors. "King" and "Queen" live on the same street. "Cat" and "Dog" share an apartment building. "Happy" and "Joyful" are literally roommates. The embedding is each word's home address in this city — a set of coordinates that tells you exactly where it lives.

The Famous King − Man + Woman = Queen

King − Man + Woman ≈ Queen: picture the words plotted on a "gender" axis and a "royalty" axis. Moving from Man 🧑 to Woman 👩 (or from King 👑 to Queen 👑) adds "female"; moving from Man to King (or from Woman to Queen) adds "royalty".

ELI5: What's an Embedding?

Each word gets a "personality profile" — a list of numbers describing it. Maybe number 1 measures "how royal" it is, number 2 measures "how feminine," number 3 measures "how alive," and so on. "King" might be [0.9 royal, 0.1 feminine, 0.8 alive]. "Queen" would be [0.9 royal, 0.9 feminine, 0.8 alive]. Similar profiles = similar meanings!
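Those "personality profiles" are just vectors, so similarity becomes a number we can compute. A tiny sketch with made-up three-dimensional profiles (real embeddings have hundreds of dimensions, and individual dimensions are rarely this interpretable):

import numpy as np

# Made-up 3-D "profiles": [how royal, how feminine, how alive]
king  = np.array([0.9, 0.1, 0.8])
queen = np.array([0.9, 0.9, 0.8])
cat   = np.array([0.0, 0.5, 0.9])

def cosine_similarity(a, b):
    """Similarity between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king, queen))  # high: similar profiles
print(cosine_similarity(king, cat))    # lower: different profiles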

Embedding = A List of Numbers

In practice, an embedding is a vector of hundreds of decimal numbers. Here's what the embedding for "cat" might look like (simplified to 8 dimensions):

Embedding Vector for "cat" (8 dimensions shown)

Real embeddings have 768+ dimensions — we show 8 for clarity
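For instance, a made-up 8-dimensional vector could look like this; the values are invented for illustration, not taken from any real model:

# A made-up 8-dimensional embedding for "cat" (values invented for illustration)
cat_embedding = [0.21, -0.47, 0.83, 0.05, -0.62, 0.34, 0.11, -0.28]
print(len(cat_embedding))  # 8 dimensions; real models use 768 or more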


Code: Working with Embeddings

# Install: pip install gensim
import gensim.downloader as api

# Load pre-trained word vectors (this downloads ~66MB)
model = api.load("glove-wiki-gigaword-50")

# See the embedding for a word (50 numbers!)
vector = model["cat"]
print(f"Shape: {vector.shape}")    # (50,)
print(f"First 5 values: {vector[:5]}") # [0.22, -0.08, 0.48, ...]

# Find the most similar words
similar = model.most_similar("cat", topn=5)
print("Words closest to 'cat':")
for word, score in similar:
    print(f"  {word}: {score:.3f}")
# dog: 0.922, cats: 0.899, pet: 0.875, ...

# The famous analogy: king - man + woman ≈ queen
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=1
)
print(f"king - man + woman = {result[0][0]}")
# queen!

Part 4: Positional Encoding — Order Matters!

We've turned words into number vectors. But there's a sneaky problem: the vectors themselves carry no information about word order. "dog bites man" and "man bites dog" use exactly the same set of vectors, and a Transformer processes all positions in parallel, so without extra information it can't tell the two sentences apart. Clearly, order changes everything.
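A quick way to see the problem in code, using random stand-in vectors instead of real embeddings: any order-ignoring combination of the same word vectors comes out identical.

import numpy as np

rng = np.random.default_rng(0)
# Random stand-in embeddings for three words
embeddings = {w: rng.normal(size=4) for w in ["dog", "bites", "man"]}

def bag_of_vectors(sentence):
    """Sum the word vectors; any order-ignoring summary gives the same result."""
    return sum(embeddings[w] for w in sentence.split())

a = bag_of_vectors("dog bites man")
b = bag_of_vectors("man bites dog")
print(np.allclose(a, b))  # True: without position info, the two sentences look identical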

ELI5: Why Position Matters

Imagine a recipe that says "add sugar, then add salt." If you shuffle those instructions to "add salt, then add sugar," the cake still works. But if a fire safety guide says "call 911, then leave the building" and you shuffle it to "leave the building, then call 911"... that's a very different situation! Word order carries meaning, and we need to preserve it.

Word Order Changes Meaning!

"dog bites man" 🐕 A dog attacks a person — scary news! Reverse the order to "man bites dog" and the meaning changes completely.

The Seat Number Analogy

Concert Seating

Imagine a concert where everyone has the same ticket that just says "admitted." Chaos! Nobody knows where to sit. Now imagine each ticket says "Row 3, Seat 7." That's positional encoding! We stamp each word-vector with a special "seat number" so the model knows the word's exact position in the sentence. Even if words get processed in parallel, the model knows that "dog" came first and "man" came last.

How It Works: Sin/Cos Waves

Transformers use a clever trick: they generate position information using sine and cosine waves at different frequencies. Each position gets a unique pattern, like a fingerprint made of waves.

Positional Encoding Waves

Slow and fast sine waves and a slow cosine wave, sampled at positions 1 through 5: each position gets a unique combination of wave values.

Why Sin/Cos? Three Superpowers

  • Unique fingerprint: Every position gets a different pattern of values
  • Relative distances: The model can learn that position 5 is "3 steps after" position 2
  • Infinite length: Works for any sequence length — no upper limit!

Code: Positional Encoding in NumPy

Concretely, position pos gets sin(pos / 10000^(2i/d_model)) in even dimension 2i and cos(pos / 10000^(2i/d_model)) in odd dimension 2i+1. That formula is exactly what this function computes:
import numpy as np

def positional_encoding(max_len, d_model):
    """Generate positional encodings using sin/cos waves."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len).reshape(-1, 1)
    div_term = np.exp(
        np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)
    )

    pe[:, 0::2] = np.sin(position * div_term)  # Even indices
    pe[:, 1::2] = np.cos(position * div_term)  # Odd indices
    return pe

# Generate encodings for 10 positions, 8 dimensions
pe = positional_encoding(10, 8)
print("Position 0:", np.round(pe[0], 3))
print("Position 1:", np.round(pe[1], 3))
# Each position has a unique wave fingerprint!
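In a Transformer, these encodings are simply added to the token embeddings, so each vector carries both what the word is and where it sits. A minimal sketch that continues from the function above (reusing np and positional_encoding), with random stand-in embeddings:

# Random stand-in embeddings for a 10-token sentence, 8 dimensions each
token_embeddings = np.random.randn(10, 8)

# Element-wise addition stamps each token with its position
model_input = token_embeddings + positional_encoding(10, 8)
print(model_input.shape)  # (10, 8): same shape, now position-aware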

Part 5: Build a Tokenizer from Scratch

Theory is great, but building it yourself is where the real learning happens. Let's code two tokenizers: a simple character-level one, then a real BPE tokenizer.

Step 1: Character-Level Tokenizer

The simplest possible tokenizer: every character is a token. Like spelling out every word letter by letter.

class CharTokenizer:
    """A dead-simple character-level tokenizer."""

    def __init__(self):
        self.char_to_id = {}
        self.id_to_char = {}

    def train(self, text):
        """Build vocabulary from text."""
        chars = sorted(set(text))
        self.char_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_char = {i: ch for ch, i in self.char_to_id.items()}
        print(f"Vocabulary size: {len(chars)} characters")
        print(f"Vocab: {chars}")

    def encode(self, text):
        """Convert text to list of integer IDs."""
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        """Convert list of integer IDs back to text."""
        return "".join(self.id_to_char[i] for i in ids)

# Try it out!
tok = CharTokenizer()
tok.train("hello world")

encoded = tok.encode("hello")
print(f"'hello' → {encoded}")  # [3, 2, 5, 5, 6]

decoded = tok.decode(encoded)
print(f"{encoded} → '{decoded}'")  # 'hello'

Step 2: BPE Tokenizer from Scratch

Now the real deal. This is the same algorithm GPT uses (simplified). We start with characters and keep merging the most common pair.

from collections import Counter

class SimpleBPE:
    """A minimal BPE tokenizer built from scratch."""

    def __init__(self, num_merges=20):
        self.num_merges = num_merges
        self.merges = {}     # (pair) → merged token
        self.vocab = {}      # token → id

    def _get_pairs(self, tokens):
        """Count frequency of adjacent token pairs."""
        pairs = Counter()
        for i in range(len(tokens) - 1):
            pairs[(tokens[i], tokens[i + 1])] += 1
        return pairs

    def _merge_pair(self, tokens, pair, merged):
        """Replace all occurrences of pair with merged token."""
        new_tokens = []
        i = 0
        while i < len(tokens):
            if (i < len(tokens) - 1
                and tokens[i] == pair[0]
                and tokens[i + 1] == pair[1]):
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens

    def train(self, text):
        """Learn BPE merges from training text."""
        tokens = list(text)

        for step in range(self.num_merges):
            pairs = self._get_pairs(tokens)
            if not pairs:
                break

            best_pair = pairs.most_common(1)[0][0]
            merged = best_pair[0] + best_pair[1]
            self.merges[best_pair] = merged
            tokens = self._merge_pair(tokens, best_pair, merged)
            print(f"Step {step+1}: merge '{best_pair[0]}'+'{best_pair[1]}' → '{merged}'")

        # Build vocabulary from unique tokens
        unique = sorted(set(tokens))
        self.vocab = {tok: i for i, tok in enumerate(unique)}
        print(f"\nFinal vocab ({len(self.vocab)} tokens): {unique}")

    def encode(self, text):
        """Tokenize new text using learned merges."""
        tokens = list(text)
        for pair, merged in self.merges.items():
            tokens = self._merge_pair(tokens, pair, merged)
        return tokens, [self.vocab.get(t, -1) for t in tokens]

# Train on sample text
bpe = SimpleBPE(num_merges=10)
bpe.train("low low low lowest newest widest")

# Tokenize new text
tokens, ids = bpe.encode("lowest")
print(f"\n'lowest' → tokens: {tokens}, ids: {ids}")

ELI5: What We Just Built

We built a machine that reads a lot of text, notices which letter combos keep showing up together (like "l" and "o" always appear as "lo"), and creates shortcuts. Next time it sees new text, it uses those shortcuts to chop it into efficient pieces. That's the same core idea behind GPT's tokenizer — just with more training data and more merge steps!

Module 2: Key Takeaways

  • Computers need numbers: Text must be converted to numerical form before any processing
  • Tokenization: Splits text into subword pieces using BPE — a balance between words and characters
  • Embeddings: Convert token IDs into rich vectors where similar words have similar vectors
  • Positional encoding: Stamps each token with its position using sin/cos waves
  • The pipeline: Text → Tokens → Token IDs → Embeddings + Position → Ready for the neural network!

The Full Text-to-Numbers Pipeline

Raw Text "Hello world" Tokenize ["Hello","world"] Token IDs [15339, 995] Embeddings [0.2, -0.5, ...] + Position Ready! ✓ This is the input pipeline for every Transformer / LLM

🧪 Quiz — Test Your Understanding!

Question 1: Why can't a neural network process the raw text "Hello World" directly?

Question 2: What is the main purpose of tokenization in an LLM pipeline?

Question 3: In Byte Pair Encoding (BPE), how does the algorithm build its vocabulary?

Question 4: Why do we use word embeddings instead of just feeding raw token IDs (like 15339 for "Hello") into the model?

Question 5: What problem does positional encoding solve in Transformers?

Question 6: What is the key tradeoff when choosing a tokenizer's vocabulary size?