Module 3: Building Blocks

🧠 Neural Network Refresher

Fast-track the building blocks that LLMs use. No PhD required: just analogies, visuals, and code you can actually run.

Part 1: What is a Neural Network?

👶 Like You're 5

A neural network is a program that learns by example. You show it thousands of examples ("this picture is a cat", "this picture is a dog"), and it figures out the patterns (pointy ears, whiskers, fur color) all on its own. You never tell it the rules; it discovers them.

It's called "neural" because it was loosely inspired by how brain cells (neurons) connect to each other. But don't overthink that; it's really just math organized in layers.

๐Ÿญ The Factory Assembly Line

Imagine a factory with assembly lines. Raw materials (your data) enter on one end. At each station, workers (neurons) look at what they receive, do a small transformation, and pass the result to the next station. The first station might notice edges, the second notices shapes, the third recognizes faces. By the end of the line, the factory produces a finished product (a prediction). Each worker has a set of knobs (weights) that control how they transform the data. Training = turning those knobs until the factory produces correct answers.

🎬 How Data Flows Through a Neural Network

[Animation: dots of data flow from the inputs x₁, x₂, x₃ (sqft, beds, age), through hidden layers h₁-h₇, to the output ŷ (price). Each dot is a piece of data moving through the network.]

Inside a Single Neuron

Every neuron in the network does the same simple thing: take inputs, multiply each by a weight, add a bias, then apply an activation function. That's it: multiply, add, squish.

๐Ÿฝ๏ธ The Restaurant Tip Calculator

Imagine you're calculating a tip. Your inputs are: food quality (1-10) and service quality (1-10). You care about service twice as much as food, so your weights are: food = 0.1, service = 0.2. You always tip at least 5%, so your bias = 5.

tip = (food × 0.1) + (service × 0.2) + 5

If food = 8 and service = 9: tip = (8 × 0.1) + (9 × 0.2) + 5 = 0.8 + 1.8 + 5 = 7.6%

A neuron is exactly this! Multiply each input by its weight, add them up, add the bias, done. Training = adjusting those weights and bias until the predictions are correct.
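
Here's the tip calculator as runnable code, a minimal sketch of a single neuron (the function and variable names are just for illustration):

Python
# A single neuron: multiply each input by its weight, sum, add the bias
def neuron(inputs, weights, bias):
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# The restaurant example: food = 8, service = 9
tip = neuron(inputs=[8, 9], weights=[0.1, 0.2], bias=5)
print(tip)  # 7.6 (percent)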

Interactive: Build a Neuron!

Drag the sliders to change inputs and weights. Watch the neuron compute its output in real time.

[Interactive widget: sliders set input x = 5.0, weight w = 0.7, bias b = 0.5; the neuron computes z = x × w + b = 4.00 and output ReLU(z) = max(0, z) = 4.00.]

💡 Try making the weight negative: the neuron "flips" its response! Set the bias very negative to see ReLU clamp the output to 0.

💡 Key Takeaways

  • A neural network = layers of neurons connected together
  • Each neuron computes: output = activation(inputs × weights + bias)
  • Weights control how much each input matters
  • Bias shifts the output up or down
  • Activation function (like ReLU) adds non-linearity; without it, the whole network would just be a fancy linear equation (see the sketch after this list)
  • More layers = the network can learn more complex patterns
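
To convince yourself of that non-linearity point, here's a minimal PyTorch sketch (the layer sizes are arbitrary): two linear layers stacked with no activation in between collapse into one equivalent linear layer.

Python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)

# Two linear layers with NO activation between them...
f1 = nn.Linear(3, 5)
f2 = nn.Linear(5, 2)
y = f2(f1(x))

# ...are equivalent to one linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias
print(torch.allclose(y, x @ W.T + b, atol=1e-6))  # True: no extra power gained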

Part 2: Forward Pass & Backpropagation

Training a neural network is a two-step dance that repeats thousands of times: forward (make a prediction) and backward (learn from mistakes).

Forward Pass: Making a Prediction

Data flows from left to right through the network. Each layer transforms the data a little bit. At the end, you get a prediction. This is the forward pass: just plugging numbers into the formula, layer by layer.

👶 Like You're 5

Imagine passing a message along a chain of friends. You whisper "5" to the first friend. They multiply by 2 and whisper "10" to the next. That friend adds 3 and whispers "13" to the next. And so on until the last friend shouts out the final answer. That's a forward pass!
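
That chain of friends is just function composition; here it is as a toy sketch in Python (the function names are made up):

Python
def friend_one(x):      # multiplies by 2
    return x * 2

def friend_two(x):      # adds 3
    return x + 3

# Forward pass: each friend's output becomes the next friend's input
message = friend_two(friend_one(5))
print(message)  # 13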

Loss: How Wrong Were We?

After the forward pass, we compare our prediction to the correct answer. The difference is called the loss. The bigger the loss, the worse our prediction. Our goal: minimize the loss.

🎯 The Dart Board

You throw a dart at a bullseye. The loss is the distance between where your dart landed and the center. A loss of 0 = perfect bullseye. Training = throwing darts over and over, adjusting your aim each time. The loss function is your tape measure; it tells you exactly how far off you were.
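
As a concrete sketch of that tape measure, here is mean squared error (the loss used for the house-price example later in this module), computed both by hand and with PyTorch's built-in version:

Python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

# Mean squared error by hand: average of (prediction - target)^2
mse_by_hand = ((predictions - targets) ** 2).mean()

# The same thing with PyTorch's built-in loss
mse_builtin = nn.MSELoss()(predictions, targets)

print(mse_by_hand.item(), mse_builtin.item())  # both ≈ 0.1667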

Backpropagation: Learning from Mistakes

After computing the loss, we need to figure out which weights caused the error and how to fix them. Backpropagation sends error signals backward through the network, from output to input, computing how much each weight contributed to the mistake.

📝 The Teacher Grading Papers

Imagine a student gets a math problem wrong. The teacher doesn't just say "wrong!"; they trace back through the student's work: "You made an error in step 3, and that caused steps 4, 5, and 6 to be wrong too." The teacher gives specific feedback for each step: "Adjust step 3 a lot, step 2 a little, step 1 is fine." That's backpropagation: it assigns blame proportionally to each weight.
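
You can watch autograd assign that blame. A minimal sketch: two weights in a chain, and the gradients report how much each one contributed to the loss (the numbers here are arbitrary):

Python
import torch

w1 = torch.tensor(2.0, requires_grad=True)
w2 = torch.tensor(3.0, requires_grad=True)
x, y_true = 1.0, 10.0

# Forward pass through two "steps" of work
h = x * w1                        # step 1
y_pred = h * w2                   # step 2
loss = (y_pred - y_true) ** 2     # y_pred = 6, so loss = 16

# Backward pass: blame flows from the loss back to each weight
loss.backward()
print(w1.grad)  # tensor(-24.)  = 2*(y_pred - y_true) * w2 * x
print(w2.grad)  # tensor(-16.)  = 2*(y_pred - y_true) * h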

🎬 Forward Pass (Blue) → Loss → Backward Pass (Red)

[Animation: blue arrows carry data forward (input × w₁ → hidden × w₂ → output) to compute Loss = |ŷ - y|²; red arrows carry the gradients ∂L/∂w₂ and ∂L/∂w₁ backward to adjust each weight. 🔵 Blue = data flows forward to make a prediction; 🔴 Red = error flows backward to fix weights.]

💡 The Big Picture

  • Forward pass: data goes left → right, producing a prediction
  • Loss: measures how wrong the prediction is
  • Backpropagation: error goes right → left, computing gradients (how much to adjust each weight)
  • Gradient descent: actually updates the weights by a small step in the direction that reduces the loss
  • Repeat this thousands of times → the network gets better and better (see the miniature sketch after this list)
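
Here's the whole dance in miniature, a pure-Python sketch with a single weight (no PyTorch yet): fit y = 2x by repeating forward, loss, gradient, update.

Python
w = 0.0     # start with a bad guess for the weight
lr = 0.05   # learning rate: how big each nudge is

for step in range(20):
    x, y_true = 3.0, 6.0               # one training example: y = 2x
    y_pred = w * x                     # forward pass
    loss = (y_pred - y_true) ** 2      # loss: squared error
    grad = 2 * (y_pred - y_true) * x   # gradient dL/dw, computed by hand
    w -= lr * grad                     # gradient descent update

print(round(w, 4))  # 2.0 -- the weight has learned the true slope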

Part 3: PyTorch Crash Course

👶 Like You're 5

PyTorch is like a super-powered calculator that can do math on huge tables of numbers really fast (using your GPU), AND it automatically figures out derivatives (gradients) for you. Instead of doing calculus by hand for backpropagation, PyTorch does it in one line: loss.backward(). It's the tool that nearly every major AI lab uses.

Tensors: The Building Block

A tensor is just a multi-dimensional array of numbers. Think of it as a NumPy array that can also run on GPUs. Scalars, vectors, matrices: they're all tensors.

Python / PyTorch
import torch

# Scalar (0-D tensor)
x = torch.tensor(3.14)

# Vector (1-D tensor)
v = torch.tensor([1.0, 2.0, 3.0])

# Matrix (2-D tensor)
m = torch.tensor([[1, 2], [3, 4], [5, 6]])

# Random tensor (common for initializing weights)
w = torch.randn(3, 4)  # 3 rows, 4 cols, random normal

# Basic operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)      # tensor([5., 7., 9.])
print(a * b)      # tensor([ 4., 10., 18.])  element-wise
print(a @ b)      # tensor(32.)  dot product

# Move to GPU (if available)
if torch.cuda.is_available():
    a = a.to('cuda')

Autograd: Automatic Differentiation

This is PyTorch's killer feature. Set requires_grad=True on a tensor, do any math you want, and PyTorch will automatically compute the gradient when you call .backward().

Python / Autograd
# Tell PyTorch to track gradients for this tensor
x = torch.tensor(3.0, requires_grad=True)

# Do some math: y = x² + 2x + 1
y = x**2 + 2*x + 1

# Compute gradient: dy/dx = 2x + 2 = 2(3) + 2 = 8
y.backward()

print(x.grad)  # tensor(8.)  ← PyTorch computed this automatically!

🪄 Why Autograd is Magic

Imagine you built a Rube Goldberg machine with 100 steps. Now someone asks: "If I push the first domino 1mm further, how much further will the ball at step 100 travel?" You'd have to trace through all 100 steps with calculus. With Autograd, PyTorch builds the machine, watches it run, and automatically computes the answer. That's how backpropagation works in practice: Autograd handles all the chain-rule calculus for you.
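
A sketch of that Rube Goldberg machine in a few lines: chain 100 small steps together, and a single .backward() call traces the chain rule through all of them.

Python
import torch

x = torch.tensor(1.0, requires_grad=True)

# A long chain of 100 small steps (the "dominoes")
y = x
for _ in range(100):
    y = y * 1.01 + 0.1

# One call traces the gradient back through all 100 steps
y.backward()
print(x.grad)  # tensor(2.7048) -- dy/dx = 1.01**100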

💡 PyTorch Cheat Sheet

  • torch.tensor(data): create a tensor
  • torch.randn(shape): random tensor (normal distribution)
  • tensor.to('cuda'): move to GPU
  • requires_grad=True: track gradients
  • loss.backward(): compute all gradients
  • optimizer.step(): update weights
  • optimizer.zero_grad(): reset gradients to zero

Part 4: The Training Loop

Every neural network trains the same way: five steps that repeat over and over. Once you know this loop, you can train any neural network, from a 10-neuron toy to GPT-4.

👶 Like You're 5

It's like learning to throw a basketball. You throw (forward pass), see how far you missed (loss), think about what went wrong (backward), adjust your form a little (update weights), and throw again (repeat). After 10,000 throws, you barely miss!

🔄 The Training Loop: Watch Each Step Light Up

[Animation: each step lights up in turn: ① Forward: ŷ = model(x) → ② Loss: L = f(ŷ, y) → ③ Backward: L.backward() → ④ Update: optim.step() → ⑤ Zero: zero_grad(). That's one epoch; repeat 1,000s of times.]

The Training Loop in Code

Here it is: the 5-step loop distilled to its essence. Memorize this and you can train anything:

Python / Training Loop
for epoch in range(1000):
    y_pred = model(X)              # ① Forward pass
    loss = loss_fn(y_pred, y)      # ② Compute loss
    loss.backward()                # ③ Backpropagation
    optimizer.step()               # ④ Update weights
    optimizer.zero_grad()          # ⑤ Reset gradients

📉 Watch the Loss Decrease Over Training

[Plot: loss vs. epoch over 0-1000 epochs; the curve starts at high loss (bad!) and falls to low loss (good!).]

💡 Why Does the Loss Go Down?

  • Each iteration, backpropagation figures out which direction to nudge each weight to reduce the loss
  • Learning rate controls how big each nudge is: too big and you overshoot, too small and training is painfully slow (see the sketch after this list)
  • The curve is steep at first (lots to learn) then flattens out (fine-tuning details)
  • If the loss stops decreasing, you might need: more data, a bigger model, or a different learning rate
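
Here's that learning-rate trade-off on a toy problem, a sketch using the same hand-computed gradient idea as before (the specific numbers are illustrative):

Python
def train(lr, steps=30):
    """Gradient descent on loss = (3*w - 6)**2, whose minimum is at w = 2."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (3 * w - 6) * 3   # dL/dw, computed by hand
        w -= lr * grad
    return w

print(train(lr=0.05))    # ≈ 2.0: just right, converges to the answer
print(train(lr=0.005))   # ≈ 1.88: too small, still short after 30 steps
print(train(lr=0.12))    # huge negative number: too big, overshoots and diverges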

Part 5: Code - Train a Simple Neural Net

Let's put everything together and build a neural network that predicts house prices from square footage. This is the simplest possible end-to-end example.

👶 What We're Building

We'll create fake house data (square footage → price), build a tiny neural network (2 layers), train it for 500 epochs, and watch the loss go down. By the end, the network will have learned the relationship between house size and price, without us ever telling it the formula.

Step 1: Create the Dataset

Python
import torch
import torch.nn as nn

# Fake data: price ≈ 200 * sqft + noise
torch.manual_seed(42)
sqft = torch.rand(100, 1) * 5         # 0-5 (in thousands)
price = 200 * sqft + 50 + torch.randn(100, 1) * 20

print(f"sqft range: {sqft.min():.1f} - {sqft.max():.1f}")
print(f"price range: {price.min():.0f} - {price.max():.0f}")

Step 2: Define the Model

Python
class HousePriceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, 16),    # 1 input (sqft) → 16 hidden neurons
            nn.ReLU(),
            nn.Linear(16, 1),    # 16 hidden → 1 output (price)
        )

    def forward(self, x):
        return self.layers(x)

model = HousePriceNet()
print(model)  # Shows the architecture

Step 3: The Training Loop

Python
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(500):
    pred = model(sqft)                # ① Forward
    loss = loss_fn(pred, price)       # ② Loss
    loss.backward()                   # ③ Backward
    optimizer.step()                  # ④ Update
    optimizer.zero_grad()             # ⑤ Zero grad
    losses.append(loss.item())

    if epoch % 100 == 0:
        print(f"Epoch {epoch:>3d} | Loss: {loss.item():.1f}")

# Output:
# Epoch   0 | Loss: 132847.2
# Epoch 100 | Loss: 1205.3
# Epoch 200 | Loss: 418.7
# Epoch 300 | Loss: 394.2
# Epoch 400 | Loss: 389.8

Step 4: Make Predictions

Python
# Predict price for a 3,000 sqft house
with torch.no_grad():
    test_sqft = torch.tensor([[3.0]])
    predicted = model(test_sqft)
    print(f"3,000 sqft โ†’ predicted price: ${predicted.item():.0f}k")
    # Expected: ~$650k  (200 * 3 + 50 = 650)

Interactive: Watch Training in Action

Click "Train" to simulate 500 epochs. Watch the loss drop and the model's prediction line fit the data!

[Interactive widget: live epoch counter and loss readout above a scatter plot of price ($k) vs. square footage (thousands), with the model's prediction line fitting the data as training runs.]

🎓 What You Just Learned

  • How to create a dataset with PyTorch tensors
  • How to define a model using nn.Module and nn.Sequential
  • The 5-step training loop: forward → loss → backward → step → zero_grad
  • How to make predictions with torch.no_grad()
  • This exact pattern scales from 2 layers to GPT-3's 96 layers; only the model definition changes!

🔗 How This Connects to LLMs

LLMs like GPT are just much bigger neural networks. Instead of predicting house prices from square footage, they predict the next word from all the previous words. Instead of 2 layers, GPT-3 has 96. Instead of 16 neurons per layer, its layers are 12,288 wide. But the training loop? Exactly the same 5 steps. That's the beauty of deep learning: the core idea scales from a toy to the most powerful AI on Earth.
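
To make that concrete, here is a hedged, toy-sized sketch (hypothetical setup, vastly simplified: one token of context, a 10-word vocabulary) of next-word prediction wired into the exact same 5-step loop; only the model and the loss function change:

Python
import torch
import torch.nn as nn

# Toy next-token predictor: given one token, predict the one that follows
vocab_size = 10
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

tokens = torch.tensor([1, 2, 3, 4])        # input tokens
next_tokens = torch.tensor([2, 3, 4, 5])   # the token that follows each one

for epoch in range(100):
    logits = model(tokens)                 # ① Forward: a score per vocab word
    loss = loss_fn(logits, next_tokens)    # ② Loss: how wrong was each guess?
    loss.backward()                        # ③ Backward
    optimizer.step()                       # ④ Update
    optimizer.zero_grad()                  # ⑤ Zero: the same five steps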

Quiz: Test Your Understanding

Question 1: How does a neural network learn?

Question 2: What is the main purpose of an activation function like ReLU?

Question 3: What does backpropagation do during training?

Question 4: What is the key advantage of PyTorch tensors over regular NumPy arrays?

Question 5: What does the loss function measure?

Question 6: A model performs great on training data but poorly on new, unseen data. What is this called?
