Fast-track the building blocks that LLMs use. No PhD required: just analogies, visuals, and code you can actually run.
A neural network is a program that learns by example. You show it thousands of examples ("this picture is a cat", "this picture is a dog"), and it figures out the patterns (pointy ears, whiskers, fur color) all on its own. You never tell it the rules; it discovers them.
It's called "neural" because it was loosely inspired by how brain cells (neurons) connect to each other. But don't overthink that: it's really just math organized in layers.
Imagine a factory with assembly lines. Raw materials (your data) enter on one end. At each station, workers (neurons) look at what they receive, do a small transformation, and pass the result to the next station. The first station might notice edges, the second notices shapes, the third recognizes faces. By the end of the line, the factory produces a finished product (a prediction). Each worker has a set of knobs (weights) that control how they transform the data. Training = turning those knobs until the factory produces correct answers.
Every neuron in the network does the same simple thing: take inputs, multiply each by a weight, add a bias, then apply an activation function. That's it: just multiply, add, squish.
Imagine you're calculating a tip. Your inputs are: food quality (1-10) and service quality (1-10). You care about service twice as much as food, so your weights are: food = 0.1, service = 0.2. You always tip at least 5%, so your bias = 5.
tip = (food × 0.1) + (service × 0.2) + 5
If food = 8 and service = 9: tip = (8×0.1) + (9×0.2) + 5 = 0.8 + 1.8 + 5 = 7.6%
A neuron is exactly this! Multiply each input by its weight, add them up, add the bias, done. Training = adjusting those weights and bias until the predictions are correct.
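The whole neuron fits in a few lines of plain Python. This is a minimal sketch of the tip example above, with ReLU standing in for the "squish" step:

```python
def neuron(inputs, weights, bias):
    # Multiply each input by its weight, sum them, add the bias...
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    # ...then "squish" with an activation function (ReLU: negatives become 0)
    return max(0.0, z)

# The tip example: food = 8, service = 9, weights 0.1 / 0.2, bias 5
tip = neuron([8, 9], [0.1, 0.2], 5)
print(tip)  # ≈ 7.6
```

Training would mean nudging the `weights` and `bias` values until the outputs match the answers you want.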
Drag the sliders to change inputs and weights. Watch the neuron compute its output in real time.
💡 Try making the weight negative: the neuron "flips" its response! Set the bias very negative to see ReLU clamp to 0.
Training a neural network is a two-step dance that repeats thousands of times: forward (make a prediction) and backward (learn from mistakes).
Data flows from left to right through the network. Each layer transforms the data a little bit. At the end, you get a prediction. This is the forward pass: just plugging numbers into the formula, layer by layer.
Imagine passing a message along a chain of friends. You whisper "5" to the first friend. They multiply by 2 and whisper "10" to the next. That friend adds 3 and whispers "13" to the next. And so on until the last friend shouts out the final answer. That's a forward pass!
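That chain, sketched literally (the numbers and operations come straight from the analogy; real layers do matrix multiplies instead of single multiplications):

```python
message = 5.0            # you whisper "5" to the first friend

message = message * 2    # first friend multiplies by 2 and whispers "10"
message = message + 3    # second friend adds 3 and whispers "13"

print(message)           # the last friend shouts the final answer: 13.0
```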
After the forward pass, we compare our prediction to the correct answer. The difference is called the loss. The bigger the loss, the worse our prediction. Our goal: minimize the loss.
You throw a dart at a bullseye. The loss is the distance between where your dart landed and the center. A loss of 0 = perfect bullseye. Training = throwing darts over and over, adjusting your aim each time. The loss function is your tape measure: it tells you exactly how far off you were.
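In code, the most common tape measure is mean squared error: average the squared distances between predictions and targets. A minimal sketch with made-up numbers:

```python
preds   = [7.0, 3.0, 5.0]   # where the darts landed
targets = [6.0, 3.0, 8.0]   # where the bullseyes were

# Mean squared error: average of the squared distances
loss = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
print(loss)  # (1 + 0 + 9) / 3 ≈ 3.33
```

Squaring does two jobs: it makes all misses positive, and it punishes big misses much more than small ones.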
After computing the loss, we need to figure out which weights caused the error and how to fix them. Backpropagation sends error signals backward through the network, from output to input, computing how much each weight contributed to the mistake.
Imagine a student gets a math problem wrong. The teacher doesn't just say "wrong!"; they trace back through the student's work: "You made an error in step 3, and that caused steps 4, 5, and 6 to be wrong too." The teacher gives specific feedback for each step: "Adjust step 3 a lot, step 2 a little, step 1 is fine." That's backpropagation: it assigns blame proportionally to each weight.
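Here's that blame assignment done by hand for a tiny two-step computation (made-up numbers; this is the chain-rule calculus that backpropagation automates):

```python
x, w, b = 2.0, 3.0, 1.0

# Forward: two steps of "student work"
z = w * x              # step 1: 6.0
y = (z + b) ** 2       # step 2: 49.0

# Backward: trace the error through each step (the chain rule)
dy_dz = 2 * (z + b)    # sensitivity of step 2 to its input z: 14.0
dy_dw = dy_dz * x      # w's share of the blame: 14 * 2 = 28.0
dy_db = dy_dz * 1.0    # b's share of the blame: 14.0

print(dy_dw, dy_db)    # w gets twice as much blame as b
```

Notice the blame is proportional: because w's effect passes through x = 2, nudging w moves the output twice as much as nudging b, so it receives twice the gradient.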
PyTorch is like a super-powered calculator that can do math on huge tables of numbers really fast (using your GPU), AND it automatically figures out derivatives (gradients) for you. Instead of doing calculus by hand for backpropagation, PyTorch does it in one line: loss.backward(). It's the tool that every major AI lab uses.
A tensor is just a multi-dimensional array of numbers. Think of it as a NumPy array that can also run on GPUs. Scalars, vectors, matrices: they're all tensors.
import torch

# Scalar (0-D tensor)
x = torch.tensor(3.14)

# Vector (1-D tensor)
v = torch.tensor([1.0, 2.0, 3.0])

# Matrix (2-D tensor)
m = torch.tensor([[1, 2], [3, 4], [5, 6]])

# Random tensor (common for initializing weights)
w = torch.randn(3, 4)  # 3 rows, 4 cols, random normal

# Basic operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)  # tensor([5., 7., 9.])
print(a * b)  # tensor([ 4., 10., 18.])  element-wise
print(a @ b)  # tensor(32.)  dot product

# Move to GPU (if available)
if torch.cuda.is_available():
    a = a.to('cuda')
This is PyTorch's killer feature. Set requires_grad=True on a tensor, do any math you want, and PyTorch will automatically compute the gradient when you call .backward().
# Tell PyTorch to track gradients for this tensor
x = torch.tensor(3.0, requires_grad=True)

# Do some math: y = x² + 2x + 1
y = x**2 + 2*x + 1

# Compute gradient: dy/dx = 2x + 2 = 2(3) + 2 = 8
y.backward()
print(x.grad)  # tensor(8.)  PyTorch computed this automatically!
Imagine you built a Rube Goldberg machine with 100 steps. Now someone asks: "If I push the first domino 1mm further, how much further will the ball at step 100 travel?" You'd have to trace through all 100 steps with calculus. With Autograd, PyTorch builds the machine, watches it run, and automatically computes the answer. That's how backpropagation works in practice โ Autograd handles all the chain-rule calculus for you.
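A tiny version of that machine, with Autograd doing the tracing (the four steps are made up for illustration):

```python
import torch

x = torch.tensor(1.0, requires_grad=True)  # the first domino

a = x * 2       # step 1
b = a + 3       # step 2
c = b ** 2      # step 3
out = c / 10    # step 4

out.backward()  # Autograd traces all four steps backward

# Chain rule by hand: (1/10) * 2b * 1 * 2 = (2 * 5 * 2) / 10 = 2
print(x.grad)   # tensor(2.)
```

With 4 steps the hand calculation is easy; with the millions of steps inside a real network, Autograd is the only practical option.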
torch.tensor(data) → create a tensor
torch.randn(shape) → random tensor (normal distribution)
tensor.to('cuda') → move to GPU
requires_grad=True → track gradients
loss.backward() → compute all gradients
optimizer.step() → update weights
optimizer.zero_grad() → reset gradients to zero

Every neural network trains the same way: five steps that repeat over and over. Once you know this loop, you can train any neural network, from a 10-neuron toy to GPT-4.
It's like learning to throw a basketball. You throw (forward pass), see how far you missed (loss), think about what went wrong (backward), adjust your form a little (update weights), and throw again (repeat). After 10,000 throws, you barely miss!
Here it is: the 5-step loop distilled to its essence. Memorize this and you can train anything:
for epoch in range(1000):
    y_pred = model(X)             # ① Forward pass
    loss = loss_fn(y_pred, y)     # ② Compute loss
    loss.backward()               # ③ Backpropagation
    optimizer.step()              # ④ Update weights
    optimizer.zero_grad()         # ⑤ Reset gradients
Let's put everything together and build a neural network that predicts house prices from square footage. This is the simplest possible end-to-end example.
We'll create fake house data (square footage → price), build a tiny neural network (2 layers), train it for 500 epochs, and watch the loss go down. By the end, the network will have learned the relationship between house size and price, without us ever telling it the formula.
import torch
import torch.nn as nn

# Fake data: price ≈ 200 * sqft + noise
torch.manual_seed(42)
sqft = torch.rand(100, 1) * 5                     # 0-5 (in thousands)
price = 200 * sqft + 50 + torch.randn(100, 1) * 20

print(f"sqft range: {sqft.min():.1f} - {sqft.max():.1f}")
print(f"price range: {price.min():.0f} - {price.max():.0f}")
class HousePriceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, 16),   # 1 input (sqft) → 16 hidden neurons
            nn.ReLU(),
            nn.Linear(16, 1),   # 16 hidden → 1 output (price)
        )

    def forward(self, x):
        return self.layers(x)

model = HousePriceNet()
print(model)  # Shows the architecture
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(500):
    pred = model(sqft)            # ① Forward
    loss = loss_fn(pred, price)   # ② Loss
    loss.backward()               # ③ Backward
    optimizer.step()              # ④ Update
    optimizer.zero_grad()         # ⑤ Zero grad

    losses.append(loss.item())
    if epoch % 100 == 0:
        print(f"Epoch {epoch:>3d} | Loss: {loss.item():.1f}")

# Output:
# Epoch   0 | Loss: 132847.2
# Epoch 100 | Loss: 1205.3
# Epoch 200 | Loss: 418.7
# Epoch 300 | Loss: 394.2
# Epoch 400 | Loss: 389.8
# Predict price for a 3,000 sqft house
with torch.no_grad():
    test_sqft = torch.tensor([[3.0]])
    predicted = model(test_sqft)

print(f"3,000 sqft → predicted price: ${predicted.item():.0f}k")
# Expected: ~$650k (200 * 3 + 50 = 650)
Click "Train" to simulate 500 epochs. Watch the loss drop and the model's prediction line fit the data!
LLMs like GPT are just much bigger neural networks. Instead of predicting house prices from square footage, they predict the next word from all the previous words. Instead of 2 layers, they have 96+. Instead of 16 neurons per layer, they have 12,288. But the training loop? Exactly the same 5 steps. That's the beauty of deep learning: the core idea scales from a toy to the most powerful AI on Earth.