βš–οΈ Bias, Variance & Gradient Descent

Understand why models fail and how they learn. Master the fundamental concepts behind all ML algorithms!

Part 1: The Bias-Variance Tradeoff

Every ML model makes errors. These errors come from two sources: Bias and Variance. Understanding this tradeoff is KEY to building good models!

πŸ‘Ά In One Sentence

Bias = the model is systematically wrong (e.g. too simple, so it keeps missing the truth). Variance = the model is unstable (e.g. too sensitive to the exact training set, so it behaves differently on new data). We try to balance both so total error is as small as possible.

πŸ“ Total Error Formula

Total Error = BiasΒ² + Variance + Irreducible Noise
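For squared-error loss, this decomposition has a standard formal statement (the expectation is taken over random draws of the training set; σ² is the variance of the irreducible noise):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  \;+\; \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Here f(x) is the true function and fΜ‚(x) is the model's prediction; no amount of model tuning can remove the σ² term.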

What is Bias?

Bias = Error from wrong assumptions. A high-bias model is too simple and misses important patterns.

🎯 Archery Analogy: Bias

Imagine shooting arrows at a target. High Bias = Your arrows consistently miss the center, landing in the same wrong spot every time.

The bow is miscalibrated! No matter how many times you shoot, you'll always miss in the same direction.

What is Variance?

Variance = Error from sensitivity to training data. A high-variance model changes dramatically with different training sets.

🎯 Archery Analogy: Variance

High Variance = Your arrows scatter all over the place - sometimes left, sometimes right, sometimes high, sometimes low.

Your hand is shaky! Even if on average you hit the center, individual shots are unpredictable.

πŸ˜”

High Bias (Underfitting)

Model too simple

Misses patterns

Low train & test accuracy

Fix: Use more complex model, add features

😡

High Variance (Overfitting)

Model too complex

Memorizes noise

High train, low test accuracy

Fix: Simplify model, regularization, more data

🎯

Balanced (Good Fit)

Model just right

Captures true patterns

Good train & test accuracy

Goal: Sweet spot between simple & complex

πŸ“Š Bias-Variance Tradeoff Visualization

As Model Complexity ↑

Bias ↓ (better at fitting)

Variance ↑ (more sensitive)

Sweet Spot

Minimum Total Error

Best Generalization

Part 2: Detecting Bias & Variance

Symptom          | Training Accuracy | Test Accuracy | Diagnosis     | Solution
Both low         | 60%               | 55%           | High Bias     | More complex model, more features
Big gap          | 98%               | 65%           | High Variance | Regularization, more data, simplify
Both good, close | 88%               | 85%           | Good Fit!     | You're doing great! πŸŽ‰
# Detecting Bias vs Variance in Python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Split data (X, y are your feature matrix and target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Simple model (might have high bias)
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

train_score = r2_score(y_train, simple_model.predict(X_train))
test_score = r2_score(y_test, simple_model.predict(X_test))

print(f"Simple Model - Train: {train_score:.2f}, Test: {test_score:.2f}")

# Complex model (might have high variance)
complex_model = DecisionTreeRegressor(max_depth=None)  # No limit = overfits!
complex_model.fit(X_train, y_train)

train_score = r2_score(y_train, complex_model.predict(X_train))
test_score = r2_score(y_test, complex_model.predict(X_test))

print(f"Complex Model - Train: {train_score:.2f}, Test: {test_score:.2f}")

# Output:
# Simple Model - Train: 0.65, Test: 0.62  ← High Bias (both low)
# Complex Model - Train: 1.00, Test: 0.55 ← High Variance (big gap!)
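Beyond a single split, sklearn's learning_curve shows how the train/test gap evolves as the training set grows (a large gap that shrinks with more data is the signature of high variance). A sketch on synthetic data; the dataset and variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=300)

# An unconstrained tree tends to overfit (high variance)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="r2")

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:3d}  train R2={tr:.2f}  test R2={te:.2f}")
```

The train score stays near 1.0 (the tree memorizes) while the test score lags behind it, the same "big gap" pattern as in the table above.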

Part 3: Gradient Descent (How Models Learn)

Gradient Descent is the engine that powers most ML algorithms. It's how models find the best parameters!

⛰️ The Blindfolded Hiker Analogy

Imagine you're blindfolded on a mountain and need to reach the lowest valley.

Strategy: Feel the ground around you. Step in the direction that goes downhill. Repeat until you can't go any lower.

Gradient Descent does the same! It measures the "slope" of the error and moves parameters in the direction that reduces error.

The Process

1
Start with Random Weights

Initialize model parameters randomly (or with zeros)

2
Calculate Error (Loss)

Measure how wrong the predictions are

3
Compute Gradient

Find the direction that reduces error the most

4
Update Weights

Move parameters in that direction (by learning rate amount)

5
Repeat Until Convergence

Stop when error stops decreasing significantly

πŸ“ Weight Update Formula

new_weight = old_weight - learning_rate Γ— gradient

Learning Rate: The Step Size

Learning rate controls how big each step is. Too big or too small causes problems!

🐒

Learning Rate Too Small

Takes forever to converge

Might get stuck

Wastes computation

🦘

Learning Rate Too Large

Overshoots the minimum

Bounces around chaotically

May never converge!

πŸ‘

Learning Rate Just Right

Converges smoothly

Finds good minimum

Efficient training
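The three regimes are easy to see on a toy 1-D loss L(w) = wΒ², whose gradient is 2w, so each update is w ← w βˆ’ learning_rate Γ— 2w. A minimal sketch (the step counts and rates are illustrative):

```python
def descend(lr, w=5.0, steps=20):
    """Run gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w  # new_weight = old_weight - learning_rate * gradient
    return w

print(f"lr=0.01 (too small):  w = {descend(0.01):.3f}")   # creeps slowly toward 0
print(f"lr=0.4  (just right): w = {descend(0.4):.6f}")    # converges quickly
print(f"lr=1.1  (too large):  w = {descend(1.1):.1f}")    # overshoots and diverges!
```

With lr = 1.1 each step multiplies w by (1 βˆ’ 2.2) = βˆ’1.2, so the iterate flips sign and grows without bound, exactly the "bounces around chaotically" failure mode.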

# Simple Gradient Descent Implementation
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    # Initialize weights randomly
    n_features = X.shape[1]
    weights = np.random.randn(n_features)
    bias = 0
    
    m = len(y)  # Number of samples
    
    for i in range(iterations):
        # Make predictions with the current parameters
        predictions = np.dot(X, weights) + bias
        
        # Step 2: Calculate error (residuals; MSE is the mean of their squares)
        error = predictions - y
        
        # Step 3: Compute gradients of MSE w.r.t. weights and bias
        gradient_weights = (1/m) * np.dot(X.T, error)
        gradient_bias = (1/m) * np.sum(error)
        
        # Step 4: Update parameters (move against the gradient)
        weights = weights - learning_rate * gradient_weights
        bias = bias - learning_rate * gradient_bias
        
        # Print progress every 100 iterations
        if i % 100 == 0:
            mse = np.mean(error**2)
            print(f"Iteration {i}: MSE = {mse:.4f}")
    
    return weights, bias

# Example usage
# weights, bias = gradient_descent(X_train, y_train)
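As a quick sanity check, the same update rule should recover known coefficients from synthetic linear data. The loop is restated compactly here so the snippet runs standalone; the data (y = 3x + 2 plus noise) and the learning rate are illustrative:

```python
import numpy as np

# Synthetic data with known weight 3 and bias 2
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

# Compact gradient descent (same updates as the function above)
w, b = np.zeros(1), 0.0
for _ in range(2000):
    err = X @ w + b - y
    w -= 0.1 * (X.T @ err) / len(y)
    b -= 0.1 * err.mean()

print(f"learned weight β‰ˆ {w[0]:.2f}, bias β‰ˆ {b:.2f}")  # should land close to 3 and 2
```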

Part 4: Regularization (Fighting Overfitting)

Regularization adds a penalty for complex models, forcing them to be simpler and generalize better.

πŸ“₯ Dataset for Regularization (Ridge/Lasso with real data)

auto_mpg.csv β€” Miles per gallon and car features (cylinders, displacement, horsepower, weight, etc.). Used in regularization practice to predict mpg with Ridge/Lasso.

Download auto_mpg.csv β€” Save in the same folder as your script; use pd.read_csv("auto_mpg.csv").

πŸ“– Full code walkthrough (every line explained)

Impact of Lambda (Ξ±)

In Ridge and Lasso, alpha (often written Ξ» in theory) controls how strong the penalty is:

  • Small alpha β†’ weak penalty β†’ behaves almost like plain linear regression (risk of overfitting / high variance).
  • Large alpha β†’ strong penalty β†’ coefficients shrink hard toward zero (risk of underfitting / high bias).

So: tuning alpha is key. Use cross-validation to pick the best value (e.g. RidgeCV or LassoCV in sklearn).

πŸ“ The Essay Analogy

Imagine writing an essay with a word limit. You can't use unnecessary words - you must be concise!

Regularization does the same for models: It penalizes large weights, forcing the model to use only the most important features.

Types of Regularization

Type        | Penalty Term     | Effect                           | Best For
L1 (Lasso)  | Ξ£ |weights|      | Pushes some weights to exactly 0 | Feature selection (removes unimportant features)
L2 (Ridge)  | Ξ£ weightsΒ²       | Shrinks all weights toward 0     | When all features might be useful
Elastic Net | Mix of L1 + L2   | Combines both effects            | When you want balance
# Using Regularization in Sklearn
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge Regression (L2 Regularization)
ridge_model = Ridge(alpha=1.0)  # alpha controls regularization strength
ridge_model.fit(X_train, y_train)
print(f"Ridge RΒ²: {ridge_model.score(X_test, y_test):.3f}")

# Lasso Regression (L1 Regularization)
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)
print(f"Lasso RΒ²: {lasso_model.score(X_test, y_test):.3f}")

# Check which features Lasso removed (coefficients = 0)
# (feature_names is your list of column names, e.g. X_train.columns)
print("Features kept by Lasso:")
for feature, coef in zip(feature_names, lasso_model.coef_):
    if coef != 0:
        print(f"  {feature}: {coef:.2f}")

What each part does (in simple words)

Ridge(alpha=1.0) β€” Creates a Ridge model; alpha is how strong the penalty is.

ridge_model.fit(X_train, y_train) β€” Trains the model on your training data.

ridge_model.score(X_test, y_test) β€” Tells you how well it predicts (RΒ²).

Lasso(alpha=1.0) β€” Same idea but Lasso can set some weights to zero (drops features).

The for loop β€” Prints only the features Lasso kept (non-zero coefficients).


πŸ’‘ Choosing Alpha (Regularization Strength)

Higher alpha: Stronger regularization β†’ Simpler model β†’ May increase bias

Lower alpha: Weaker regularization β†’ More complex model β†’ May increase variance

Use cross-validation to find the best alpha!
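sklearn can run this search for you: RidgeCV (and LassoCV) fit the model across a grid of candidate alphas using cross-validation and keep the best one. A sketch on synthetic data; the alpha grid is illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)        # candidate penalties: 0.001 ... 1000
model = RidgeCV(alphas=alphas, cv=5)   # 5-fold CV over the grid
model.fit(X, y)

print(f"Best alpha: {model.alpha_}")
print(f"R2 on the data: {model.score(X, y):.3f}")
```

The chosen penalty is stored in model.alpha_; LassoCV works the same way and additionally reports the full alpha path it searched.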

Part 5: Cross-Validation

A single train-test split might give misleading results. Cross-validation provides a more reliable estimate!

πŸ“Š K-Fold Cross-Validation

Instead of one split, divide data into K parts (folds). Train on K-1 folds, test on 1 fold. Repeat K times!

Example (5-Fold): Each sample appears in the test set exactly once. Average of 5 scores gives reliable estimate.
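The "each sample appears in the test set exactly once" property can be verified directly with KFold; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=0)
tested = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
    tested.extend(test_idx)

# Every sample index 0..9 lands in exactly one test fold
print(sorted(tested))
```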

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# 5-Fold Cross-Validation
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-Validation Results:")
print(f"  Scores: {cv_scores}")
print(f"  Mean RΒ²: {cv_scores.mean():.3f}")
print(f"  Std Dev: {cv_scores.std():.3f}")

# Output:
# Cross-Validation Results:
#   Scores: [0.68, 0.71, 0.65, 0.69, 0.72]
#   Mean RΒ²: 0.690
#   Std Dev: 0.025  ← Low std = stable model!

# If std is HIGH, model has high variance (unstable)

Summary: Key Concepts

Concept          | What It Means                    | How to Address
High Bias        | Model too simple, underfits      | More complex model, more features
High Variance    | Model too complex, overfits      | Regularization, more data, simpler model
Gradient Descent | How models learn optimal weights | Tune learning rate, iterations
Learning Rate    | Step size in gradient descent    | Start with 0.01, adjust based on convergence
Regularization   | Penalty for complexity           | L1 for feature selection, L2 for shrinkage
Cross-Validation | Reliable model evaluation        | Use 5-10 folds, report mean Β± std

🎯 Golden Rules

  • Always compare train vs test performance to detect bias/variance
  • Use cross-validation, not just a single train-test split
  • Start simple, add complexity only if needed
  • Regularization is your friend against overfitting

🚫 Common Mistakes: Bias, Variance & Gradient Descent

  • Only looking at training score β€” You need train and test (or cross-validation) to tell bias from variance; low train + low test = bias; high train + low test = variance.
  • Learning rate too high β€” Loss bounces or diverges; too low and training is slow. Start with a small value and tune.
  • Using only L1 or only L2 β€” L1 can zero out features (sparsity); L2 shrinks weights. Choose (or use ElasticNet) based on whether you want feature selection.
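The sparsity difference in the last bullet is easy to see on data where only a few features matter. A sketch (synthetic data; the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 8 features, but only the first two actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso zeroed features:", int(np.sum(lasso.coef_ == 0)))  # most of the 6 noise features
print("Ridge zeroed features:", int(np.sum(ridge.coef_ == 0)))  # typically none
```

Lasso's L1 penalty pushes the useless coefficients to exactly zero (built-in feature selection), while Ridge only shrinks them toward zero.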

πŸ“˜ From the course notebook (Regularization)

The course source uses auto-mpg.csv: data = pd.read_csv("auto-mpg.csv"); LinearRegression, Ridge(alpha=...), Lasso(alpha=...). Ridge = L2 penalty; Lasso = L1 (can shrink coefficients to zero). Download auto_mpg.csv from the datasets page. See Regularization.pdf in the course source for slides.

Complete code from course notebook: regularization.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>

h1 { color: blue !important; }
h2 { color: green !important; }
h3 { color: purple !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# --- Code cell 5 ---
data = pd.read_csv("auto-mpg.csv")

# --- Code cell 6 ---
#Features in data set

#cylinders: contains the number of cylinders present in the car

#displacement: contains the Displacement of the car

#horsepower: contains the Horsepower of the car

#weight: contains the weight of the car

#acceleration: contains the Acceleration of the car

#model_year: contains the model year of the car

#origin: contains the origin country which car belong to

#car_name: contains the name of the car(Brand-Model-Variant)


#predict Miles per Gallon
#mpg: contains the fuel consumption value(in Miles per Gallon) for car

# --- Code cell 7 ---
data.head(15)

# --- Code cell 8 ---
data.info()

# --- Code cell 11 ---
data['horsepower'] = data['horsepower'].str.replace('?','NaN').astype(float)
data['horsepower'].fillna(data['horsepower'].mean(),inplace=True)
data['horsepower'] = data['horsepower'].astype(int)

# --- Code cell 12 ---
data.info()

# --- Code cell 13 ---
data.describe(include='all').round(2)

# --- Code cell 15 ---
data.columns

# --- Code cell 17 ---
#Correlation of output with numerical variables
numerical_columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight','acceleration']
# plotting correlation heatmap
dataplot = sns.heatmap(data[numerical_columns].corr(), cmap="YlGnBu", annot=True)
  
# displaying heatmap
plt.show()

# --- Code cell 19 ---
data = pd.get_dummies(data,columns=['origin','model year'])  # create features
data.drop(columns=['car name'],axis=1,inplace=True) # drop unwanted data
data.head(10)

# --- Code cell 20 ---
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(data, test_size=0.20, random_state=0)
y_train = x_train.pop('mpg')
y_test = x_test.pop('mpg')

# --- Code cell 21 ---

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
num_vars = ['cylinders', 'displacement', 'horsepower', 'weight','acceleration']
x_train[num_vars] = scaler.fit_transform(x_train[num_vars])
x_test[num_vars] = scaler.transform(x_test[num_vars])

# --- Code cell 22 ---
print(x_train.head(10))

# --- Code cell 26 ---

# Try with different values of regularization parameter alpha
lasso = Lasso(alpha=0.1) #alpha` must be a non-negative float i.e. in `[0, inf)
lasso.fit(x_train,y_train)
for z in range(len(list(x_train.columns))):
    print("Lasso: The coefficient for {} is {}".format(x_train.columns[z], lasso.coef_[z]))

# --- Code cell 27 ---
from sklearn.metrics import r2_score
y_test_pred = lasso.predict(x_test)
r2_score(y_test, y_test_pred)

# --- Code cell 30 ---
# L2 Regularization

ridge = Ridge(alpha=10.0) #alpha` must be a non-negative float i.e. in `[0, inf)
ridge.fit(x_train,y_train)
for z in range(len(list(x_train.columns))):
    print("Ridge: The coefficient for {} is {}".format(x_train.columns[z], ridge.coef_[z]))

# --- Code cell 31 ---
y_test_pred = ridge.predict(x_test)
r2_score(y_test, y_test_pred)

Complete code from course notebook: Impact_of_Lambda.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# --- Code cell 2 ---
# 1. Generate a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 2 - 1, axis=0) # Generate 100 points between -1 and 1
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.2, 100) # True function sin(2*pi*x) with noise

# --- Code cell 3 ---
# 2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Code cell 4 ---
# 3. Use polynomial features to create a high-dimensional space
degree = 15
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# --- Code cell 5 ---
# 4. Train a simple Linear Regression model (no regularization) as a baseline
lr = LinearRegression()
lr.fit(X_train_poly, y_train)

# --- Code cell 6 ---
# 5. Train Ridge Regression models with different lambda (alpha) values
lambdas = [0.001, 0.1, 10]
models = []
for l in lambdas:
    ridge = Ridge(alpha=l)
    ridge.fit(X_train_poly, y_train)
    models.append(ridge)

# --- Code cell 7 ---
# 6. Plotting the results
plt.figure(figsize=(12, 8))
plt.scatter(X_train, y_train, s=20, c='b', label='Training data')
plt.scatter(X_test, y_test, s=20, c='r', label='Test data')

# Create a smooth line for plotting the fits
x_line = np.linspace(-1, 1, 100).reshape(-1, 1)
x_line_poly = poly.transform(x_line)

# Plot Linear Regression fit
y_line_lr = lr.predict(x_line_poly)
plt.plot(x_line, y_line_lr, c='k', linestyle='--', label='Linear Regression (No Regularization)')

# Plot Ridge Regression fits
line_styles = ['-', ':', '-.']
for i, l in enumerate(lambdas):
    y_line_ridge = models[i].predict(x_line_poly)
    plt.plot(x_line, y_line_ridge, linestyle=line_styles[i], label=f'Ridge ($\\lambda$={l})')

plt.title('Impact of $\\lambda$ on Model Fit')
plt.xlabel('$X$')
plt.ylabel('$y$')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# --- Code cell 8 ---
# 7. Print coefficients and R-squared scores
print("--------------------------------------------------")
print("Model Coefficients and Performance")
print("--------------------------------------------------")

# Linear Regression
print("\nLinear Regression (No Regularization):")
print(f"  Coefficients: {np.round(lr.coef_, 2)}")
print(f"  R-squared (Train): {r2_score(y_train, lr.predict(X_train_poly)):.4f}")
print(f"  R-squared (Test): {r2_score(y_test, lr.predict(X_test_poly)):.4f}")

# Ridge Regression
for i, l in enumerate(lambdas):
    y_train_pred = models[i].predict(X_train_poly)
    y_test_pred = models[i].predict(X_test_poly)
    print(f"\nRidge ($\\lambda$={l}):")
    print(f"  Coefficients: {np.round(models[i].coef_, 2)}")
    print(f"  R-squared (Train): {r2_score(y_train, y_train_pred):.4f}")
    print(f"  R-squared (Test): {r2_score(y_test, y_test_pred):.4f}")

πŸ’­ Short reflection

In one sentence: why does a very high learning rate in gradient descent lead to unstable training (loss jumping around) instead of converging?

βœ… CORE (Must know)

  • Bias: error from wrong assumptions (underfitting); high bias = model too simple.
  • Variance: error from sensitivity to training data (overfitting); high variance = model too complex.
  • Bias–Variance tradeoff: Total Error = BiasΒ² + Variance + Irreducible noise; we balance both.
  • Gradient descent: iteratively update weights by moving in the direction that reduces loss; new_weight = old_weight βˆ’ learning_rate Γ— gradient.
  • Learning rate: step size; too high = unstable; too low = slow convergence.
  • Regularization: L1 (Lasso) and L2 (Ridge) penalize large weights to reduce overfitting.
  • Use train vs test (or cross-validation) to spot overfitting and underfitting.

πŸ“š NON-CORE (Good to know)

  • Stochastic vs batch gradient descent; mini-batch.
  • Momentum and adaptive learning rates (Adam, AdaGrad).
  • Early stopping as a form of regularization.
  • Why we square bias in the decomposition.