βš–οΈ Bias, Variance & Gradient Descent

Understand why models fail and how they learn. Master the fundamental concepts behind all ML algorithms!

Part 1: The Bias-Variance Tradeoff

Every ML model makes errors. These errors come from two sources: Bias and Variance. Understanding this tradeoff is KEY to building good models!

πŸ‘Ά In One Sentence

Bias = the model is systematically wrong (e.g. too simple, so it keeps missing the truth). Variance = the model is unstable (e.g. too sensitive to the exact training set, so it behaves differently on new data). We try to balance both so total error is as small as possible.

πŸ“ Total Error Formula

Total Error = BiasΒ² + Variance + Irreducible Noise
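For squared-error loss, this decomposition has a standard formal statement (the expectation is taken over random draws of the training set; σ² is the variance of the irreducible noise):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  \;+\; \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Here f(x) is the true function and fΜ‚(x) is the model's prediction; no amount of model tuning can remove the σ² term.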

What is Bias?

Bias = Error from wrong assumptions. A high-bias model is too simple and misses important patterns.

🎯 Archery Analogy: Bias

Imagine shooting arrows at a target. High Bias = Your arrows consistently miss the center, landing in the same wrong spot every time.

The bow is miscalibrated! No matter how many times you shoot, you'll always miss in the same direction.

What is Variance?

Variance = Error from sensitivity to training data. A high-variance model changes dramatically with different training sets.

🎯 Archery Analogy: Variance

High Variance = Your arrows scatter all over the place - sometimes left, sometimes right, sometimes high, sometimes low.

Your hand is shaky! Even if on average you hit the center, individual shots are unpredictable.

πŸ˜”

High Bias (Underfitting)

Model too simple

Misses patterns

Low train & test accuracy

Fix: Use more complex model, add features

😡

High Variance (Overfitting)

Model too complex

Memorizes noise

High train, low test accuracy

Fix: Simplify model, regularization, more data

🎯

Balanced (Good Fit)

Model just right

Captures true patterns

Good train & test accuracy

Goal: Sweet spot between simple & complex

πŸ“Š Bias-Variance Tradeoff Visualization

As Model Complexity ↑

Bias ↓ (better at fitting)

Variance ↑ (more sensitive)

Sweet Spot

Minimum Total Error

Best Generalization

Part 2: Detecting Bias & Variance

Symptom          | Training Accuracy | Test Accuracy | Diagnosis     | Solution
Both low         | 60%               | 55%           | High Bias     | More complex model, more features
Big gap          | 98%               | 65%           | High Variance | Regularization, more data, simplify
Both good, close | 88%               | 85%           | Good Fit!     | You're doing great! πŸŽ‰
# Detecting Bias vs Variance in Python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Split data (X, y are your feature matrix and target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Simple model (might have high bias)
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

train_score = r2_score(y_train, simple_model.predict(X_train))
test_score = r2_score(y_test, simple_model.predict(X_test))

print(f"Simple Model - Train: {train_score:.2f}, Test: {test_score:.2f}")

# Complex model (might have high variance)
complex_model = DecisionTreeRegressor(max_depth=None)  # No limit = overfits!
complex_model.fit(X_train, y_train)

train_score = r2_score(y_train, complex_model.predict(X_train))
test_score = r2_score(y_test, complex_model.predict(X_test))

print(f"Complex Model - Train: {train_score:.2f}, Test: {test_score:.2f}")

# Output:
# Simple Model - Train: 0.65, Test: 0.62  ← High Bias (both low)
# Complex Model - Train: 1.00, Test: 0.55 ← High Variance (big gap!)
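Beyond a single split, sklearn's learning_curve shows how the train/test gap evolves as the training set grows (a large gap that shrinks with more data is the signature of high variance). A sketch on synthetic data; the dataset and variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=300)

# An unconstrained tree tends to overfit (high variance)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="r2")

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:3d}  train R2={tr:.2f}  test R2={te:.2f}")
```

The train score stays near 1.0 (the tree memorizes) while the test score lags behind it, the same "big gap" pattern as in the table above.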

Part 3: Gradient Descent (How Models Learn)

Gradient Descent is the engine that powers most ML algorithms. It's how models find the best parameters!

⛰️ The Blindfolded Hiker Analogy

Imagine you're blindfolded on a mountain and need to reach the lowest valley.

Strategy: Feel the ground around you. Step in the direction that goes downhill. Repeat until you can't go any lower.

Gradient Descent does the same! It measures the "slope" of the error and moves parameters in the direction that reduces error.

The Process

1
Start with Random Weights

Initialize model parameters randomly (or with zeros)

2
Calculate Error (Loss)

Measure how wrong the predictions are

3
Compute Gradient

Find the direction that reduces error the most

4
Update Weights

Move parameters in that direction (by learning rate amount)

5
Repeat Until Convergence

Stop when error stops decreasing significantly

πŸ“ Weight Update Formula

new_weight = old_weight - learning_rate Γ— gradient

Learning Rate: The Step Size

Learning rate controls how big each step is. Too big or too small causes problems!

🐒

Learning Rate Too Small

Takes forever to converge

Might get stuck

Wastes computation

🦘

Learning Rate Too Large

Overshoots the minimum

Bounces around chaotically

May never converge!

πŸ‘

Learning Rate Just Right

Converges smoothly

Finds good minimum

Efficient training
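The three regimes are easy to see on a toy 1-D loss L(w) = wΒ², whose gradient is 2w, so each update is w ← w βˆ’ learning_rate Γ— 2w. A minimal sketch (the step counts and rates are illustrative):

```python
def descend(lr, w=5.0, steps=20):
    """Run gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w  # new_weight = old_weight - learning_rate * gradient
    return w

print(f"lr=0.01 (too small):  w = {descend(0.01):.3f}")   # creeps slowly toward 0
print(f"lr=0.4  (just right): w = {descend(0.4):.6f}")    # converges quickly
print(f"lr=1.1  (too large):  w = {descend(1.1):.1f}")    # overshoots and diverges!
```

With lr = 1.1 each step multiplies w by (1 βˆ’ 2.2) = βˆ’1.2, so the iterate flips sign and grows without bound, exactly the "bounces around chaotically" failure mode.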

# Simple Gradient Descent Implementation
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    # Initialize weights randomly
    n_features = X.shape[1]
    weights = np.random.randn(n_features)
    bias = 0
    
    m = len(y)  # Number of samples
    
    for i in range(iterations):
        # Make predictions with the current parameters
        predictions = np.dot(X, weights) + bias
        
        # Step 2: Calculate error (residuals; MSE is the mean of their squares)
        error = predictions - y
        
        # Step 3: Compute gradients of MSE w.r.t. weights and bias
        gradient_weights = (1/m) * np.dot(X.T, error)
        gradient_bias = (1/m) * np.sum(error)
        
        # Step 4: Update parameters (move against the gradient)
        weights = weights - learning_rate * gradient_weights
        bias = bias - learning_rate * gradient_bias
        
        # Print progress every 100 iterations
        if i % 100 == 0:
            mse = np.mean(error**2)
            print(f"Iteration {i}: MSE = {mse:.4f}")
    
    return weights, bias

# Example usage
# weights, bias = gradient_descent(X_train, y_train)
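As a quick sanity check, the same update rule should recover known coefficients from synthetic linear data. The loop is restated compactly here so the snippet runs standalone; the data (y = 3x + 2 plus noise) and the learning rate are illustrative:

```python
import numpy as np

# Synthetic data with known weight 3 and bias 2
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

# Compact gradient descent (same updates as the function above)
w, b = np.zeros(1), 0.0
for _ in range(2000):
    err = X @ w + b - y
    w -= 0.1 * (X.T @ err) / len(y)
    b -= 0.1 * err.mean()

print(f"learned weight β‰ˆ {w[0]:.2f}, bias β‰ˆ {b:.2f}")  # should land close to 3 and 2
```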

Part 4: Regularization (Fighting Overfitting)

Regularization adds a penalty for complex models, forcing them to be simpler and generalize better.

πŸ“₯ Dataset for Regularization (Ridge/Lasso with real data)

auto_mpg.csv β€” Miles per gallon and car features (cylinders, displacement, horsepower, weight, etc.). Used in regularization practice to predict mpg with Ridge/Lasso.

Download auto_mpg.csv β€” Save in the same folder as your script; use pd.read_csv("auto_mpg.csv").

πŸ“– Full code walkthrough (every line explained)

Impact of Lambda (Ξ±)

In Ridge and Lasso, alpha (often written Ξ» in theory) controls how strong the penalty is:

  • Small alpha β†’ weak penalty β†’ behaves almost like plain linear regression (risk of overfitting / high variance).
  • Large alpha β†’ strong penalty β†’ coefficients shrink hard toward zero (risk of underfitting / high bias).

So: tuning alpha is key. Use cross-validation to pick the best value (e.g. RidgeCV or LassoCV in sklearn).

πŸ“ The Essay Analogy

Imagine writing an essay with a word limit. You can't use unnecessary words - you must be concise!

Regularization does the same for models: It penalizes large weights, forcing the model to use only the most important features.

Types of Regularization

Type        | Penalty Term     | Effect                           | Best For
L1 (Lasso)  | Ξ£ |weights|      | Pushes some weights to exactly 0 | Feature selection (removes unimportant features)
L2 (Ridge)  | Ξ£ weightsΒ²       | Shrinks all weights toward 0     | When all features might be useful
Elastic Net | Mix of L1 + L2   | Combines both effects            | When you want balance
# Using Regularization in Sklearn
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge Regression (L2 Regularization)
ridge_model = Ridge(alpha=1.0)  # alpha controls regularization strength
ridge_model.fit(X_train, y_train)
print(f"Ridge RΒ²: {ridge_model.score(X_test, y_test):.3f}")

# Lasso Regression (L1 Regularization)
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)
print(f"Lasso RΒ²: {lasso_model.score(X_test, y_test):.3f}")

# Check which features Lasso removed (coefficients = 0)
# (feature_names is your list of column names, e.g. X_train.columns)
print("Features kept by Lasso:")
for feature, coef in zip(feature_names, lasso_model.coef_):
    if coef != 0:
        print(f"  {feature}: {coef:.2f}")

What each part does (in simple words)

Ridge(alpha=1.0) β€” Creates a Ridge model; alpha is how strong the penalty is.

ridge_model.fit(X_train, y_train) β€” Trains the model on your training data.

ridge_model.score(X_test, y_test) β€” Tells you how well it predicts (RΒ²).

Lasso(alpha=1.0) β€” Same idea but Lasso can set some weights to zero (drops features).

The for loop β€” Prints only the features Lasso kept (non-zero coefficients).


πŸ’‘ Choosing Alpha (Regularization Strength)

Higher alpha: Stronger regularization β†’ Simpler model β†’ May increase bias

Lower alpha: Weaker regularization β†’ More complex model β†’ May increase variance

Use cross-validation to find the best alpha!
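sklearn can run this search for you: RidgeCV (and LassoCV) fit the model across a grid of candidate alphas using cross-validation and keep the best one. A sketch on synthetic data; the alpha grid is illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)        # candidate penalties: 0.001 ... 1000
model = RidgeCV(alphas=alphas, cv=5)   # 5-fold CV over the grid
model.fit(X, y)

print(f"Best alpha: {model.alpha_}")
print(f"R2 on the data: {model.score(X, y):.3f}")
```

The chosen penalty is stored in model.alpha_; LassoCV works the same way and additionally reports the full alpha path it searched.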

Part 5: Cross-Validation

A single train-test split might give misleading results. Cross-validation provides a more reliable estimate!

πŸ“Š K-Fold Cross-Validation

Instead of one split, divide data into K parts (folds). Train on K-1 folds, test on 1 fold. Repeat K times!

Example (5-Fold): Each sample appears in the test set exactly once. Average of 5 scores gives reliable estimate.
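The "each sample appears in the test set exactly once" property can be verified directly with KFold; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=0)
tested = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
    tested.extend(test_idx)

# Every sample index 0..9 lands in exactly one test fold
print(sorted(tested))
```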

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# 5-Fold Cross-Validation
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-Validation Results:")
print(f"  Scores: {cv_scores}")
print(f"  Mean RΒ²: {cv_scores.mean():.3f}")
print(f"  Std Dev: {cv_scores.std():.3f}")

# Output:
# Cross-Validation Results:
#   Scores: [0.68, 0.71, 0.65, 0.69, 0.72]
#   Mean RΒ²: 0.690
#   Std Dev: 0.025  ← Low std = stable model!

# If std is HIGH, model has high variance (unstable)

Summary: Key Concepts

Concept          | What It Means                    | How to Address
High Bias        | Model too simple, underfits      | More complex model, more features
High Variance    | Model too complex, overfits      | Regularization, more data, simpler model
Gradient Descent | How models learn optimal weights | Tune learning rate, iterations
Learning Rate    | Step size in gradient descent    | Start with 0.01, adjust based on convergence
Regularization   | Penalty for complexity           | L1 for feature selection, L2 for shrinkage
Cross-Validation | Reliable model evaluation        | Use 5-10 folds, report mean Β± std

🎯 Golden Rules

  • Always compare train vs test performance to detect bias/variance
  • Use cross-validation, not just a single train-test split
  • Start simple, add complexity only if needed
  • Regularization is your friend against overfitting

🚫 Common Mistakes: Bias, Variance & Gradient Descent

  • Only looking at training score β€” You need train and test (or cross-validation) to tell bias from variance; low train + low test = bias; high train + low test = variance.
  • Learning rate too high β€” Loss bounces or diverges; too low and training is slow. Start with a small value and tune.
  • Using only L1 or only L2 β€” L1 can zero out features (sparsity); L2 shrinks weights. Choose (or use ElasticNet) based on whether you want feature selection.
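The sparsity difference in the last bullet is easy to see on data where only a few features matter. A sketch (synthetic data; the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 8 features, but only the first two actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso zeroed features:", int(np.sum(lasso.coef_ == 0)))  # most of the 6 noise features
print("Ridge zeroed features:", int(np.sum(ridge.coef_ == 0)))  # typically none
```

Lasso's L1 penalty pushes the useless coefficients to exactly zero (built-in feature selection), while Ridge only shrinks them toward zero.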

πŸ“˜ From the course notebook (Regularization)

The course source uses auto-mpg.csv: data = pd.read_csv("auto-mpg.csv"); LinearRegression, Ridge(alpha=...), Lasso(alpha=...). Ridge = L2 penalty; Lasso = L1 (can shrink coefficients to zero). Download auto_mpg.csv from the datasets page. See Regularization.pdf in the course source for slides.

Complete code from course notebook: regularization.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>

h1 { color: blue !important; }
h2 { color: green !important; }
h3 { color: purple !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# --- Code cell 5 ---
data = pd.read_csv("auto-mpg.csv")

# --- Code cell 6 ---
#Features in data set

#cylinders: contains the number of cylinders present in the car

#displacement: contains the Displacement of the car

#horsepower: contains the Horsepower of the car

#weight: contains the weight of the car

#acceleration: contains the Acceleration of the car

#model_year: contains the model year of the car

#origin: contains the origin country which car belong to

#car_name: contains the name of the car(Brand-Model-Variant)


#predict Miles per Gallon
#mpg: contains the fuel consumption value(in Miles per Gallon) for car

# --- Code cell 7 ---
data.head(15)

# --- Code cell 8 ---
data.info()

# --- Code cell 11 ---
data['horsepower'] = data['horsepower'].str.replace('?','NaN').astype(float)
data['horsepower'].fillna(data['horsepower'].mean(),inplace=True)
data['horsepower'] = data['horsepower'].astype(int)

# --- Code cell 12 ---
data.info()

# --- Code cell 13 ---
data.describe(include='all').round(2)

# --- Code cell 15 ---
data.columns

# --- Code cell 17 ---
#Correlation of output with numerical variables
numerical_columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight','acceleration']
# plotting correlation heatmap
dataplot = sns.heatmap(data[numerical_columns].corr(), cmap="YlGnBu", annot=True)
  
# displaying heatmap
plt.show()

# --- Code cell 19 ---
data = pd.get_dummies(data,columns=['origin','model year'])  # create features
data.drop(columns=['car name'],axis=1,inplace=True) # drop unwanted data
data.head(10)

# --- Code cell 20 ---
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(data, test_size=0.20, random_state=0)
y_train = x_train.pop('mpg')
y_test = x_test.pop('mpg')

# --- Code cell 21 ---

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
num_vars = ['cylinders', 'displacement', 'horsepower', 'weight','acceleration']
x_train[num_vars] = scaler.fit_transform(x_train[num_vars])
x_test[num_vars] = scaler.transform(x_test[num_vars])

# --- Code cell 22 ---
print(x_train.head(10))

# --- Code cell 26 ---

# Try with different values of regularization parameter alpha
lasso = Lasso(alpha=0.1) #alpha` must be a non-negative float i.e. in `[0, inf)
lasso.fit(x_train,y_train)
for z in range(len(list(x_train.columns))):
    print("Lasso: The coefficient for {} is {}".format(x_train.columns[z], lasso.coef_[z]))

# --- Code cell 27 ---
from sklearn.metrics import r2_score
y_test_pred = lasso.predict(x_test)
r2_score(y_test, y_test_pred)

# --- Code cell 30 ---
# L2 Regularization

ridge = Ridge(alpha=10.0) #alpha` must be a non-negative float i.e. in `[0, inf)
ridge.fit(x_train,y_train)
for z in range(len(list(x_train.columns))):
    print("Ridge: The coefficient for {} is {}".format(x_train.columns[z], ridge.coef_[z]))

# --- Code cell 31 ---
y_test_pred = ridge.predict(x_test)
r2_score(y_test, y_test_pred)

Complete code from course notebook: Impact_of_Lambda.ipynb

Every line of code (verbatim).

# --- Code cell 1 ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# --- Code cell 2 ---
# 1. Generate a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 2 - 1, axis=0) # Generate 100 points between -1 and 1
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.2, 100) # True function sin(2*pi*x) with noise

# --- Code cell 3 ---
# 2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Code cell 4 ---
# 3. Use polynomial features to create a high-dimensional space
degree = 15
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# --- Code cell 5 ---
# 4. Train a simple Linear Regression model (no regularization) as a baseline
lr = LinearRegression()
lr.fit(X_train_poly, y_train)

# --- Code cell 6 ---
# 5. Train Ridge Regression models with different lambda (alpha) values
lambdas = [0.001, 0.1, 10]
models = []
for l in lambdas:
    ridge = Ridge(alpha=l)
    ridge.fit(X_train_poly, y_train)
    models.append(ridge)

# --- Code cell 7 ---
# 6. Plotting the results
plt.figure(figsize=(12, 8))
plt.scatter(X_train, y_train, s=20, c='b', label='Training data')
plt.scatter(X_test, y_test, s=20, c='r', label='Test data')

# Create a smooth line for plotting the fits
x_line = np.linspace(-1, 1, 100).reshape(-1, 1)
x_line_poly = poly.transform(x_line)

# Plot Linear Regression fit
y_line_lr = lr.predict(x_line_poly)
plt.plot(x_line, y_line_lr, c='k', linestyle='--', label='Linear Regression (No Regularization)')

# Plot Ridge Regression fits
line_styles = ['-', ':', '-.']
for i, l in enumerate(lambdas):
    y_line_ridge = models[i].predict(x_line_poly)
    plt.plot(x_line, y_line_ridge, linestyle=line_styles[i], label=f'Ridge ($\\lambda$={l})')

plt.title('Impact of $\\lambda$ on Model Fit')
plt.xlabel('$X$')
plt.ylabel('$y$')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# --- Code cell 8 ---
# 7. Print coefficients and R-squared scores
print("--------------------------------------------------")
print("Model Coefficients and Performance")
print("--------------------------------------------------")

# Linear Regression
print("\nLinear Regression (No Regularization):")
print(f"  Coefficients: {np.round(lr.coef_, 2)}")
print(f"  R-squared (Train): {r2_score(y_train, lr.predict(X_train_poly)):.4f}")
print(f"  R-squared (Test): {r2_score(y_test, lr.predict(X_test_poly)):.4f}")

# Ridge Regression
for i, l in enumerate(lambdas):
    y_train_pred = models[i].predict(X_train_poly)
    y_test_pred = models[i].predict(X_test_poly)
    print(f"\nRidge ($\\lambda$={l}):")
    print(f"  Coefficients: {np.round(models[i].coef_, 2)}")
    print(f"  R-squared (Train): {r2_score(y_train, y_train_pred):.4f}")
    print(f"  R-squared (Test): {r2_score(y_test, y_test_pred):.4f}")

πŸ’­ Short reflection

In one sentence: why does a very high learning rate in gradient descent lead to unstable training (loss jumping around) instead of converging?

βœ… CORE (Must know)

  • Bias: error from wrong assumptions (underfitting); high bias = model too simple.
  • Variance: error from sensitivity to training data (overfitting); high variance = model too complex.
  • Bias–Variance tradeoff: Total Error = BiasΒ² + Variance + Irreducible noise; we balance both.
  • Gradient descent: iteratively update weights by moving in the direction that reduces loss; new_weight = old_weight βˆ’ learning_rate Γ— gradient.
  • Learning rate: step size; too high = unstable; too low = slow convergence.
  • Regularization: L1 (Lasso) and L2 (Ridge) penalize large weights to reduce overfitting.
  • Use train vs test (or cross-validation) to spot overfitting and underfitting.

πŸ“š NON-CORE (Good to know)

  • Stochastic vs batch gradient descent; mini-batch.
  • Momentum and adaptive learning rates (Adam, AdaGrad).
  • Early stopping as a form of regularization.
  • Why we square bias in the decomposition.