Understand why models fail and how they learn. Master the fundamental concepts behind all ML algorithms!
Every ML model makes errors. These errors come from two sources: Bias and Variance. Understanding this tradeoff is KEY to building good models!
Bias = the model is systematically wrong (e.g. too simple, so it keeps missing the truth). Variance = the model is unstable (e.g. too sensitive to the exact training set, so it behaves differently on new data). We try to balance both so total error is as small as possible.
Bias = Error from wrong assumptions. A high-bias model is too simple and misses important patterns.
Imagine shooting arrows at a target. High Bias = Your arrows consistently miss the center, landing in the same wrong spot every time.
The bow is miscalibrated! No matter how many times you shoot, you'll always miss in the same direction.
Variance = Error from sensitivity to training data. A high-variance model changes dramatically with different training sets.
High Variance = Your arrows scatter all over the place - sometimes left, sometimes right, sometimes high, sometimes low.
Your hand is shaky! Even if on average you hit the center, individual shots are unpredictable.
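The archery analogy can be simulated: train a simple and a complex model on many different training sets drawn from the same distribution, then compare how much their predictions at one fixed point scatter. This is a minimal sketch on synthetic data (the sine-plus-noise data, query point, and repeat count are illustrative assumptions, not from the course source):

```python
# Sketch: measuring prediction variance across resampled training sets.
# A linear model (high bias, stable) vs an unpruned tree (high variance).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0 = np.array([[0.5]])             # fixed query point
preds_linear, preds_tree = [], []
for _ in range(200):               # 200 independent training sets
    X = rng.rand(50, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 50)
    preds_linear.append(LinearRegression().fit(X, y).predict(x0)[0])
    preds_tree.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])

# The tree's predictions scatter far more than the linear model's:
print(f"Linear model: spread of predictions = {np.std(preds_linear):.3f}")
print(f"Deep tree:    spread of predictions = {np.std(preds_tree):.3f}")
```

The linear model misses the sine shape every time (bias), but barely moves between training sets; the unpruned tree chases the noise in each individual sample (variance).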
High Bias (Underfitting):
Model too simple
Misses patterns
Low train & test accuracy
Fix: Use a more complex model, add features
High Variance (Overfitting):
Model too complex
Memorizes noise
High train, low test accuracy
Fix: Simplify the model, add regularization, get more data
Good Fit:
Model just right
Captures true patterns
Good train & test accuracy
Goal: Sweet spot between simple & complex
As Model Complexity ↑
Bias ↓ (better at fitting)
Variance ↑ (more sensitive)
Sweet Spot
Minimum Total Error
Best Generalization
| Symptom | Training Accuracy | Test Accuracy | Diagnosis | Solution |
|---|---|---|---|---|
| Both low | 60% | 55% | High Bias | More complex model, more features |
| Big gap | 98% | 65% | High Variance | Regularization, more data, simplify |
| Both good, close | 88% | 85% | Good Fit! | You're doing great! |
```python
# Detecting Bias vs Variance in Python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Simple model (might have high bias)
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)
train_score = r2_score(y_train, simple_model.predict(X_train))
test_score = r2_score(y_test, simple_model.predict(X_test))
print(f"Simple Model - Train: {train_score:.2f}, Test: {test_score:.2f}")

# Complex model (might have high variance)
complex_model = DecisionTreeRegressor(max_depth=None)  # No depth limit = overfits!
complex_model.fit(X_train, y_train)
train_score = r2_score(y_train, complex_model.predict(X_train))
test_score = r2_score(y_test, complex_model.predict(X_test))
print(f"Complex Model - Train: {train_score:.2f}, Test: {test_score:.2f}")

# Output:
# Simple Model - Train: 0.65, Test: 0.62   → High Bias (both low)
# Complex Model - Train: 1.00, Test: 0.55  → High Variance (big gap!)
```
Gradient Descent is the engine that powers most ML algorithms. It's how models find the best parameters!
Imagine you're blindfolded on a mountain and need to reach the lowest valley.
Strategy: Feel the ground around you. Step in the direction that goes downhill. Repeat until you can't go any lower.
Gradient Descent does the same! It measures the "slope" of the error and moves parameters in the direction that reduces error.
Initialize model parameters randomly (or with zeros)
Measure how wrong the predictions are
Find the direction that reduces error the most
Move parameters in that direction (by learning rate amount)
Stop when error stops decreasing significantly
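The steps above are the standard update rule. With parameters $\theta$ and learning rate $\eta$, each iteration computes (notation matches the implementation later in this section):

```latex
% One gradient descent step:
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} J(\theta_t)

% For MSE loss J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2,
% with predictions \hat{y} = Xw + b, the gradients are:
\frac{\partial J}{\partial w} = \frac{1}{m} X^{\top}(\hat{y} - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
```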
Learning rate controls how big each step is. Too big or too small causes problems!
Learning rate too small:
Takes forever to converge
Might get stuck
Wastes computation
Learning rate too big:
Overshoots the minimum
Bounces around chaotically
May never converge!
Learning rate just right:
Converges smoothly
Finds a good minimum
Efficient training
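These three regimes are easy to see on the simplest possible loss, $f(w) = w^2$ (gradient $2w$, minimum at $w = 0$). This is a toy sketch; the specific learning rates are chosen purely for illustration:

```python
# Sketch: effect of learning rate on gradient descent for f(w) = w^2.
def descend(learning_rate, steps=20, w=1.0):
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # gradient of w^2 is 2w
    return w

print(descend(0.01))  # too small: still far from 0 after 20 steps
print(descend(0.4))   # just right: essentially at the minimum
print(descend(1.1))   # too big: |w| grows every step -> diverges
```

With a rate above 1.0 here, each step multiplies $w$ by a factor whose magnitude exceeds 1, so the iterate jumps across the minimum and lands farther away each time.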
```python
# Simple Gradient Descent Implementation
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    # Step 1: Initialize weights randomly
    n_features = X.shape[1]
    weights = np.random.randn(n_features)
    bias = 0
    m = len(y)  # Number of samples

    for i in range(iterations):
        # Step 2: Make predictions
        predictions = np.dot(X, weights) + bias

        # Step 3: Calculate error (MSE)
        error = predictions - y

        # Step 4: Compute gradients
        gradient_weights = (1/m) * np.dot(X.T, error)
        gradient_bias = (1/m) * np.sum(error)

        # Step 5: Update weights
        weights = weights - learning_rate * gradient_weights
        bias = bias - learning_rate * gradient_bias

        # Print progress every 100 iterations
        if i % 100 == 0:
            mse = np.mean(error**2)
            print(f"Iteration {i}: MSE = {mse:.4f}")

    return weights, bias

# Example usage
# weights, bias = gradient_descent(X_train, y_train)
```
Regularization adds a penalty for complex models, forcing them to be simpler and generalize better.
auto_mpg.csv → Miles per gallon and car features (cylinders, displacement, horsepower, weight, etc.). Used in the regularization practice to predict mpg with Ridge/Lasso.
Download auto_mpg.csv → Save it in the same folder as your script; load it with pd.read_csv("auto_mpg.csv").
In Ridge and Lasso, alpha (often written λ in theory) controls how strong the penalty is: higher alpha means a stronger penalty (simpler model), lower alpha a weaker penalty (more flexible model).
So tuning alpha is key. Use cross-validation to pick the best value (e.g. RidgeCV or LassoCV in sklearn).
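A minimal sketch of that tuning step, using sklearn's built-in cross-validated estimators (the alpha grid and the synthetic regression data are illustrative assumptions, not course values):

```python
# Sketch: picking alpha with built-in cross-validation (RidgeCV / LassoCV).
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)          # efficient LOO-CV over the grid
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)    # 5-fold CV over the grid

print(f"Best Ridge alpha: {ridge_cv.alpha_}")
print(f"Best Lasso alpha: {lasso_cv.alpha_}")
```

Both estimators expose the winning value in `.alpha_` and are then already refit on the full data, ready to predict.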
Imagine writing an essay with a word limit. You can't use unnecessary words - you must be concise!
Regularization does the same for models: It penalizes large weights, forcing the model to use only the most important features.
| Type | Penalty Term | Effect | Best For |
|---|---|---|---|
| L1 (Lasso) | Sum of \|weights\| | Pushes some weights to exactly 0 | Feature selection (removes unimportant features) |
| L2 (Ridge) | Sum of weights² | Shrinks all weights toward 0 | When all features might be useful |
| Elastic Net | Mix of L1 + L2 | Combines both effects | When you want balance |
```python
# Using Regularization in Sklearn
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge Regression (L2 Regularization)
ridge_model = Ridge(alpha=1.0)  # alpha controls regularization strength
ridge_model.fit(X_train, y_train)
print(f"Ridge R²: {ridge_model.score(X_test, y_test):.3f}")

# Lasso Regression (L1 Regularization)
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)
print(f"Lasso R²: {lasso_model.score(X_test, y_test):.3f}")

# Check which features Lasso removed (coefficients = 0)
print("Features kept by Lasso:")
for feature, coef in zip(feature_names, lasso_model.coef_):
    if coef != 0:
        print(f"  {feature}: {coef:.2f}")
```
Ridge(alpha=1.0) → Creates a Ridge model; alpha sets how strong the penalty is.
ridge_model.fit(X_train, y_train) → Trains the model on your training data.
ridge_model.score(X_test, y_test) → Tells you how well it predicts (R²).
Lasso(alpha=1.0) → Same idea, but Lasso can set some weights to exactly zero (drops features).
The for loop → Prints only the features Lasso kept (non-zero coefficients).
Higher alpha: Stronger regularization → Simpler model → May increase bias
Lower alpha: Weaker regularization → More complex model → May increase variance
Use cross-validation to find the best alpha!
A single train-test split might give misleading results. Cross-validation provides a more reliable estimate!
Instead of one split, divide data into K parts (folds). Train on K-1 folds, test on 1 fold. Repeat K times!
Example (5-Fold): Each sample appears in the test set exactly once. Average of 5 scores gives reliable estimate.
```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# 5-Fold Cross-Validation
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-Validation Results:")
print(f"  Scores: {cv_scores}")
print(f"  Mean R²: {cv_scores.mean():.3f}")
print(f"  Std Dev: {cv_scores.std():.3f}")

# Output:
# Cross-Validation Results:
#   Scores: [0.68, 0.71, 0.65, 0.69, 0.72]
#   Mean R²: 0.690
#   Std Dev: 0.025  → Low std = stable model!

# If std is HIGH, the model has high variance (unstable)
```
| Concept | What It Means | How to Address |
|---|---|---|
| High Bias | Model too simple, underfits | More complex model, more features |
| High Variance | Model too complex, overfits | Regularization, more data, simpler model |
| Gradient Descent | How models learn optimal weights | Tune learning rate, iterations |
| Learning Rate | Step size in gradient descent | Start with 0.01, adjust based on convergence |
| Regularization | Penalty for complexity | L1 for feature selection, L2 for shrinkage |
| Cross-Validation | Reliable model evaluation | Use 5-10 folds, report mean Β± std |
The course source uses auto-mpg.csv: data = pd.read_csv("auto-mpg.csv"), then LinearRegression, Ridge(alpha=...), Lasso(alpha=...). Ridge = L2 penalty; Lasso = L1 (can shrink coefficients to exactly zero). Download auto_mpg.csv from the datasets page (note the underscore in the download name vs. the hyphen in the notebook filename; rename the file or adjust the path accordingly). See Regularization.pdf in the course source for slides.
Every line of code (verbatim).
# --- Code cell 1 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: blue !important; }
h2 { color: green !important; }
h3 { color: purple !important; }
</style>
""")
# --- Code cell 2 ---
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# --- Code cell 5 ---
data = pd.read_csv("auto-mpg.csv")
# --- Code cell 6 ---
#Features in data set
#cylinders: contains the number of cylinders present in the car
#displacement: contains the Displacement of the car
#horsepower: contains the Horsepower of the car
#weight: contains the weight of the car
#acceleration: contains the Acceleration of the car
#model_year: contains the model year of the car
#origin: contains the origin country which car belong to
#car_name: contains the name of the car(Brand-Model-Variant)
#predict Miles per Gallon
#mpg: contains the fuel consumption value(in Miles per Gallon) for car
# --- Code cell 7 ---
data.head(15)
# --- Code cell 8 ---
data.info()
# --- Code cell 11 ---
data['horsepower'] = data['horsepower'].str.replace('?','NaN').astype(float)
data['horsepower'].fillna(data['horsepower'].mean(),inplace=True)
data['horsepower'] = data['horsepower'].astype(int)
# --- Code cell 12 ---
data.info()
# --- Code cell 13 ---
data.describe(include='all').round(2)
# --- Code cell 15 ---
data.columns
# --- Code cell 17 ---
#Correlation of output with numerical variables
numerical_columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight','acceleration']
# plotting correlation heatmap
dataplot = sns.heatmap(data[numerical_columns].corr(), cmap="YlGnBu", annot=True)
# displaying heatmap
plt.show()
# --- Code cell 19 ---
data = pd.get_dummies(data,columns=['origin','model year']) # create features
data.drop(columns=['car name'],axis=1,inplace=True) # drop unwanted data
data.head(10)
# --- Code cell 20 ---
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(data, test_size=0.20, random_state=0)
y_train = x_train.pop('mpg')
y_test = x_test.pop('mpg')
# --- Code cell 21 ---
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
num_vars = ['cylinders', 'displacement', 'horsepower', 'weight','acceleration']
x_train[num_vars] = scaler.fit_transform(x_train[num_vars])
x_test[num_vars] = scaler.transform(x_test[num_vars])
# --- Code cell 22 ---
print(x_train.head(10))
# --- Code cell 26 ---
# Try with different values of regularization parameter alpha
lasso = Lasso(alpha=0.1) #alpha` must be a non-negative float i.e. in `[0, inf)
lasso.fit(x_train,y_train)
for z in range(len(list(x_train.columns))):
print("Lasso: The coefficient for {} is {}".format(x_train.columns[z], lasso.coef_[z]))
# --- Code cell 27 ---
from sklearn.metrics import r2_score
y_test_pred = lasso.predict(x_test)
r2_score(y_test, y_test_pred)
# --- Code cell 30 ---
# L2 Regularization
ridge = Ridge(alpha=10.0) #alpha` must be a non-negative float i.e. in `[0, inf)
ridge.fit(x_train,y_train)
for z in range(len(list(x_train.columns))):
print("Ridge: The coefficient for {} is {}".format(x_train.columns[z], ridge.coef_[z]))
# --- Code cell 31 ---
y_test_pred = ridge.predict(x_test)
r2_score(y_test, y_test_pred)
Every line of code (verbatim).
# --- Code cell 1 ---
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
# --- Code cell 2 ---
# 1. Generate a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 2 - 1, axis=0) # Generate 100 points between -1 and 1
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.2, 100) # True function sin(2*pi*x) with noise
# --- Code cell 3 ---
# 2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# --- Code cell 4 ---
# 3. Use polynomial features to create a high-dimensional space
degree = 15
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# --- Code cell 5 ---
# 4. Train a simple Linear Regression model (no regularization) as a baseline
lr = LinearRegression()
lr.fit(X_train_poly, y_train)
# --- Code cell 6 ---
# 5. Train Ridge Regression models with different lambda (alpha) values
lambdas = [0.001, 0.1, 10]
models = []
for l in lambdas:
ridge = Ridge(alpha=l)
ridge.fit(X_train_poly, y_train)
models.append(ridge)
# --- Code cell 7 ---
# 6. Plotting the results
plt.figure(figsize=(12, 8))
plt.scatter(X_train, y_train, s=20, c='b', label='Training data')
plt.scatter(X_test, y_test, s=20, c='r', label='Test data')
# Create a smooth line for plotting the fits
x_line = np.linspace(-1, 1, 100).reshape(-1, 1)
x_line_poly = poly.transform(x_line)
# Plot Linear Regression fit
y_line_lr = lr.predict(x_line_poly)
plt.plot(x_line, y_line_lr, c='k', linestyle='--', label='Linear Regression (No Regularization)')
# Plot Ridge Regression fits
line_styles = ['-', ':', '-.']
for i, l in enumerate(lambdas):
y_line_ridge = models[i].predict(x_line_poly)
plt.plot(x_line, y_line_ridge, linestyle=line_styles[i], label=f'Ridge ($\\lambda$={l})')
plt.title('Impact of $\\lambda$ on Model Fit')
plt.xlabel('$X$')
plt.ylabel('$y$')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
# --- Code cell 8 ---
# 7. Print coefficients and R-squared scores
print("--------------------------------------------------")
print("Model Coefficients and Performance")
print("--------------------------------------------------")
# Linear Regression
print("\nLinear Regression (No Regularization):")
print(f" Coefficients: {np.round(lr.coef_, 2)}")
print(f" R-squared (Train): {r2_score(y_train, lr.predict(X_train_poly)):.4f}")
print(f" R-squared (Test): {r2_score(y_test, lr.predict(X_test_poly)):.4f}")
# Ridge Regression
for i, l in enumerate(lambdas):
y_train_pred = models[i].predict(X_train_poly)
y_test_pred = models[i].predict(X_test_poly)
print(f"\nRidge ($\\lambda$={l}):")
print(f" Coefficients: {np.round(models[i].coef_, 2)}")
print(f" R-squared (Train): {r2_score(y_train, y_train_pred):.4f}")
print(f" R-squared (Test): {r2_score(y_test, y_test_pred):.4f}")
In one sentence: why does a very high learning rate in gradient descent lead to unstable training (loss jumping around) instead of converging?