Your first REAL Machine Learning algorithm! We'll explain it so simply that you'll wonder why you were ever scared of ML!
Linear Regression is like drawing the BEST possible straight line through a bunch of dots!
The line helps you PREDICT new values you haven't seen before!
It's the one straight line that gets closest to all the dots at once. "Closest" means: if you add up how far each dot is from the line (the errors), that total is as small as possible. So the computer tries many lines and picks the one that makes the total error smallest. That's your line of best fit.
Because the line is only useful if it predicts well. If the line is far from the dots, our predictions will be wrong. So we define "best" as: the line that makes the prediction errors as small as possible. The cost function (MSE) is just the way we measure that total error so the algorithm can improve it step by step.
You have data about houses: each house's area (in sq ft) and its price.
Now someone asks: "How much for a 1750 sq ft house?"
Linear regression draws a line and says: "About $175,000!"
Below: a graph with a clear X-axis (Area in sq ft) and Y-axis (Price). Dots = real data; the line = our prediction.
Dots = Actual house prices (blue dots above) | Line = Predicted price for any area
The line's equation is y = mx + b, where:
m (slope) = "For every 1 sq ft increase, price goes up by $m"
If m = 100, then each extra sq ft adds $100 to the price!
b (intercept) = "The starting price even for 0 sq ft"
(In reality, a 0 sq ft house doesn't exist, but mathematically we need this!)
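To make this concrete, here's a tiny sketch in plain Python. The numbers m = 100 and b = 0 are made up purely for illustration (m = 100 matches the "$100 per extra sq ft" example above, and b = 0 keeps the result consistent with the $175,000 prediction for a 1750 sq ft house):

```python
m = 100   # slope: dollars added per extra square foot (illustrative value)
b = 0     # intercept: the "starting price" at 0 sq ft (illustrative value)

def predict_price(area_sqft: float) -> float:
    """Predict a house price from its area using y = m*x + b."""
    return m * area_sqft + b

print(predict_price(1750))   # 175000.0
```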
We have the formula y = mx + b, but how do we choose the best m and b? We use a cost function.
A cost function (also called a loss function) measures how wrong our predictions are. It's a single number: the smaller it is, the better our line fits the data. Linear regression finds the m and b that minimize this cost.
For linear regression we usually use Mean Squared Error (MSE):

MSE = (1/n) × Σ (actual − predicted)²

In words: for each data point, take the difference between the actual value and the predicted value, square it, then average all of those squared differences.

So: MSE = average of (squared errors). We want to make this as small as possible.
We square so that negative and positive errors can't cancel each other out, and so that big errors are punished much more than small ones.

Dividing by n (the number of data points) gives an average error, so the cost doesn't depend on how many data points we have.
Algorithms like Gradient Descent (see the Bias, Variance & Gradient Descent lesson) start with some m and b, then repeatedly adjust them in the direction that reduces the cost (MSE). When the cost stops going down, we have (approximately) the best line.
So: Cost Function = what we minimize to get the best linear regression line.
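If you want to see what "reduce the cost step by step" looks like in code, here's a minimal NumPy sketch on made-up toy numbers (not the full algorithm from the other lesson, just the idea): compute the MSE for the current m and b, then nudge them a little in the direction that lowers it.

```python
import numpy as np

# Toy data: areas (x) and prices (y). Made-up numbers, for illustration only.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([100_000.0, 148_000.0, 205_000.0, 248_000.0])

def mse(m, b):
    """Mean Squared Error: average of the squared prediction errors."""
    predictions = m * x + b
    return np.mean((y - predictions) ** 2)

def gradient_step(m, b, learning_rate=1e-8):
    """One gradient descent step: move m and b slightly downhill on the cost.
    The learning rate is tiny because the areas are unscaled (in the thousands)."""
    errors = (m * x + b) - y              # prediction - actual, for each point
    dm = 2 * np.mean(errors * x)          # derivative of MSE with respect to m
    db = 2 * np.mean(errors)              # derivative of MSE with respect to b
    return m - learning_rate * dm, b - learning_rate * db

m, b = 0.0, 0.0
for step in range(1000):
    m, b = gradient_step(m, b)

# m should end up close to 100 (about $100 per extra sq ft on this toy data)
print(f"m ≈ {m:.1f}, b ≈ {b:.1f}, MSE ≈ {mse(m, b):,.0f}")
```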
Drag the slope slider to move the regression line. The red dashed lines show the errors. Watch the MSE number update in real time!
The best line minimizes MSE. Too flat or too steep = bigger errors (red dashes). Find the sweet spot!
Download this CSV file and save it in your working directory to run the code examples.
Garbage In = Garbage Out!
If your data has problems, your predictions will be wrong. So we MUST clean the data first!
```python
import pandas as pd
import numpy as np

# Load the housing data
data = pd.read_csv("Housing.csv")

# Let's see what our data looks like
print("First 5 rows:")
print(data.head())

# Output:
#       price  area  bedrooms  bathrooms  stories mainroad ...
# 0  13300000  7420         4          2        3      yes
# 1  12250000  8960         4          4        4      yes
# 2  12250000  9960         3          2        2      yes
# 3  12215000  7500         4          2        2      yes
# 4  11410000  7420         4          1        2      yes
```
pd.read_csv("Housing.csv") โ Reads the CSV file and stores it in data.
print(data.head()) โ Shows the first 5 rows so you can see the columns (price, area, bedrooms, etc.).
Imagine most houses cost $100K-$500K, but ONE costs $50 Million!
That one mansion will pull our prediction line UP, making predictions wrong for normal houses!
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Draw boxplots to SEE the outliers
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Boxplot for price
sns.boxplot(data['price'], ax=axs[0])
axs[0].set_title('Price Distribution')

# Boxplot for area
sns.boxplot(data['area'], ax=axs[1])
axs[1].set_title('Area Distribution')

plt.show()

# The dots outside the "whiskers" are OUTLIERS!
```
```
      o          ← Outlier (unusual!)
      |
  ┌───┴───┐
  │       │      ← Box (middle 50% of data)
  ├───────┤      ← Median line
  │       │
  └───┬───┘
      |
      o          ← Outlier (unusual!)
```
```python
# Remove outliers using the IQR method

# For PRICE:
Q1_price = data['price'].quantile(0.25)   # 25th percentile
Q3_price = data['price'].quantile(0.75)   # 75th percentile
IQR_price = Q3_price - Q1_price           # Interquartile Range

# Keep only houses within the normal price range
data = data[(data['price'] >= Q1_price - 1.5 * IQR_price) &
            (data['price'] <= Q3_price + 1.5 * IQR_price)]
print(f"After removing price outliers: {len(data)} houses")

# For AREA:
Q1_area = data['area'].quantile(0.25)
Q3_area = data['area'].quantile(0.75)
IQR_area = Q3_area - Q1_area

data = data[(data['area'] >= Q1_area - 1.5 * IQR_area) &
            (data['area'] <= Q3_area + 1.5 * IQR_area)]
print(f"After removing area outliers: {len(data)} houses")

# Output:
# After removing price outliers: 510 houses
# After removing area outliers: 497 houses
```
We want to use features that ACTUALLY affect the price!
If "number of windows" has 0 correlation with price, why include it?
```python
# Create a correlation heatmap
# This shows how strongly each feature relates to others

# Select only numeric columns
numeric_cols = ['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'parking']

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(data[numeric_cols].corr(),
            annot=True,       # Show numbers on the heatmap
            cmap='YlGnBu',    # Color scheme
            center=0)
plt.title('Correlation Heatmap')
plt.show()

# What to look for:
# - Numbers close to +1 or -1 = Strong correlation
# - Numbers close to 0 = No correlation
# - "area" has ~0.54 correlation with "price" = GOOD!
```
Computers don't understand "yes" or "no" - they only understand numbers!
So we convert: "yes" โ 1 and "no" โ 0
```python
# List of columns that have "yes/no" values
yes_no_columns = ['mainroad', 'guestroom', 'basement',
                  'hotwaterheating', 'airconditioning', 'prefarea']

# Create a function to convert yes/no to 1/0
def convert_yes_no(x):
    return x.map({'yes': 1, 'no': 0})

# Apply the function to all yes/no columns
data[yes_no_columns] = data[yes_no_columns].apply(convert_yes_no)

# Let's verify it worked
print(data[['mainroad', 'guestroom', 'basement']].head())

# Output:
#    mainroad  guestroom  basement
# 0         1          0         0
# 1         1          0         0
# 2         1          0         1
# 3         1          0         1
# 4         1          1         1
```
"furnishingstatus" has 3 options: furnished, semi-furnished, unfurnished
We can't just say: furnished=1, semi=2, unfurnished=3
Because then the computer thinks unfurnished is "3 times more" than furnished!
Create SEPARATE columns for each option:
| Original | is_furnished | is_semi | is_unfurnished |
|---|---|---|---|
| furnished | 1 | 0 | 0 |
| semi-furnished | 0 | 1 | 0 |
| unfurnished | 0 | 0 | 1 |
```python
# One-hot encode furnishingstatus
# drop_first=True removes one column (to avoid redundancy)
dummies = pd.get_dummies(data['furnishingstatus'], drop_first=True)

print("New columns created:")
print(dummies.head())

# Output:
#    semi-furnished  unfurnished
# 0               0            0    ← This means "furnished"
# 1               0            0    ← This means "furnished"
# 2               1            0    ← This means "semi-furnished"
# 3               0            1    ← This means "unfurnished"

# Add these new columns to our data
data = pd.concat([data, dummies], axis=1)

# Remove the original column
data.drop(['furnishingstatus'], axis=1, inplace=True)
```
Imagine studying for a test by memorizing the answers...
You'd get 100% on THAT test, but fail any NEW test!
That's called cheating (or "overfitting" in ML)!
Split your data into TWO parts:
The model NEVER sees test data during training. It's like a surprise quiz!
```python
from sklearn.model_selection import train_test_split

# Separate the data into X (features) and y (target)
# Features = What we use to predict
# Target   = What we want to predict
y = data['price']               # TARGET: house price
X = data.drop('price', axis=1)  # FEATURES: everything else

# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,     # 80% for training
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print(f"Training set size: {len(X_train)} houses")
print(f"Test set size: {len(X_test)} houses")

# Output:
# Training set size: 397 houses
# Test set size: 100 houses
```
Imagine comparing:
5000 is WAY bigger than 3, so the model might think "area is more important!"
But that's not fair! Scaling puts them on the SAME scale (like 0 to 1).
Many algorithms (including the one that finds the best line) adjust weights for each feature. If one feature has numbers in the thousands (e.g. area) and another has numbers 1โ5 (e.g. bedrooms), the big numbers can dominate and the algorithm has a harder time finding a good balance. Scaling puts every feature in a similar range (e.g. 0โ1) so no single feature unfairly dominates just because of its units.
What happens if you don't scale? Sometimes the model still works, but training can be slower or less stable; for other algorithms (e.g. those that use distance or regularization) skipping scaling can give worse or wrong results. So: when in doubt, scale.
```python
from sklearn.preprocessing import MinMaxScaler

# Create the scaler
scaler = MinMaxScaler()

# Only scale NUMERIC columns (not the yes/no columns)
numeric_features = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']

# IMPORTANT: Fit scaler on TRAINING data only!
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])

# Apply the SAME scaling to test data (without fitting again)
X_test[numeric_features] = scaler.transform(X_test[numeric_features])

# Check the result
print("Before scaling, area ranged from ~2000 to ~10000")
print(f"After scaling, area ranges from {X_train['area'].min():.2f} to {X_train['area'].max():.2f}")

# Output:
# Before scaling, area ranged from ~2000 to ~10000
# After scaling, area ranges from 0.00 to 1.00
```
fit_transform on training data = "Learn the min/max AND apply scaling"
transform on test data = "Use the SAME min/max learned before"
If you fit on test data too, you're "cheating" by peeking at test information!
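If the difference between fit_transform and transform is still hazy, here's a tiny standalone sketch (toy numbers, separate from the housing data) showing what "use the SAME min/max learned before" means in practice:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy example, independent of the housing data, just to show the idea.
train_areas = np.array([[2000.0], [4000.0], [6000.0]])  # scaler will learn min=2000, max=6000
test_areas = np.array([[3000.0], [7000.0]])             # 7000 is bigger than anything in training

scaler = MinMaxScaler()

# fit_transform: LEARN the min/max from training data, then scale it
print(scaler.fit_transform(train_areas).ravel())   # [0.   0.5  1. ]

# transform: reuse the min/max learned from training (no peeking at test data)
print(scaler.transform(test_areas).ravel())        # [0.25 1.25]  <- can fall outside 0-1, and that's expected
```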
After all that preparation, building the model is just 3 lines of code!
```python
from sklearn.linear_model import LinearRegression

# Step 1: Create the model
model = LinearRegression()

# Step 2: Train the model (find the best line!)
model.fit(X_train, y_train)

# That's it! The model has learned!

# Let's see what it learned:
print("Intercept (b):", model.intercept_)
print("\nCoefficients (m for each feature):")
for feature, coef in zip(X_train.columns, model.coef_):
    print(f"  {feature}: {coef:,.0f}")

# Output:
# Intercept (b): 2,189,813
#
# Coefficients (m for each feature):
#   area: 2,527,089          ← Most important!
#   bedrooms: 193,001
#   bathrooms: 1,520,868     ← Very important!
#   stories: 1,471,086
#   airconditioning: 657,332
#   ...
```
area: 2,527,089
This means: When area increases from 0 to 1 (in our scaled data), price goes up by ~$2.5 million!
Since we scaled area to 0-1, the coefficient shows the IMPORTANCE of each feature.
Area and bathrooms are the most important predictors!
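If you want to rank all the features at once instead of eyeballing the printout, here's a small sketch (it assumes the `model` and `X_train` objects created above) that sorts the coefficients by absolute size:

```python
import pandas as pd

# Put feature names and their learned coefficients side by side
coef_table = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': model.coef_,
})

# Sort by absolute size: a bigger magnitude means a bigger effect on the (scaled) inputs
coef_table['abs_coefficient'] = coef_table['coefficient'].abs()
print(coef_table.sort_values('abs_coefficient', ascending=False).head(10))
```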
```python
# Make predictions on the TEST data
# (data the model has NEVER seen before!)
y_pred = model.predict(X_test)

# Compare predictions vs actual prices
comparison = pd.DataFrame({
    'Actual Price': y_test.values,
    'Predicted Price': y_pred,
    'Difference': y_test.values - y_pred
})
print(comparison.head(10))

# Output:
#    Actual Price  Predicted Price  Difference
# 0     4,200,000        4,350,000    -150,000
# 1     5,950,000        5,800,000     150,000
# 2     3,640,000        3,500,000     140,000
# ...
```
```python
# Create a scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(y_test, y_pred, alpha=0.5)

# Draw a perfect prediction line
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         'r--', label='Perfect Prediction')

plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted House Prices')
plt.legend()
plt.show()

# If all points are ON the red line = Perfect predictions!
# If points are SCATTERED far from the line = Poor predictions
```
Just looking at predictions isn't enough. We need concrete metrics!
R² = 0%: Model is useless (just guessing the average)
R² = 100%: Model is PERFECT (predicts exactly right)
R² = 70%: Model explains 70% of the variation in prices
MAPE = 15% means "On average, our predictions are off by 15%"
Lower is better! MAPE < 20% is usually considered acceptable.
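Before reaching for sklearn (the next code block does exactly that), it can help to see that both metrics are simple averages. Here's a hand-rolled sketch on made-up actual and predicted prices, just to show what MAPE and R² compute:

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only
actual = np.array([4_000_000.0, 5_000_000.0, 2_500_000.0, 3_500_000.0])
predicted = np.array([4_400_000.0, 4_500_000.0, 2_750_000.0, 3_400_000.0])

# MAPE: the average of |actual - predicted| / actual
mape = np.mean(np.abs(actual - predicted) / actual)
print(f"MAPE: {mape:.2%}")

# R²: 1 - (our model's squared error) / (squared error of "just guess the average")
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R²: {r2:.2%}")
```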
```python
from sklearn.metrics import r2_score, mean_absolute_percentage_error

# Calculate R² Score
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2%}")

# Calculate MAPE
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE: {mape:.2%}")

# Output:
# R² Score: 68.45%
# MAPE: 16.23%

# Interpretation:
# - Our model explains ~68% of price variation
# - On average, predictions are off by ~16%
# - For a house worth $5M, we might be off by ~$800K
```
| R² Score | Quality |
|---|---|
| > 90% | Excellent |
| 70-90% | Good |
| 50-70% | Acceptable |
| < 50% | Needs improvement |
Our 68% is in the "Acceptable" range - not bad for a simple model!
Residual = Actual Value - Predicted Value
It's how much we were WRONG by. Also called "error."
Blue dots = actual data. Pink line = prediction. Dashed vertical lines = residuals (errors). They "pulse" so you see how far each point is from the line.
```python
# Calculate residuals on TRAINING data
y_train_pred = model.predict(X_train)
residuals = y_train - y_train_pred

plt.figure(figsize=(12, 5))

# Plot 1: Distribution of errors
plt.subplot(1, 2, 1)
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals (Errors)')
plt.xlabel('Error Amount')

# Plot 2: Residuals vs Predicted values
plt.subplot(1, 2, 2)
plt.scatter(y_train_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Price')
plt.ylabel('Residual (Error)')

plt.tight_layout()
plt.show()
```
If you see patterns, your model is missing something important!
Check your understanding. Click an answer for instant feedback.
In one sentence: why do we use train-test split instead of training on all the data and then reporting accuracy on the same data?
To master linear regression, make sure you know every core point below. The non-core points deepen your understanding and help in interviews.
Understand your data - what columns? Any missing values?
Remove extreme values using IQR method
Find which features actually relate to your target
yes/no โ 1/0, multiple categories โ one-hot encoding
80% for training, 20% for testing (don't cheat!)
Put all features on the same scale (0 to 1)
model = LinearRegression(); model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Calculate R² score and MAPE
Check for patterns in errors
You just learned your FIRST machine learning algorithm!
Linear Regression is the foundation - almost all other algorithms build on these concepts!