Every line of the linear regression code explained in simple words. We predict house price from area, bedrooms, etc.
Keep Housing.csv in the working directory so pd.read_csv("Housing.csv") works.
Load the libraries we need.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Read the Housing CSV into a DataFrame.
data = pd.read_csv("Housing.csv")
data.head(10)
The first 10 rows of data. Columns: price, area, bedrooms, bathrooms, stories, mainroad, guestroom, basement, etc. Next, check column types and non-null counts.
data.info()
See the min, max, average, and spread of every numeric column in one table.
data.describe()
LinearRegression needs numeric inputs, so first convert the text columns (the yes/no flags and furnishingstatus) to 0/1 dummies. Then use 80% of the data to train the model and 20% to test it.
from sklearn.model_selection import train_test_split

data = pd.get_dummies(data, drop_first=True)  # yes/no and furnishingstatus become 0/1 columns
df_train, df_test = train_test_split(data, train_size=0.8, test_size=0.2, random_state=100)
80% of the rows go to df_train, 20% to df_test. random_state=100 makes the split the same every time you run. We want to predict price, so price is y; the other columns (area, bedrooms, etc.) are X.
y_train = df_train['price']
X_train = df_train.drop('price', axis=1)
y_test = df_test['price']
X_test = df_test.drop('price', axis=1)
We use sklearn's LinearRegression to fit a line (or plane) to the data.
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
Use the test set to see how well the model predicts.
predictions = lr_model.predict(X_test)
print(f"R² score on test: {lr_model.score(X_test, y_test):.3f}")
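To eyeball a few predictions next to the true prices, a minimal sketch (the comparison frame is our own addition, not from the course):

comparison = pd.DataFrame({'actual': y_test, 'predicted': predictions})
print(comparison.head())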
The course source also walks through: outlier treatment (removing extreme price/area values so one house doesn't drag the line), mapping yes/no to 1/0 column by column, dummy variables for furnishing status, MinMaxScaler so all features are on a similar scale, and residual analysis (plotting the errors to check whether the model is fair). Those steps are covered in the main Linear Regression lesson; here we kept the minimal code path (a single get_dummies call does the encoding) so you can run and understand every line quickly. See Linear Regression.pdf in the course source for slides.
Every line of code from the course notebook (verbatim).
# --- Code cell 1 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")
# --- Code cell 2 ---
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Import the numpy and pandas package
import numpy as np
import pandas as pd
# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# --- Code cell 4 ---
data = pd.read_csv("Housing.csv")
data.head(10)
# --- Code cell 5 ---
data.info()
# --- Code cell 6 ---
data.describe()
# --- Code cell 8 ---
fig, axs = plt.subplots(2,3, figsize = (10,5))
plt1 = sns.boxplot(data['price'], ax = axs[0,0]).set(xlabel= 'price')
plt2 = sns.boxplot(data['area'], ax = axs[0,1]).set(xlabel='area')
plt3 = sns.boxplot(data['bedrooms'], ax = axs[0,2]).set(xlabel='bedrooms')
plt1 = sns.boxplot(data['bathrooms'], ax = axs[1,0]).set(xlabel='bathrooms')
plt2 = sns.boxplot(data['stories'], ax = axs[1,1]).set(xlabel='stories')
plt3 = sns.boxplot(data['parking'], ax = axs[1,2]).set(xlabel='parking')
plt.tight_layout()
# --- Code cell 9 ---
# outlier treatment for price
Q1 = data.price.quantile(0.25) # data['price'].quantile(0.25)
Q3 = data.price.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.price >= Q1 - 1.5*IQR) & (data.price <= Q3 + 1.5*IQR)]
print(Q3,Q1)
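The same 1.5×IQR rule is reused for area in cell 11 below; as a sketch (our own helper, not in the course notebook), it can be factored into one reusable function:

# --- Editor's sketch (not in the course notebook) ---
def drop_iqr_outliers(df, col, k=1.5):
    # keep rows within [Q1 - k*IQR, Q3 + k*IQR] for the given column
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] >= q1 - k*iqr) & (df[col] <= q3 + k*iqr)]
# usage: data = drop_iqr_outliers(data, 'price')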
# --- Code cell 10 ---
len(data)
# --- Code cell 11 ---
# outlier treatment for area
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]
# --- Code cell 12 ---
len(data)
# --- Code cell 13 ---
data.columns
# --- Code cell 14 ---
#Correlation of output with numerical variables
# plotting correlation heatmap
dataplot = sns.heatmap(data[['price', 'area', 'bedrooms', 'bathrooms', 'stories','parking']].corr(), cmap="YlGnBu", annot=True)
# displaying heatmap
plt.show()
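If you prefer numbers to colors, the same information can be printed as a sorted list (a small sketch, not part of the notebook):

# --- Editor's sketch: correlation of each numeric feature with price ---
num_cols = ['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'parking']
print(data[num_cols].corr()['price'].sort_values(ascending=False))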
# --- Code cell 15 ---
#Visualizing categorical variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'mainroad', y = 'price', data = data)
plt.subplot(2,3,2)
sns.boxplot(x = 'guestroom', y = 'price', data = data)
plt.subplot(2,3,3)
sns.boxplot(x = 'basement', y = 'price', data = data)
plt.subplot(2,3,4)
sns.boxplot(x = 'hotwaterheating', y = 'price', data = data)
plt.subplot(2,3,5)
sns.boxplot(x = 'airconditioning', y = 'price', data = data)
plt.subplot(2,3,6)
sns.boxplot(x = 'furnishingstatus', y = 'price', data = data)
plt.show()
# --- Code cell 16 ---
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'mainroad', y = 'price', hue = 'airconditioning', data = data)
plt.show()
# --- Code cell 18 ---
# List of variables to map
cat_features = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
# Defining the map function
def create_features(x):
    return x.map({'yes': 1, 'no': 0})
# Applying the function to the housing list
data[cat_features] = data[cat_features].apply(create_features)
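Any value other than 'yes'/'no' would silently become NaN after .map(), so a quick sanity check (our own addition, not in the notebook) is cheap insurance:

# --- Editor's sketch: confirm the mapping produced no missing values ---
print(data[cat_features].isna().sum())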
# --- Code cell 19 ---
data.head(10)
# --- Code cell 20 ---
#Create dummy features for categorical variables
data_cat = pd.get_dummies(data['furnishingstatus'],drop_first=True)
# --- Code cell 21 ---
data_cat.head(100)
# --- Code cell 22 ---
data = pd.concat([data, data_cat], axis = 1)
data.drop(['furnishingstatus'], axis = 1, inplace = True)
data.head(10)
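What drop_first=True buys: furnishingstatus has three levels (furnished, semi-furnished, unfurnished), so two dummy columns carry all the information; the dropped first level is the case where both dummies are 0. A toy illustration (hypothetical mini-frame, not course data):

# --- Editor's sketch ---
toy = pd.DataFrame({'furnishingstatus': ['furnished', 'semi-furnished', 'unfurnished']})
print(pd.get_dummies(toy['furnishingstatus'], drop_first=True))
# 'furnished' is dropped; a row with both dummies 0 means 'furnished'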
# --- Code cell 24 ---
from sklearn.model_selection import train_test_split
# np.random.seed(0)  # not required; random_state below already makes the split reproducible
df_train, df_test = train_test_split(data, train_size = 0.8, test_size = 0.2, random_state = 100)
# --- Code cell 25 ---
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# --- Code cell 26 ---
# Apply the scaler to the numeric feature columns only (the yes/no and dummy columns are already 0/1)
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
# --- Code cell 27 ---
df_train.describe()
# --- Code cell 28 ---
df_test[num_vars] = scaler.transform(df_test[num_vars])
# --- Code cell 29 ---
df_test.describe()
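The key detail in cells 26 and 28: fit_transform learns each column's min and max from the training set, and transform reuses those same numbers on the test set, so no information leaks from test into training. You can inspect what the scaler learned (data_min_ and data_max_ are standard MinMaxScaler attributes):

# --- Editor's sketch: the min/max learned from the training data only ---
for col, lo, hi in zip(num_vars, scaler.data_min_, scaler.data_max_):
    print(f"{col}: min={lo}, max={hi}")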
# --- Code cell 30 ---
y_train = df_train.pop('price') # df_train['price'] #labels in training data
x_train = df_train # features in training data
# --- Code cell 31 ---
y_test = df_test.pop('price') # df_test['price'] # labels in test data
x_test = df_test # features in test data
# --- Code cell 34 ---
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(x_train, y_train) # training
# --- Code cell 35 ---
help(LinearRegression)
# --- Code cell 36 ---
#Let's see the summary of our linear model
print(lr_model.coef_)
# --- Code cell 37 ---
# this is for our understanding - not needed for LR model
import numpy as np
importance = np.array(lr_model.coef_)
importance = importance / sum(importance)
print(importance)
# --- Code cell 38 ---
for z in range(len(list(x_train.columns))):
    print("The Importance coefficient for {} is {}".format(x_train.columns[z], importance[z]))
# --- Code cell 39 ---
print(lr_model.intercept_)
# --- Code cell 40 ---
x_train.columns
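A linear model's prediction is just the intercept plus the dot product of coefficients and feature values. Rebuilding one prediction by hand (a sketch to check understanding, not in the notebook):

# --- Editor's sketch: reconstruct the first training prediction manually ---
row = x_train.iloc[0].astype(float).values
manual = lr_model.intercept_ + np.dot(lr_model.coef_, row)
print(manual, lr_model.predict(x_train.iloc[[0]])[0])  # the two numbers should match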
# --- Code cell 43 ---
#residual analysis
#Residual analysis is typically performed on the training data rather than the test data.
#The purpose of residual analysis is to assess the performance and
#assumptions of your regression model during the training phase.
# --- Code cell 45 ---
y_train_pred = lr_model.predict(x_train)
residuals = (y_train_pred - y_train)  # prediction minus actual: positive means over-prediction
# --- Code cell 46 ---
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot(residuals, bins = 20, kde = True)  # distplot is deprecated; histplot is the current equivalent
fig.suptitle('Error Terms', fontsize = 20) # Plot heading
plt.xlabel('Errors', fontsize = 18) # X-label
# --- Code cell 47 ---
plt.scatter(y_train,residuals)
plt.show()
# The model gives negative errors in the high price range and positive errors in the low price range
# It's estimating somewhat higher prices for cheaper houses and lower prices for more expensive houses
# We might need more data, or the relationship between input and output is not linear in all regions
# In the region where the model does fine (residuals around 0) we have a lot of data points
# --- Code cell 50 ---
# Making predictions
y_test_pred = lr_model.predict(x_test)
# --- Code cell 51 ---
from sklearn.metrics import r2_score,mean_absolute_error,mean_absolute_percentage_error
r2_score(y_test, y_test_pred)
# --- Code cell 52 ---
mean_absolute_percentage_error(y_test, y_test_pred)
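mean_absolute_error was imported in cell 51 but never called; for completeness, the average error in the same units as price (a one-line addition of ours):

# --- Editor's sketch ---
print(mean_absolute_error(y_test, y_test_pred))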
# --- Code cell 53 ---
# Less than ~20% average percentage error is not a bad result on such a small dataset
# Results would improve if we had more data to train the model
# Additional features like distance to the best schools, shopping and entertainment options etc. would also help
# --- Code cell 54 ---
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_test_pred)
fig.suptitle('y_test vs y_test_pred', fontsize=20) # Plot heading
plt.xlabel('y_test', fontsize=18) # X-label
plt.ylabel('y_test_pred', fontsize=16) # Y-label
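A common finishing touch (our own, not in the notebook) is a y = x reference line: points on the line are perfect predictions, points below it are under-predictions.

# --- Editor's sketch: add a y = x reference line to the scatter above ---
lims = [min(y_test.min(), y_test_pred.min()), max(y_test.max(), y_test_pred.max())]
plt.plot(lims, lims, linestyle='--')
plt.show()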