๐Ÿฅ REAL HEALTHCARE CASE STUDY

๐Ÿ“Š Logistic Regression

Predict heart disease risk! Learn classification, probability, confusion matrix, and how to handle imbalanced data.

Chapter 1: What is Logistic Regression?

👶 Linear vs Logistic - What's the Difference?

Linear Regression: Predicts a NUMBER (house price = $350,000)

Logistic Regression: Predicts a CATEGORY (Will get heart disease? Yes/No)

Despite the name "Regression", Logistic Regression is used for Classification!

🔢 Linear vs 📋 Logistic

LINEAR REGRESSION (Predict Numbers):
─────────────────────────────────────
Input: House size, bedrooms, location
Output: $425,000 (continuous number)

            Price
              ^
         450K │        ●
         400K │      ●
         350K │    ●
         300K │  ●
              └─────────────────→ Size


LOGISTIC REGRESSION (Predict Categories):
─────────────────────────────────────────
Input: Age, BP, cholesterol, smoking
Output: 0 (No heart disease) or 1 (Yes)

         Probability
              ^
          1.0 │          ●●●●●●
              │        ●
          0.5 │      ●    ← S-shaped curve!
              │    ●
          0.0 │●●●●●
              └─────────────────→ Risk Score

🤔 Why Not Just Use Linear Regression for Yes/No?

Linear Regression can output 1.5 or -0.3 (doesn't make sense for Yes/No!)

Logistic Regression uses a Sigmoid function to squeeze outputs between 0 and 1.

Output > 0.5 → Predict "Yes" (1)

Output ≤ 0.5 → Predict "No" (0)
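
To make this concrete, here is the sigmoid and the 0.5 cutoff in plain NumPy (a minimal sketch of the idea, not scikit-learn's internals):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # example linear-combination scores
probs = sigmoid(z)                          # probabilities between 0 and 1
preds = (probs > 0.5).astype(int)           # "Yes" only when probability > 0.5

print(np.round(probs, 2))  # rises smoothly from about 0.02 up to about 0.98
print(preds)               # [0 0 0 1 1]
```

No matter how extreme z gets, the output stays strictly between 0 and 1 - exactly what a probability needs.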

🎮 The Sigmoid Curve & Decision Threshold

The S-curve converts any number into a probability (0 to 1). Moving the decision threshold changes how many cases are predicted as "Yes".


Low threshold = more "Yes" predictions (catches more, but more false alarms). High threshold = fewer "Yes" predictions (misses some, but more precise).

Chapter 2: Understanding Our Dataset

We're using the Framingham Heart Study dataset to predict 10-year risk of Coronary Heart Disease (CHD).

📥 Download the Dataset to Follow Along!

Download this CSV file and save it in your working directory to run the code examples.

Download CSV (191 KB)

📖 Full code walkthrough (every line explained)

Feature           Description                              Type
────────────────────────────────────────────────────────────────────────
male              Gender (1 = male, 0 = female)            Categorical
age               Age of patient                           Numerical
currentSmoker     Is patient a current smoker?             Categorical
cigsPerDay        Cigarettes smoked per day                Numerical
BPMeds            On blood pressure medication?            Categorical
prevalentStroke   Had a stroke before?                     Categorical
prevalentHyp      Is patient hypertensive?                 Categorical
diabetes          Has diabetes?                            Categorical
totChol           Total cholesterol level                  Numerical
sysBP / diaBP     Systolic / diastolic blood pressure      Numerical
BMI               Body Mass Index                          Numerical
glucose           Blood glucose level                      Numerical
TenYearCHD 🎯     10-year risk of heart disease (TARGET)   0 = No, 1 = Yes

Chapter 3: Building the Model Step-by-Step

Step 1: Import Libraries & Load Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning imports
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Load the heart disease dataset (download from link above!)
data = pd.read_csv("heart_disease_dataset.csv")
print(f"Dataset shape: {data.shape}")
print(data.head())

# Dataset shape: (4238, 16)
#    male  age  education  currentSmoker  cigsPerDay  ...

What each part does (in simple words)

pd.read_csv(...) loads the heart disease CSV into data.

data.shape gives the number of rows and columns.

data.head() shows the first 5 rows. The remaining imports bring in scaling, the train/test split, LogisticRegression, and the evaluation metrics.

Full line-by-line walkthrough →

Step 2: Check for Missing Values

# Check missing values
print(data.isnull().sum())

# male                 0
# age                  0
# cigsPerDay          29  ← Missing!
# totChol             50  ← Missing!
# BMI                 19  ← Missing!
# heartRate            1  ← Missing!
# glucose            388  ← Missing!
# TenYearCHD           0

Step 3: Handle Missing Values

🤔 Why Median Instead of Mean?

The mean is affected by outliers (extreme values pull it up or down).

The median is robust - it's the middle value, unaffected by outliers.

For medical data with potential extreme values, median is safer.
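
A tiny demonstration of that robustness (the cholesterol readings below are made up, including one extreme outlier):

```python
import numpy as np

# Four typical total-cholesterol readings plus one extreme outlier
chol = np.array([180, 190, 200, 210, 600])

print("Mean:  ", chol.mean())      # 276.0 - dragged upward by the outlier
print("Median:", np.median(chol))  # 200.0 - the middle value, unaffected
```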

# Fill missing values with MEDIAN (robust to outliers)
numerical_columns = ['cigsPerDay', 'totChol', 'BMI', 'heartRate', 'glucose']

for col in numerical_columns:
    median_value = data[col].median()
    data[col] = data[col].fillna(median_value)
    print(f"{col}: filled with median = {median_value}")

# Verify no more missing values
print("\nMissing values after imputation:")
print(data.isnull().sum().sum())  # Output: 0

Step 4: Visualize Numerical Features

# Boxplots: Compare features between heart disease (1) vs no heart disease (0)
plt.figure(figsize=(20, 12))

features_to_plot = ['age', 'totChol', 'sysBP', 'diaBP', 'BMI', 
                    'heartRate', 'glucose', 'cigsPerDay']

for i, col in enumerate(features_to_plot, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x='TenYearCHD', y=col, data=data, hue='TenYearCHD', palette='Set2', legend=False)
    plt.title(f'{col} by Heart Disease Risk')

plt.tight_layout()
plt.show()

📊 What the Boxplots Tell Us

  • Age: Heart disease patients are generally OLDER
  • sysBP (Systolic BP): Higher in heart disease group
  • Glucose: Higher in heart disease group
  • Cholesterol: Slightly higher in heart disease group

Step 5: Check Correlation

# Correlation heatmap for numerical features
numerical_cols = ['age', 'cigsPerDay', 'totChol', 'BMI', 'heartRate', 
                  'glucose', 'sysBP', 'diaBP']

plt.figure(figsize=(10, 8))
sns.heatmap(data[numerical_cols].corr(), cmap="YlGnBu", annot=True, fmt=".2f")
plt.title("Correlation Between Numerical Features")
plt.show()

๐Ÿ” Key Finding: sysBP & diaBP are Highly Correlated (0.79)

sysBP (Systolic): Pressure when heart beats

diaBP (Diastolic): Pressure when heart relaxes

They measure similar things, so they're correlated. In advanced models, you might drop one!
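
For intuition, here is how a correlation like that arises; the blood-pressure numbers below are synthetic, generated so that diastolic roughly tracks systolic:

```python
import numpy as np

rng = np.random.default_rng(0)
sys_bp = rng.normal(130, 20, size=500)              # synthetic systolic readings
dia_bp = 0.6 * sys_bp + rng.normal(0, 8, size=500)  # diastolic tracks systolic + noise

r = np.corrcoef(sys_bp, dia_bp)[0, 1]
print(f"Correlation: {r:.2f}")  # strongly positive, around 0.8
```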

Step 6: Split Data & Scale Features

# Separate features (X) and target (y)
y = data["TenYearCHD"]              # What we want to predict
X = data.drop('TenYearCHD', axis=1)  # Everything else

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Scale features to 0-1 range (important for Logistic Regression!)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # Fit AND transform on training
X_test = scaler.transform(X_test)         # Only transform on test

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
# Training samples: 3390
# Test samples: 848

Chapter 4: Train & Evaluate the Model

Step 7: Train the Model

# Create and train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
# Accuracy: 84.55%

โš ๏ธ Wait! 84.55% Accuracy Sounds Great... But Is It?

Let's look deeper at what the model is actually doing!

Step 8: The Confusion Matrix

🤷 What is a Confusion Matrix?

It shows WHERE your model makes mistakes:

  • True Negative (TN): Predicted NO, Actually NO ✅
  • True Positive (TP): Predicted YES, Actually YES ✅
  • False Positive (FP): Predicted YES, Actually NO ❌ (False Alarm)
  • False Negative (FN): Predicted NO, Actually YES ❌ (MISSED!)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

#              Predicted
#              NO    YES
# Actual NO   [[710     0]
# Actual YES   [131     7]]

📊 What the Confusion Matrix Reveals

                    PREDICTED
                   NO      YES
              ┌────────┬────────┐
    Actual NO │  710   │    0   │  ← Great! No false alarms
              ├────────┼────────┤
   Actual YES │  131   │    7   │  ← PROBLEM! Only caught 7 out of 138!
              └────────┴────────┘

    Out of 138 people who WILL get heart disease:
    - Model correctly identified: 7 (5%)  😢
    - Model MISSED: 131 (95%)  😱

    ⚠️ This is TERRIBLE for healthcare!
    Missing heart disease patients is DANGEROUS!

# Full classification report
print(classification_report(y_test, y_pred))

#               precision    recall  f1-score   support
#
#            0       0.84      1.00      0.92       710
#            1       1.00      0.05      0.10       138  ← Recall is only 5%!
#
#     accuracy                           0.85       848

Understanding the Metrics

Metric      Formula          Our Value   Meaning
──────────────────────────────────────────────────────────────────────
Accuracy    (TP+TN) / Total  84.5%       Overall correct predictions (MISLEADING here!)
Precision   TP / (TP+FP)     100%        When we predict YES, how often are we correct?
Recall 🚨   TP / (TP+FN)     5%          Of actual YES cases, how many did we catch?
F1-Score    2*(P*R)/(P+R)    10%         Balance between Precision and Recall
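
You can verify these numbers by hand from the confusion-matrix counts in Step 8 (TN=710, FP=0, FN=131, TP=7):

```python
# Counts taken from the confusion matrix in Step 8
tn, fp, fn, tp = 710, 0, 131, 7

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2%}")   # 84.55%
print(f"Precision: {precision:.2%}")  # 100.00%
print(f"Recall:    {recall:.2%}")     # 5.07%
print(f"F1-score:  {f1:.2%}")         # 9.66%
```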

๐Ÿฅ In Healthcare: Recall is CRITICAL!

We'd rather have some false alarms (tell healthy people to get checked) than MISS someone who actually has heart disease!

Missing 95% of heart disease patients is unacceptable!

Chapter 5: The Imbalanced Data Problem

# Check class distribution
from collections import Counter

print("Training set distribution:")
print(Counter(y_train))
# Counter({0: 2875, 1: 515})

print("\nPercentages:")
print(f"No heart disease (0): {2875/3390:.1%}")
print(f"Heart disease (1): {515/3390:.1%}")
# No heart disease (0): 84.8%
# Heart disease (1): 15.2%

🤔 Why is Imbalanced Data a Problem?

The model learns: "If I just predict NO for everyone, I'll be right 85% of the time!"

It's taking the lazy path instead of learning the actual patterns!
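
You can measure that "lazy path" directly with a baseline that always predicts the majority class (a sketch that rebuilds the test-set labels from the counts above: 710 negatives, 138 positives):

```python
import numpy as np

# Rebuild the test-set labels from the class counts: 710 "no", 138 "yes"
y_true = np.array([0] * 710 + [1] * 138)
y_lazy = np.zeros_like(y_true)  # a "model" that predicts NO for everyone

lazy_accuracy = (y_lazy == y_true).mean()
print(f"Always-NO accuracy: {lazy_accuracy:.1%}")  # 83.7% - yet it catches zero patients
```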

Solution: Class Weights

โš–๏ธ What are Class Weights?

We tell the model: "Mistakes on the MINORITY class (heart disease) are MORE EXPENSIVE!"

Weight {0:1, 1:4} means missing a heart disease case costs 4x more than a false alarm.

# Train with class weights to handle imbalance
weight = {0: 1, 1: 4}  # Penalize missing heart disease 4x more

model_balanced = LogisticRegression(class_weight=weight)
model_balanced.fit(X_train, y_train)

y_pred_balanced = model_balanced.predict(X_test)

print("Confusion Matrix (Balanced):")
print(confusion_matrix(y_test, y_pred_balanced))

#              Predicted
#              NO    YES
# Actual NO   [[567   143]
# Actual YES   [ 61    77]]

📊 Before vs After Class Weights

BEFORE (No weights):              AFTER (With weights):
─────────────────────             ─────────────────────
Caught: 7 out of 138 (5%)         Caught: 77 out of 138 (56%) ✅
Missed: 131 patients 😱           Missed: 61 patients (better!)

Trade-off: More false alarms (143 vs 0), but that's OK in healthcare!

print(classification_report(y_test, y_pred_balanced))

#               precision    recall  f1-score   support
#
#            0       0.90      0.80      0.85       710
#            1       0.35      0.56      0.43       138  ← Recall improved: 5% → 56%!
#
#     accuracy                           0.76       848

✅ Success! Recall Improved from 5% to 56%!

Accuracy dropped from 85% to 76%, but that's a GOOD trade-off in healthcare.

We're now catching 56% of heart disease patients instead of just 5%!

Chapter 6: Model Interpretation - Odds Ratios

🎲 What is an Odds Ratio?

It tells you how much each feature affects the odds of heart disease:

  • Odds Ratio > 1: INCREASES risk (higher is worse)
  • Odds Ratio < 1: DECREASES risk (protective)
  • Odds Ratio ≈ 1: No significant effect

# Calculate odds ratios
features = X.columns.tolist()
odds_ratios = np.exp(model_balanced.coef_)[0]

print("Feature Importance (Odds Ratios):")
print("=" * 50)
for feature, odds in zip(features, odds_ratios):
    print(f"{feature:20s}: {odds:.2f}")

# Feature Importance (Odds Ratios):
# ==================================================
# male                : 1.52  ← Being male increases risk
# age                 : 11.37 ← AGE is a HUGE risk factor!
# education           : 1.00  ← No effect
# currentSmoker       : 1.11  ← Slight increase
# cigsPerDay          : 4.23  ← Major risk factor!
# BPMeds              : 1.17  ← Slight increase
# prevalentStroke     : 2.34  ← Previous stroke increases risk
# prevalentHyp        : 1.47  ← Hypertension increases risk
# diabetes            : 1.78  ← Diabetes increases risk
# totChol             : 2.33  ← High cholesterol increases risk
# sysBP               : 5.59  ← HIGH blood pressure = big risk!
# diaBP               : 0.69  ← Slightly protective (after controlling for sysBP)
# BMI                 : 1.48  ← Higher BMI increases risk
# heartRate           : 0.89  ← Slight protective effect
# glucose             : 2.75  ← High glucose increases risk

๐Ÿ† Top 5 Heart Disease Risk Factors

  1. Age (11.37x) - Older = much higher risk
  2. Systolic BP (5.59x) - High blood pressure is dangerous
  3. Cigarettes/Day (4.23x) - Smoking kills
  4. Glucose (2.75x) - High blood sugar is risky
  5. Previous Stroke (2.34x) - History matters

Chapter 7: Save & Load Your Model

import joblib

# Save the model to a file
joblib.dump(model_balanced, 'heart_disease_model.pkl')
print("✅ Model saved!")

# Load the model later
loaded_model = joblib.load('heart_disease_model.pkl')

# Use loaded model to predict
new_predictions = loaded_model.predict(X_test)
print("✅ Model loaded and working!")

# Check model parameters
print("Model intercept:", loaded_model.intercept_)
# Model intercept: [-3.07863041]

💾 Why Save Models?

Training takes time! Save your trained model so you can:

  • Use it later without retraining
  • Deploy it to a website or app
  • Share it with your team

Chapter 8: Getting Probability Predictions

# Get probabilities instead of just 0/1
probabilities = model_balanced.predict_proba(X_test)

# Create a nice DataFrame to view results
results = pd.DataFrame({
    'Prob_No_HeartDisease': probabilities[:, 0],
    'Prob_HeartDisease': probabilities[:, 1],
    'Predicted': model_balanced.predict(X_test),
    'Actual': y_test.values
})

print(results.head(10))

#    Prob_No_HeartDisease  Prob_HeartDisease  Predicted  Actual
# 0              0.73                 0.27          0       0
# 1              0.92                 0.08          0       0
# 2              0.68                 0.32          0       1  ← Missed!
# 3              0.45                 0.55          1       1  ← Correct!
# 4              0.89                 0.11          0       0

🎯 Why Probabilities Matter

Instead of just "Yes/No", you can say:

"This patient has a 55% probability of heart disease in 10 years."

Doctors can then decide: High-risk patients need immediate intervention!
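
Because predict() is just predict_proba() followed by a 0.5 cutoff, you can apply your own cutoff to the probabilities. This sketch (with illustrative probabilities) lowers the threshold to 0.3 to flag more patients for screening:

```python
import numpy as np

# Illustrative P(heart disease) values for five hypothetical patients
prob_chd = np.array([0.27, 0.08, 0.32, 0.55, 0.11])

flags_default = (prob_chd >= 0.5).astype(int)  # standard cutoff: flags 1 patient
flags_screen  = (prob_chd >= 0.3).astype(int)  # lower cutoff: flags 2 patients

print(flags_default)  # [0 0 0 1 0]
print(flags_screen)   # [0 0 1 1 0]
```

Lowering the threshold trades precision for recall - in a screening setting that is usually the right direction.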

🚫 Common Mistakes in Logistic Regression

  • Relying only on accuracy when classes are imbalanced: a model that always predicts "no" can have high accuracy; use F1, precision, recall, or AUC-ROC instead.
  • Using 0.5 as the only threshold: for imbalanced or cost-sensitive problems, a different threshold (e.g. 0.3) may be better; tune it using the metric that matters.
  • Forgetting to scale features: as with linear regression, scaling helps the optimizer; apply the same scaler fitted on the training data to the test data.

📘 From the course notebook (Logistic Regression)

The course source uses dataset.csv (heart disease risk; target TenYearCHD). Key code: data = pd.read_csv("dataset.csv"); StandardScaler or MinMaxScaler; train_test_split; LogisticRegression().fit(X_train, y_train); confusion_matrix, accuracy_score, roc_curve, classification_report. Download dataset.csv from the datasets page. See Logistic Regression.pdf in the course source for slides.

Complete code from course notebook: logistic_regression.ipynb

Every line of code from the course notebook (lightly cleaned up).

# --- Code cell 1 ---
from IPython.core.display import HTML

HTML("""
<style>

h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")

# --- Code cell 2 ---
import pandas as pd
#import ydata_profiling as yp
# data preprocessing
from sklearn.preprocessing import StandardScaler
# data splitting
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# --- Code cell 3 ---
data = pd.read_csv("dataset.csv")

# --- Code cell 5 ---
# Sex: male or female
# Age: Age of the patient
# Current Smoker: whether or not the patient is a current smoker 
# Cigs Per Day: the number of cigarettes that the person smoked on average in one day
# BP Meds: whether or not the patient was on blood pressure medication 
# Prevalent Stroke: whether or not the patient had previously had a stroke 
# Prevalent Hyp: whether or not the patient was hypertensive 
# Diabetes: whether or not the patient had diabetes
# Tot Chol: total cholesterol level 
# Sys BP: systolic blood pressure 
# Dia BP: diastolic blood pressure 
# BMI: Body Mass Index 
# Heart Rate: heart rate 
# Glucose: glucose level 

#Predict variable (desired target)
# 10 year risk of coronary heart disease CHD (binary: "1" means "Yes", "0" means "No")

# --- Code cell 6 ---
data.head(10)

# --- Code cell 8 ---
data.info()

# --- Code cell 9 ---
print(data.isnull().sum())

# --- Code cell 10 ---
for col in data.columns:
    print(col)
    print(data[col].unique())
    print('\n')

# --- Code cell 12 ---
#categorical_columns = ['education','BPMeds']
numerical_columns = ['cigsPerDay', 'totChol', 'BMI','heartRate', 'glucose']


for column in numerical_columns:
    data[column] = data[column].fillna(data[column].median())

# --- Code cell 13 ---
print(data.isnull().sum())

# --- Code cell 16 ---
data.head(10)

# --- Code cell 17 ---
# What if we had a patient number column?
# Would that be useful as a feature?
# Drop such columns from the data

# --- Code cell 18 ---
data.columns

# --- Code cell 19 ---
#Visualizing numerical variables

plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'TenYearCHD', y = 'age', data = data)
plt.subplot(3,3,2)
sns.boxplot(x = 'TenYearCHD', y = 'totChol', data = data)
plt.subplot(3,3,3)
sns.boxplot(x = 'TenYearCHD', y = 'sysBP', data = data)
plt.subplot(3,3,4)
sns.boxplot(x = 'TenYearCHD', y = 'diaBP', data = data)
plt.subplot(3,3,5)
sns.boxplot(x = 'TenYearCHD', y = 'BMI', data = data)
plt.subplot(3,3,6)
sns.boxplot(x = 'TenYearCHD', y = 'heartRate', data = data)
plt.subplot(3,3,7)
sns.boxplot(x = 'TenYearCHD', y = 'glucose', data = data)
plt.subplot(3,3,8)
sns.boxplot(x = 'TenYearCHD', y = 'education', data = data)
plt.subplot(3,3,9)
sns.boxplot(x = 'TenYearCHD', y = 'cigsPerDay', data = data)
plt.show()

# --- Code cell 20 ---

plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.countplot(x ='male', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,2)
sns.countplot(x ='currentSmoker', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,3)
sns.countplot(x ='BPMeds', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,4)
sns.countplot(x ='prevalentStroke', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,5)
sns.countplot(x ='prevalentHyp', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,6)
sns.countplot(x ='diabetes', hue = 'TenYearCHD', data = data)
plt.show()

# --- Code cell 21 ---
len(data.columns)

# --- Code cell 22 ---
#Correlation of output with numerical variables
numerical_columns = ['age', 'cigsPerDay', 'totChol', 'BMI','heartRate', 'glucose', 'sysBP','diaBP']

# plotting correlation heatmap
dataplot = sns.heatmap(data[numerical_columns].corr(), cmap="YlGnBu", annot=True)

# --- Code cell 23 ---
# Highly correlated features

#sysBP: Systolic Blood Pressure - The pressure exerted when the heart beats
#diaBP: Diastolic Blood Pressure - The pressure exerted on the walls of the arteries when the heart muscles relax 
#in between two beats

#Both systolic and diastolic blood pressure are important indicators of cardiovascular health, 
#and both can be associated with an increased risk of heart disease. 

# However, the relationship between blood pressure and heart disease is complex, 
#and both systolic and diastolic pressure readings are often considered together 
# to provide a more comprehensive assessment.

# --- Code cell 26 ---
data.head(10)

# --- Code cell 30 ---
def train_test_split_and_scale(data):
    y = data["TenYearCHD"]
    x = data.drop('TenYearCHD',axis=1)
    features = list(x.columns)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state = 0)
    scaler = MinMaxScaler()
    x_train = scaler.fit_transform(x_train) # scaling is done only on features
    x_test = scaler.transform(x_test)
    return x_train, x_test, y_train, y_test,features

# --- Code cell 31 ---
x_train, x_test, y_train, y_test,features = train_test_split_and_scale(data)

# --- Code cell 32 ---
Counter(y_train)

# --- Code cell 33 ---
def fit_and_evaluate_model(x_train, x_test, y_train, y_test,class_weight=None):
    lr = LogisticRegression(class_weight=class_weight)
    model = lr.fit(x_train, y_train) # model training
    lr_predict = lr.predict(x_test) # create predicted o/p 0/1
    lr_conf_matrix = confusion_matrix(y_test, lr_predict)
    lr_acc_score = accuracy_score(y_test, lr_predict)
    print("confusion matrix")
    print(lr_conf_matrix)
    print("\n")
    print("Accuracy of Logistic Regression:",lr_acc_score*100,'\n')
    print(classification_report(y_test,lr_predict))
    return model

# --- Code cell 34 ---
model = fit_and_evaluate_model(x_train, x_test, y_train, y_test)
print("odds ratio", np.exp(model.coef_))

# --- Code cell 36 ---
Counter(y_train)

# --- Code cell 37 ---
Counter(y_test)

# --- Code cell 41 ---
# define class weights
weight = {0:1, 1:4}
model = fit_and_evaluate_model(x_train, x_test, y_train, y_test,class_weight=weight)

# --- Code cell 42 ---
results = pd.DataFrame(model.predict_proba(x_test))
results.columns = ['class_0_proba','class_1_proba']
results['predicted_class'] = model.predict(x_test)
results.head(10)

# --- Code cell 43 ---
#save and reuse the model

# --- Code cell 45 ---
import joblib  # 'pip install joblib' if you get "Package Not found Error"
joblib.dump(model , 'model_classifier.pkl')

# --- Code cell 46 ---
model_read = joblib.load('model_classifier.pkl')
model_read.predict(x_test)

# --- Code cell 47 ---
print(model_read.intercept_)

# --- Code cell 49 ---
# Feature importance
# Odds ratio well above 1: increase in feature value increases the probability of the event (heart risk) happening

# Odds ratio well below 1: increase in feature value decreases the probability of the event (heart risk) happening

# An odds ratio near zero typically suggests that the predictor has
# a strong negative impact on the odds of the event occurring.

# An odds ratio near 1 indicates the feature may not be a strong predictor

# --- Code cell 50 ---
odds_ratio = np.exp(model.coef_)[0]

for z in range(len(features)):
     print("Odds ratio for feature {} is {}".format(features[z], odds_ratio[z]))

# --- Code cell 52 ---
print(model.coef_)

# --- Code cell 53 ---
print(np.exp(model.coef_))

💭 Short reflection

In one sentence: why can't we use linear regression for binary classification (0/1) instead of logistic regression?

✅ CORE (Must know)

  • Logistic regression: predicts probability P(Y=1) via sigmoid(z); z = linear combination of features.
  • Sigmoid: 1/(1+e^(-z)); squashes output to (0,1).
  • Decision boundary: typically 0.5; tune the threshold for the precision/recall tradeoff.
  • Confusion matrix: TN, FP, FN, TP; precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean.
  • Imbalanced data: use class_weight, F1, or AUC-ROC; don't rely on accuracy alone.
  • Interpret coefficients as log-odds; odds ratio = e^coefficient.

📚 NON-CORE (Good to know)

  • Log-loss (cross-entropy) as the cost function.
  • Multiclass: softmax and one-vs-rest.
  • Regularization (L1/L2) in logistic regression.

Summary

Concept              Simple Explanation
──────────────────────────────────────────────────────────────────────
Logistic Regression  Predicts probability of belonging to a class (0-1)
Sigmoid Function     S-shaped curve that squashes values between 0 and 1
Confusion Matrix     Shows TP, TN, FP, FN - where the model makes mistakes
Accuracy             % of correct predictions (can be misleading!)
Recall               Of actual positives, how many did we catch?
Precision            Of predicted positives, how many were correct?
Class Weights        Handle imbalanced data by penalizing minority-class mistakes more
Odds Ratio           How much each feature increases/decreases risk

🎉 You've Mastered Logistic Regression!

You can now build classification models for healthcare, finance, marketing, and more!