Predict heart disease risk! Learn classification, probability, confusion matrix, and how to handle imbalanced data.
Linear Regression: Predicts a NUMBER (house price = $350,000)
Logistic Regression: Predicts a CATEGORY (Will get heart disease? Yes/No)
Despite the name "Regression", Logistic Regression is used for Classification!
```
LINEAR REGRESSION (Predict Numbers):
─────────────────────────────────────
Input:  House size, bedrooms, location
Output: $425,000 (continuous number)

Price
  ^
450K │          ●
400K │       ●
350K │    ●
300K │ ●
     └──────────────────→ Size
```
```
LOGISTIC REGRESSION (Predict Categories):
─────────────────────────────────────────
Input:  Age, BP, cholesterol, smoking
Output: 0 (No heart disease) or 1 (Yes)

Probability
  ^
1.0 │         ●●●●●
    │       ●
0.5 │      ●          ← S-shaped curve!
    │     ●
0.0 │●●●●
    └──────────────────→ Risk Score
```
Linear Regression can output 1.5 or -0.3 (doesn't make sense for Yes/No!)
Logistic Regression uses a Sigmoid function to squeeze outputs between 0 and 1.
Output > 0.5 → Predict "Yes" (1)
Output ≤ 0.5 → Predict "No" (0)
The S-curve converts any number into a probability (0 to 1). Changing the cutoff (threshold) changes which cases get predicted as "Yes"!
Low threshold = more "Yes" predictions (catches more, but more false alarms). High threshold = fewer "Yes" predictions (misses some, but more precise).
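A minimal sketch of the sigmoid and the threshold idea, using toy risk scores (not the real model's outputs):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Raw "risk scores" (any real numbers) become probabilities
scores = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probs = sigmoid(scores)
print(probs.round(3))  # ≈ [0.047, 0.269, 0.5, 0.731, 0.953]

# Lower threshold → more "Yes" predictions; higher → fewer
for threshold in [0.3, 0.5, 0.7]:
    predictions = (probs > threshold).astype(int)
    print(f"threshold={threshold}: {predictions}")
```

Notice that a score of 0 maps exactly to probability 0.5 — the default decision boundary.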
We're using the Framingham Heart Study dataset to predict 10-year risk of Coronary Heart Disease (CHD).
Download this CSV file and save it in your working directory to run the code examples.
| Feature | Description | Type |
|---|---|---|
| male | Gender (1 = male, 0 = female) | Categorical |
| age | Age of patient | Numerical |
| currentSmoker | Is patient a current smoker? | Categorical |
| cigsPerDay | Cigarettes smoked per day | Numerical |
| BPMeds | On blood pressure medication? | Categorical |
| prevalentStroke | Had a stroke before? | Categorical |
| prevalentHyp | Is patient hypertensive? | Categorical |
| diabetes | Has diabetes? | Categorical |
| totChol | Total cholesterol level | Numerical |
| sysBP / diaBP | Systolic / Diastolic blood pressure | Numerical |
| BMI | Body Mass Index | Numerical |
| glucose | Blood glucose level | Numerical |
| TenYearCHD | 10-year risk of heart disease (TARGET) | 0 = No, 1 = Yes |
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning imports
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Load the heart disease dataset (download from link above!)
data = pd.read_csv("heart_disease_dataset.csv")
print(f"Dataset shape: {data.shape}")
print(data.head())
# Dataset shape: (4238, 16)
#    male  age  education  currentSmoker  cigsPerDay ...
```
pd.read_csv(...) โ Loads the heart disease CSV into data.
data.shape โ Number of rows and columns.
data.head() โ First 5 rows. Other lines: imports for scaling, train/test split, LogisticRegression, and metrics.
```python
# Check missing values
print(data.isnull().sum())
# male            0
# age             0
# cigsPerDay     29  ← Missing!
# totChol        50  ← Missing!
# BMI            19  ← Missing!
# heartRate       1  ← Missing!
# glucose       388  ← Missing!
# TenYearCHD      0
```
Mean is affected by outliers (extreme values pull it up/down)
Median is robust - it's the middle value, unaffected by outliers!
For medical data with potential extreme values, median is safer.
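A quick toy demonstration of why the median is safer than the mean when outliers are present (the numbers below are made up for illustration):

```python
import numpy as np

# Cholesterol readings with one extreme data-entry outlier (600)
chol = np.array([180, 190, 200, 210, 600])

print("Mean:  ", np.mean(chol))    # 276.0 ← dragged up by the outlier
print("Median:", np.median(chol))  # 200.0 ← still a typical value
```

Filling missing values with 276 would bias the data upward; 200 stays representative.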
```python
# Fill missing values with MEDIAN (robust to outliers)
numerical_columns = ['cigsPerDay', 'totChol', 'BMI', 'heartRate', 'glucose']
for col in numerical_columns:
    median_value = data[col].median()
    data[col] = data[col].fillna(median_value)
    print(f"{col}: filled with median = {median_value}")

# Verify no more missing values
print("\nMissing values after imputation:")
print(data.isnull().sum().sum())  # Output: 0
```
```python
# Boxplots: Compare features between heart disease (1) vs no heart disease (0)
plt.figure(figsize=(20, 12))
features_to_plot = ['age', 'totChol', 'sysBP', 'diaBP', 'BMI',
                    'heartRate', 'glucose', 'cigsPerDay']
for i, col in enumerate(features_to_plot, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x='TenYearCHD', y=col, data=data,
                hue='TenYearCHD', palette='Set2', legend=False)
    plt.title(f'{col} by Heart Disease Risk')
plt.tight_layout()
plt.show()
```
```python
# Correlation heatmap for numerical features
numerical_cols = ['age', 'cigsPerDay', 'totChol', 'BMI',
                  'heartRate', 'glucose', 'sysBP', 'diaBP']
plt.figure(figsize=(10, 8))
sns.heatmap(data[numerical_cols].corr(), cmap="YlGnBu", annot=True, fmt=".2f")
plt.title("Correlation Between Numerical Features")
plt.show()
```
sysBP (Systolic): Pressure when heart beats
diaBP (Diastolic): Pressure when heart relaxes
They measure similar things, so they're correlated. In advanced models, you might drop one!
```python
# Separate features (X) and target (y)
y = data["TenYearCHD"]               # What we want to predict
X = data.drop('TenYearCHD', axis=1)  # Everything else

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Scale features to 0-1 range (important for Logistic Regression!)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # Fit AND transform on training
X_test = scaler.transform(X_test)        # Only transform on test

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
# Training samples: 3390
# Test samples: 848
```
```python
# Create and train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
# Accuracy: 84.55%
```
Let's look deeper at what the model is actually doing!
The confusion matrix shows WHERE your model makes mistakes:
```python
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
#                Predicted
#                 NO  YES
# Actual NO  [[710    0]
# Actual YES  [131    7]]
```
```
                   PREDICTED
                  NO        YES
             ┌─────────┬─────────┐
Actual NO    │   710   │    0    │  ← Great! No false alarms
             ├─────────┼─────────┤
Actual YES   │   131   │    7    │  ← PROBLEM! Only caught 7 out of 138!
             └─────────┴─────────┘
```
Out of 138 people who WILL get heart disease:
- Model correctly identified: 7 (5%)
- Model MISSED: 131 (95%)

⚠️ This is TERRIBLE for healthcare! Missing heart disease patients is DANGEROUS!
```python
# Full classification report
print(classification_report(y_test, y_pred))
#               precision    recall  f1-score   support
#
#            0       0.84      1.00      0.92       710
#            1       1.00      0.05      0.10       138  ← Recall is only 5%!
#
#     accuracy                           0.85       848
```
| Metric | Formula | Our Value | Meaning |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | 84.5% | Overall correct predictions (MISLEADING here!) |
| Precision | TP / (TP+FP) | 100% | When we predict YES, how often correct? |
| Recall | TP / (TP+FN) | 5% | Of actual YES cases, how many did we catch? |
| F1-Score | 2*(P*R)/(P+R) | 10% | Balance between Precision and Recall |
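You can verify every number in this table by hand from the confusion matrix above (TP = 7, TN = 710, FP = 0, FN = 131):

```python
# Recompute the metrics directly from the confusion matrix counts
TP, TN, FP, FN = 7, 710, 0, 131

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.1%}")   # 84.6%
print(f"Precision: {precision:.1%}")  # 100.0%
print(f"Recall:    {recall:.1%}")     # 5.1%
print(f"F1-score:  {f1:.1%}")         # 9.7%
```

This makes the trap obvious: accuracy looks great while recall is catastrophic.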
We'd rather have some false alarms (tell healthy people to get checked) than MISS someone who actually has heart disease!
Missing 95% of heart disease patients is unacceptable!
```python
# Check class distribution
from collections import Counter
print("Training set distribution:")
print(Counter(y_train))
# Counter({0: 2875, 1: 515})

print("\nPercentages:")
print(f"No heart disease (0): {2875/3390:.1%}")
print(f"Heart disease (1): {515/3390:.1%}")
# No heart disease (0): 84.8%
# Heart disease (1): 15.2%
```
The model learns: "If I just predict NO for everyone, I'll be right 85% of the time!"
It's taking the lazy path instead of learning the actual patterns!
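Here is a sketch of that "lazy path" as a baseline model, using synthetic labels with the same ~85/15 imbalance (the data below is randomly generated, not the Framingham set):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with roughly the same imbalance as our data: ~85% class 0
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.15).astype(int)
X = np.zeros((1000, 1))  # features don't matter for this baseline

# A "model" that always predicts the majority class (NO)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(f"Baseline accuracy: {baseline.score(X, y):.1%}")  # ~85%, with recall = 0!
```

An ~85% accuracy here proves nothing — the baseline never catches a single positive case.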
We tell the model: "Mistakes on the MINORITY class (heart disease) are MORE EXPENSIVE!"
Weight {0:1, 1:4} means missing a heart disease case costs 4x more than a false alarm.
```python
# Train with class weights to handle imbalance
weight = {0: 1, 1: 4}  # Penalize missing heart disease 4x more
model_balanced = LogisticRegression(class_weight=weight)
model_balanced.fit(X_train, y_train)

y_pred_balanced = model_balanced.predict(X_test)
print("Confusion Matrix (Balanced):")
print(confusion_matrix(y_test, y_pred_balanced))
#                Predicted
#                 NO  YES
# Actual NO  [[567  143]
# Actual YES  [ 61   77]]
```
```
BEFORE (No weights):              AFTER (With weights):
─────────────────────             ─────────────────────
Caught: 7 out of 138 (5%)         Caught: 77 out of 138 (56%) ✓
Missed: 131 patients              Missed: 61 patients (better!)

Trade-off: More false alarms (143 vs 0), but that's OK in healthcare!
```
```python
print(classification_report(y_test, y_pred_balanced))
#               precision    recall  f1-score   support
#
#            0       0.90      0.80      0.85       710
#            1       0.35      0.56      0.43       138  ← Recall improved: 5% → 56%!
#
#     accuracy                           0.76       848
```
Accuracy dropped from 85% to 76%, but that's a GOOD trade-off in healthcare.
We're now catching 56% of heart disease patients instead of just 5%!
It tells you how much each feature affects the probability of heart disease:
```python
# Calculate odds ratios
features = X.columns.tolist()
odds_ratios = np.exp(model_balanced.coef_)[0]

print("Feature Importance (Odds Ratios):")
print("=" * 50)
for feature, odds in zip(features, odds_ratios):
    print(f"{feature:20s}: {odds:.2f}")
# Feature Importance (Odds Ratios):
# ==================================================
# male                : 1.52   ← Being male increases risk
# age                 : 11.37  ← AGE is HUGE risk factor!
# education           : 1.00   ← No effect
# currentSmoker       : 1.11   ← Slight increase
# cigsPerDay          : 4.23   ← Major risk factor!
# BPMeds              : 1.17   ← Slight increase
# prevalentStroke     : 2.34   ← Previous stroke increases risk
# prevalentHyp        : 1.47   ← Hypertension increases risk
# diabetes            : 1.78   ← Diabetes increases risk
# totChol             : 2.33   ← High cholesterol increases risk
# sysBP               : 5.59   ← HIGH blood pressure = big risk!
# diaBP               : 0.69   ← Slightly protective (after controlling for sysBP)
# BMI                 : 1.48   ← Higher BMI increases risk
# heartRate           : 0.89   ← Slight protective effect
# glucose             : 2.75   ← High glucose increases risk
```
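Where do these numbers come from? An odds ratio is simply e raised to the model's coefficient. A tiny illustration with a hypothetical coefficient (not one of the values above):

```python
import numpy as np

# Odds ratio = e^(coefficient): how the odds of heart disease multiply
# when the (scaled) feature increases by one unit
coef = 0.7  # hypothetical logistic regression coefficient
odds_ratio = np.exp(coef)
print(f"Odds ratio: {odds_ratio:.2f}")  # ≈ 2.01 → odds roughly double

# coef = 0 → ratio = 1 (no effect); coef < 0 → ratio < 1 (protective)
print(np.exp(0.0), round(np.exp(-0.7), 2))
```

So a ratio well above 1 means the feature pushes risk up; well below 1 means it pushes risk down.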
```python
import joblib

# Save the model to a file
joblib.dump(model_balanced, 'heart_disease_model.pkl')
print("✓ Model saved!")

# Load the model later
loaded_model = joblib.load('heart_disease_model.pkl')

# Use loaded model to predict
new_predictions = loaded_model.predict(X_test)
print("✓ Model loaded and working!")

# Check model parameters
print("Model intercept:", loaded_model.intercept_)
# Model intercept: [-3.07863041]
```
Training takes time! Save your trained model so you can reuse it later without retraining.
```python
# Get probabilities instead of just 0/1
probabilities = model_balanced.predict_proba(X_test)

# Create a nice DataFrame to view results
results = pd.DataFrame({
    'Prob_No_HeartDisease': probabilities[:, 0],
    'Prob_HeartDisease': probabilities[:, 1],
    'Predicted': model_balanced.predict(X_test),
    'Actual': y_test.values
})
print(results.head(10))
#    Prob_No_HeartDisease  Prob_HeartDisease  Predicted  Actual
# 0                  0.73               0.27          0       0
# 1                  0.92               0.08          0       0
# 2                  0.68               0.32          0       1  ← Missed!
# 3                  0.45               0.55          1       1  ← Correct!
# 4                  0.89               0.11          0       0
```
Instead of just "Yes/No", you can say:
"This patient has a 55% probability of heart disease in 10 years."
Doctors can then decide: High-risk patients need immediate intervention!
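A sketch of turning probabilities into decisions at different cutoffs (the probabilities below are hypothetical, not model output):

```python
import numpy as np

# Hypothetical predicted probabilities of heart disease for 5 patients
probs = np.array([0.08, 0.27, 0.32, 0.55, 0.72])

# Default cutoff 0.5 vs a more cautious 0.3 for screening
print((probs >= 0.5).astype(int))  # [0 0 0 1 1]
print((probs >= 0.3).astype(int))  # [0 0 1 1 1] ← flags one more patient for a checkup
```

A screening program might deliberately use the lower cutoff to trade false alarms for higher recall.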
The course source uses dataset.csv (heart disease risk; target TenYearCHD). Key code: data = pd.read_csv("dataset.csv"); StandardScaler or MinMaxScaler; train_test_split; LogisticRegression().fit(X_train, y_train); confusion_matrix, accuracy_score, roc_curve, classification_report. Download dataset.csv from the datasets page. See Logistic Regression.pdf in the course source for slides.
Every line of code from the course notebook (verbatim).
# --- Code cell 1 ---
from IPython.core.display import HTML
HTML("""
<style>
h1 { color: blue !important; }
h2 { color: green !important; }
</style>
""")
# --- Code cell 2 ---
import pandas as pd
#import ydata_profiling as yp
# data preprocessing
from sklearn.preprocessing import StandardScaler
# data splitting
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# --- Code cell 3 ---
data = pd.read_csv("dataset.csv")
# --- Code cell 5 ---
# Sex: male or female
# Age: Age of the patient
# Current Smoker: whether or not the patient is a current smoker
# Cigs Per Day: the number of cigarettes that the person smoked on average in one day
# BP Meds: whether or not the patient was on blood pressure medication
# Prevalent Stroke: whether or not the patient had previously had a stroke
# Prevalent Hyp: whether or not the patient was hypertensive
# Diabetes: whether or not the patient had diabetes
# Tot Chol: total cholesterol level
# Sys BP: systolic blood pressure
# Dia BP: diastolic blood pressure
# BMI: Body Mass Index
# Heart Rate: heart rate
# Glucose: glucose level
#Predict variable (desired target)
# 10 year risk of coronary heart disease CHD (binary: '1' means 'Yes', '0' means 'No')
# --- Code cell 6 ---
data.head(10)
# --- Code cell 8 ---
data.info()
# --- Code cell 9 ---
print(data.isnull().sum())
# --- Code cell 10 ---
for col in data.columns:
    print(col)
    print(data[col].unique())
    print('\n')
# --- Code cell 12 ---
#categorical_columns = ['education','BPMeds']
numerical_columns = ['cigsPerDay', 'totChol', 'BMI','heartRate', 'glucose']
for column in list(numerical_columns):
    data[column] = data[column].fillna(data[column].median())
# --- Code cell 13 ---
print(data.isnull().sum())
# --- Code cell 16 ---
data.head(10)
# --- Code cell 17 ---
# What if we had a patient number column?
# Would that be useful as a feature?
# Drop such columns from data
# --- Code cell 18 ---
data.columns
# --- Code cell 19 ---
#Visualizing numerical variables
plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'TenYearCHD', y = 'age', data = data)
plt.subplot(3,3,2)
sns.boxplot(x = 'TenYearCHD', y = 'totChol', data = data)
plt.subplot(3,3,3)
sns.boxplot(x = 'TenYearCHD', y = 'sysBP', data = data)
plt.subplot(3,3,4)
sns.boxplot(x = 'TenYearCHD', y = 'diaBP', data = data)
plt.subplot(3,3,5)
sns.boxplot(x = 'TenYearCHD', y = 'BMI', data = data)
plt.subplot(3,3,6)
sns.boxplot(x = 'TenYearCHD', y = 'heartRate', data = data)
plt.subplot(3,3,7)
sns.boxplot(x = 'TenYearCHD', y = 'glucose', data = data)
plt.subplot(3,3,8)
sns.boxplot(x = 'TenYearCHD', y = 'education', data = data)
plt.subplot(3,3,9)
sns.boxplot(x = 'TenYearCHD', y = 'cigsPerDay', data = data)
plt.show()
# --- Code cell 20 ---
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.countplot(x ='male', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,2)
sns.countplot(x ='currentSmoker', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,3)
sns.countplot(x ='BPMeds', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,4)
sns.countplot(x ='prevalentStroke', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,5)
sns.countplot(x ='prevalentHyp', hue = 'TenYearCHD', data = data)
plt.subplot(2,3,6)
sns.countplot(x ='diabetes', hue = 'TenYearCHD', data = data)
plt.show()
# --- Code cell 21 ---
len(data.columns)
# --- Code cell 22 ---
#Correlation of output with numerical variables
numerical_columns = ['age', 'cigsPerDay', 'totChol', 'BMI','heartRate', 'glucose', 'sysBP','diaBP']
# plotting correlation heatmap
dataplot = sns.heatmap(data[numerical_columns].corr(), cmap="YlGnBu", annot=True)
# --- Code cell 23 ---
# Highly correlated features
#sysBP: Systolic Blood Pressure - The pressure exerted when the heartbeats
#diaBP: Diastolic Blood Pressure - The pressure exerted on the walls of the arteries when the heart muscles relax
#in between two beats
#Both systolic and diastolic blood pressure are important indicators of cardiovascular health,
#and both can be associated with an increased risk of heart disease.
# However, the relationship between blood pressure and heart disease is complex,
#and both systolic and diastolic pressure readings are often considered together
# to provide a more comprehensive assessment.
# --- Code cell 26 ---
data.head(10)
# --- Code cell 30 ---
def train_test_split_and_scale(data):
    y = data["TenYearCHD"]
    x = data.drop('TenYearCHD', axis=1)
    features = list(x.columns)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)
    scaler = MinMaxScaler()
    x_train = scaler.fit_transform(x_train)  # scaling is done only on features
    x_test = scaler.transform(x_test)
    return x_train, x_test, y_train, y_test, features
# --- Code cell 31 ---
x_train, x_test, y_train, y_test,features = train_test_split_and_scale(data)
# --- Code cell 32 ---
Counter(y_train)
# --- Code cell 33 ---
def fit_and_evaluate_model(x_train, x_test, y_train, y_test, class_weight=None):
    lr = LogisticRegression(class_weight=class_weight)
    model = lr.fit(x_train, y_train)  # model training
    lr_predict = lr.predict(x_test)   # create predicted o/p 0/1
    lr_conf_matrix = confusion_matrix(y_test, lr_predict)
    lr_acc_score = accuracy_score(y_test, lr_predict)
    print("confusion matrix")
    print(lr_conf_matrix)
    print("\n")
    print("Accuracy of Logistic Regression:", lr_acc_score*100, '\n')
    print(classification_report(y_test, lr_predict))
    return model
# --- Code cell 34 ---
model = fit_and_evaluate_model(x_train, x_test, y_train, y_test)
print("odds ratio", np.exp(model.coef_))
# --- Code cell 36 ---
Counter(y_train)
# --- Code cell 37 ---
Counter(y_test)
# --- Code cell 41 ---
# define class weights
weight = {0:1, 1:4}
model = fit_and_evaluate_model(x_train, x_test, y_train, y_test,class_weight=weight)
# --- Code cell 42 ---
results = pd.DataFrame(model.predict_proba(x_test))
results.columns = ['class_0_proba','class_1_proba']
results['predicted_class'] = model.predict(x_test)
results.head(10)
# --- Code cell 43 ---
#save and reuse the model
# --- Code cell 45 ---
import joblib # 'pip install joblib' if you get "Package Not found Error"
joblib.dump(model , 'model_classifier.pkl')
# --- Code cell 46 ---
model_read = joblib.load('model_classifier.pkl')
print(model_read.intercept_)
# --- Code cell 47 ---
model_read.predict(x_test)
# --- Code cell 49 ---
# Feature importance
# Odds ratio well above 1: increase in feature value increases probability of the event (heart risk) happening
# Odds ratio well below 1: increase in feature value decreases probability of the event (heart risk) happening
# A feature with an odds ratio near zero typically suggests that the associated predictor has
#a strong negative impact on the odds of the event occurring.
# Odds ratio near 1 indicates that feature may not be a strong predictor
# --- Code cell 50 ---
odds_ratio = np.exp(model.coef_)[0]
for z in range(len(features)):
    print("Odds ratio for feature {} is {}".format(features[z], odds_ratio[z]))
# --- Code cell 52 ---
print(model.coef_)
# --- Code cell 53 ---
print(np.exp(model.coef_))
In one sentence: why can't we use linear regression for binary classification (0/1) instead of logistic regression?
| Concept | Simple Explanation |
|---|---|
| Logistic Regression | Predicts probability of belonging to a class (0-1) |
| Sigmoid Function | S-shaped curve that squishes values between 0 and 1 |
| Confusion Matrix | Shows TP, TN, FP, FN - where model makes mistakes |
| Accuracy | % of correct predictions (can be misleading!) |
| Recall | Of actual positives, how many did we catch? |
| Precision | Of predicted positives, how many were correct? |
| Class Weights | Fix imbalanced data by penalizing minority class mistakes more |
| Odds Ratio | How much each feature increases/decreases risk |
You can now build classification models for healthcare, finance, marketing, and more!