This walkthrough explains every line of the logistic regression code in simple words. We predict 10-year heart disease risk (yes/no).
Download the dataset first: heart_disease_dataset.csv. Save it in the same folder as your script so pd.read_csv("heart_disease_dataset.csv") works. If you save it under a different name, such as dataset.csv, pass that name to pd.read_csv instead.
Step 1: Imports
Load the libraries we need.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
What each line does
import pandas as pd — For DataFrames and reading CSV.
StandardScaler — Rescales each feature to mean 0 and standard deviation 1, so all features are on a similar scale (helps the model train better).
train_test_split — Splits data into train and test sets.
LogisticRegression — The model we use for yes/no (binary) classification.
confusion_matrix, accuracy_score, classification_report — Tools to see how well the model predicts (correct vs wrong, accuracy, precision, recall).
Step 2: Load the data
Read the heart disease CSV. Target column is TenYearCHD (1 = risk, 0 = no risk).
data = pd.read_csv("heart_disease_dataset.csv")
data.head()
What each line does
pd.read_csv("heart_disease_dataset.csv") — Reads the file into a DataFrame. Columns: male, age, currentSmoker, cigsPerDay, totChol, sysBP, diaBP, BMI, glucose, TenYearCHD, etc.
data.head() — Shows the first 5 rows so you can see the data.
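To try the loading step without the real file, here is a minimal sketch that reads a tiny CSV from memory. The rows are made up for illustration, and only a few of the columns listed above are included. The isna() check is an extra assumption, not part of the original code.

```python
import io
import pandas as pd

# Made-up rows for illustration only; the real file has more columns and rows.
csv_text = """male,age,currentSmoker,cigsPerDay,TenYearCHD
1,39,0,0,0
0,46,0,0,0
1,48,1,20,0
0,61,1,30,1
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())        # first rows (up to 5)
print(data.shape)         # (rows, columns), here (4, 5)
print(data.isna().sum())  # number of missing values in each column
```

If the missing-value count is not zero for the real file, those rows or cells need handling before training, because scikit-learn's LogisticRegression does not accept NaN values.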
Step 3: Prepare features (X) and target (y)
We predict TenYearCHD. So y is that column; X is everything else we use as input.
X = data.drop('TenYearCHD', axis=1)
y = data['TenYearCHD']
What each line does
X = data.drop('TenYearCHD', axis=1) — All columns except TenYearCHD; these are the features (inputs).
y = data['TenYearCHD'] — The target we want to predict (0 or 1).
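The feature/target split can be checked on a small example. This is a sketch with made-up values and only two feature columns. It shows that drop returns a new DataFrame and leaves data unchanged.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "cigsPerDay": [0, 0, 20, 30],
    "TenYearCHD": [0, 0, 0, 1],
})

X = data.drop("TenYearCHD", axis=1)  # DataFrame with the feature columns
y = data["TenYearCHD"]               # Series with the target values

print(list(X.columns))               # ['age', 'cigsPerDay']
print(X.shape, y.shape)              # (4, 2) (4,)
print("TenYearCHD" in data.columns)  # True: the original DataFrame is unchanged
```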
Step 4: Split into train and test
Use part of the data to train the model and part to test it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
What this line does
train_test_split(X, y, test_size=0.2, random_state=42) — Splits X and y so 80% is for training (X_train, y_train) and 20% for testing (X_test, y_test). random_state=42 keeps the split the same every time.
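The 80/20 split and the effect of random_state can be seen in a short sketch. The data is randomly generated and stands in for the real features, and the name X_train2 is only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, and a 0/1 target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(30, 70, size=100),
    "sysBP": rng.normal(130, 15, size=100),
})
y = pd.Series(rng.integers(0, 2, size=100), name="TenYearCHD")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)

# Splitting again with the same random_state selects the same rows.
X_train2, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.index.equals(X_train2.index))  # True
```

When one class is much rarer than the other, passing stratify=y to train_test_split keeps the class proportions similar in both parts.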
Step 5: Scale the features (optional but recommended)
Scaling puts all features on a similar scale so the model trains better.
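A minimal sketch of this step, assuming StandardScaler is fit on the training set only and then reused on the test set. The variable names scaler, X_train_scaled and X_test_scaled are assumptions for this example, and the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values standing in for X_train and X_test.
X_train = np.array([[40.0, 120.0], [50.0, 140.0], [60.0, 160.0]])
X_test = np.array([[45.0, 130.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

Fitting the scaler on the training data only means no information from the test set leaks into the preprocessing.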
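After scaling, the model is fit on the training data, then used to predict on the test set and scored. The prediction and evaluation calls do the following:
model.predict(X_test) — Predicts 0 or 1 for each row in the test set. The result is stored in y_pred.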
accuracy_score(y_test, y_pred) — Fraction of predictions that are correct (e.g. 0.85 = 85% correct).
confusion_matrix(y_test, y_pred) — Shows: true negatives, false positives, false negatives, true positives. Helps you see where the model makes mistakes.
classification_report(y_test, y_pred) — Prints precision, recall, F1-score for each class (0 and 1).
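The calls above can be put together in one runnable sketch. It uses randomly generated data instead of the heart disease file and the default LogisticRegression settings, so the numbers it prints say nothing about the real dataset. Overwriting X_train and X_test with their scaled versions is an assumption made to match the model.predict(X_test) call.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 200 rows, 3 features; the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on train only
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)     # learn the weights from the training set
y_pred = model.predict(X_test)  # 0 or 1 for each test row

print(accuracy_score(y_test, y_pred))

# For labels 0 and 1 the matrix layout is [[TN, FP], [FN, TP]]:
# rows are the actual classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

print(classification_report(y_test, y_pred))
```

Accuracy equals (TN + TP) divided by the number of test rows, so the confusion matrix shows where that single number comes from.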