Logistic Regression Code – Line by Line

Every line of the logistic regression code explained in simple words. We predict 10-year heart disease risk (yes/no).

Download the dataset first: heart_disease_dataset.csv. Save it in the same folder as your script so pd.read_csv("heart_disease_dataset.csv") can find it.

Step 1: Imports

Load the libraries we need.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

What each line does

  • import pandas as pd — For DataFrames and reading CSV.
  • StandardScaler — Scales features so they have similar range (helps the model train better).
  • train_test_split — Splits data into train and test sets.
  • LogisticRegression — The model we use for yes/no (binary) classification.
  • confusion_matrix, accuracy_score, classification_report — Tools to see how well the model predicts (correct vs wrong, accuracy, precision, recall).

Step 2: Load the data

Read the heart disease CSV. The target column is TenYearCHD (1 = develops coronary heart disease within 10 years, 0 = does not).

data = pd.read_csv("heart_disease_dataset.csv")
data.head()

What each line does

  • pd.read_csv("heart_disease_dataset.csv") — Reads the file into a DataFrame. Columns: male, age, currentSmoker, cigsPerDay, totChol, sysBP, diaBP, BMI, glucose, TenYearCHD, etc.
  • data.head() — Shows the first 5 rows so you can see the data.
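
Real-world CSVs like this one often contain missing values, which LogisticRegression cannot handle. A quick check is worth running right after loading. This sketch uses a tiny hand-made DataFrame as a stand-in for the real file (the column names match the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: one missing value in "age".
data = pd.DataFrame({
    "age": [52, 61, np.nan, 45],
    "sysBP": [130.0, 145.0, 120.0, 110.0],
    "TenYearCHD": [1, 1, 0, 0],
})

print(data.isna().sum())   # count of missing values per column
clean = data.dropna()      # drop any row with a missing value
print(len(clean))          # 3 rows remain
```

If data.isna().sum() shows nonzero counts for your file, call data = data.dropna() before building X and y.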

Step 3: Prepare features (X) and target (y)

We predict TenYearCHD. So y is that column; X is everything else we use as input.

X = data.drop('TenYearCHD', axis=1)
y = data['TenYearCHD']

What each line does

  • X = data.drop('TenYearCHD', axis=1) — All columns except TenYearCHD; these are the features (inputs).
  • y = data['TenYearCHD'] — The target we want to predict (0 or 1).
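
A quick way to convince yourself the split into X and y worked: check the shapes. Again a tiny made-up DataFrame stands in for the real data:

```python
import pandas as pd

# Hypothetical mini-dataset with the same structure.
data = pd.DataFrame({
    "age": [52, 61, 45],
    "sysBP": [130.0, 145.0, 110.0],
    "TenYearCHD": [1, 1, 0],
})

X = data.drop("TenYearCHD", axis=1)
y = data["TenYearCHD"]

print(X.shape)     # (3, 2) -> 3 rows, 2 feature columns
print(y.shape)     # (3,)   -> one label per row
# drop() returns a new DataFrame; `data` itself keeps all 3 columns.
print(data.shape)  # (3, 3)
```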

Step 4: Split into train and test

Use part of the data to train the model and part to test it.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

What this line does

  • train_test_split(X, y, test_size=0.2, random_state=42) — Splits X and y so 80% is for training (X_train, y_train) and 20% for testing (X_test, y_test). random_state=42 keeps the split the same every time.
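
One optional extra, not used in the main code above: when the classes are imbalanced (far more 0s than 1s, as is typical for heart disease data), passing stratify=y keeps the class ratio the same in both splits. A small sketch with synthetic arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)   # imbalanced: 60% class 0, 40% class 1

# stratify=y preserves the 60/40 ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print(len(X_train), len(X_test))    # 5 5
print(y_train.sum(), y_test.sum())  # 2 2 -> two class-1 rows in each half
```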

Step 5: Scale the features (optional but recommended)

Scaling puts all features on a similar scale so no single feature (like totChol in the hundreds) dominates the others, which helps the solver converge.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

What each line does

  • StandardScaler() — Creates a scaler that will subtract the mean and divide by standard deviation.
  • scaler.fit_transform(X_train) — Fits the scaler on training data and transforms it (so each column has mean 0 and std 1).
  • scaler.transform(X_test) — Transforms test data using the same scaling (we don't fit on test data!).
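
You can verify what fit_transform does with a tiny array: after scaling, each column has mean 0 and standard deviation 1, even though the original columns were on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (think age vs. totChol).
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```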

Step 6: Build and train the model

We use LogisticRegression to learn the relationship between features and yes/no outcome.

model = LogisticRegression()
model.fit(X_train, y_train)

What each line does

  • LogisticRegression() — Creates an untrained logistic regression model with default settings. (If you ever see a convergence warning, pass a higher iteration limit, e.g. LogisticRegression(max_iter=1000).)
  • model.fit(X_train, y_train) — Trains the model on the training data. It learns weights so it can predict 0 or 1 from the features.
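
The "weights" the model learns are inspectable after fit: one coefficient per feature plus an intercept. A sketch on synthetic data (random features, with labels driven mostly by the first feature) shows the shape and the idea that a larger absolute weight means that feature pushes the prediction harder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
# Labels depend mostly on the first feature (plus a little noise).
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(int)

model = LogisticRegression()
model.fit(X_train, y_train)

print(model.coef_)       # shape (1, 2): one weight per feature
print(model.intercept_)  # the bias term
```

Here the first weight comes out much larger in magnitude than the second, matching how the labels were generated.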

Step 7: Predict and check accuracy

Predict on the test set and see how many we got right.

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

What each line does

  • model.predict(X_test) — Predicts 0 or 1 for each row in the test set.
  • accuracy_score(y_test, y_pred) — Fraction of predictions that are correct (e.g. 0.85 = 85% correct).
  • confusion_matrix(y_test, y_pred) — Prints a 2×2 table where rows are the actual class and columns the predicted class, so the layout is [[true negatives, false positives], [false negatives, true positives]]. Helps you see where the model makes mistakes.
  • classification_report(y_test, y_pred) — Prints precision, recall, F1-score for each class (0 and 1).
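
Beyond hard 0/1 labels, predict_proba gives the probability of each class, which is often more useful for risk scoring. The default cutoff behind model.predict is 0.5; for a medical screening task you might lower it to flag more at-risk patients (fewer missed cases, at the cost of more false alarms). A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)  # shape (200, 2): columns are P(class 0), P(class 1)
risk = proba[:, 1]              # probability of class 1 (the "at risk" class)

# The 0.5 cutoff reproduces model.predict(); a lower cutoff like 0.3
# flags more rows as at-risk.
default_pred = (risk >= 0.5).astype(int)
cautious_pred = (risk >= 0.3).astype(int)
print(default_pred.sum(), cautious_pred.sum())
```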