🔬 SVM Code Walkthrough

Line by line: predict which bank customers will leave using Support Vector Machines

Download BankChurnersData.csv

Step 1: Explore the Data

First we load the dataset and get a feel for what's inside. The Bank Churners dataset has info about credit card customers — some stayed (Existing Customer) and some left (Attrited Customer).

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

df = pd.read_csv('datasets/BankChurnersData.csv')
print(df.shape)
df.head()

Line-by-line

  • warnings.filterwarnings('ignore') — Hides noisy sklearn warnings.
  • import SVC — The Support Vector Classifier class from scikit-learn.
  • df.shape — Prints (10127, 23) — 10,127 customers, 23 columns.
  • df.head() — Shows the first 5 rows so we can eyeball the data.
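Another worthwhile first check is the class balance of the target. A minimal sketch with a stand-in DataFrame (on the real data, the call is simply `df['Attrition_Flag'].value_counts(normalize=True)`):

```python
import pandas as pd

# Stand-in for the real dataset: roughly 84% stayed, 16% churned
toy_df = pd.DataFrame({
    'Attrition_Flag': ['Existing Customer'] * 84 + ['Attrited Customer'] * 16
})

# Fraction of each class — exposes the imbalance we handle later
ratios = toy_df['Attrition_Flag'].value_counts(normalize=True)
print(ratios)
# Existing Customer    0.84
# Attrited Customer    0.16
```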

Understanding Every Column

Columns in the dataset:

CLIENTNUM, Attrition_Flag, Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category, Card_Category, Total_Trans_Ct, Total_Trans_Amt, Credit_Limit, Avg_Open_To_Buy, Total_Revolving_Bal, Months_on_book, Contacts_Count_12_mon, Months_Inactive_12_mon
df.info()
df.describe()

🧒 ELI5: What is this dataset?

Imagine you run a bank. You have 10,127 customers with credit cards. Some are happy and stay — others get frustrated and leave. You want a computer to learn the patterns of people who leave so you can catch them early and offer them a deal to stay!

🎯 Analogy

Think of it like a school yearbook. Each row is a student, each column is a fact about them (age, grades, clubs). We want to predict who will transfer to another school based on all those facts.

Step 2: Visualize the Data (EDA)

Before training any model, we need to see the data. Visualizations reveal which features differ between churned and existing customers.

Boxplots: Numerical Features vs. Churn

num_cols = df.select_dtypes(include=['int64','float64']).columns.tolist()
num_cols.remove('CLIENTNUM')

fig, axes = plt.subplots(4, 4, figsize=(20,16))
for i, col in enumerate(num_cols):
    ax = axes[i//4][i%4]
    sns.boxplot(data=df, x='Attrition_Flag', y=col, ax=ax,
               palette={'Existing Customer':'#a78bfa','Attrited Customer':'#f87171'})
    ax.set_title(col, fontsize=10)
plt.tight_layout()
plt.show()

What each line does

  • select_dtypes — Grabs only numeric columns (not text like Gender).
  • remove('CLIENTNUM') — The ID column is just a number, not a feature.
  • subplots(4,4) — Creates a 4×4 grid of small charts.
  • boxplot — For each column, draws a box showing the range of values split by Attrited vs Existing.

Correlation Heatmap

plt.figure(figsize=(14,10))
sns.heatmap(df[num_cols].corr(), annot=True, fmt='.2f',
           cmap='RdPu', center=0, linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

🧒 ELI5: What's a correlation heatmap?

It's like a friendship chart. If two columns move together (when one goes up, the other goes up too), they're highly correlated — shown in dark purple. If they don't care about each other, it's close to zero. We look for features that are too friendly (redundant) and drop one of them.

Top Features Correlated with Churn

  • Total_Trans_Ct: 0.37
  • Total_Trans_Amt: 0.31
  • Total_Revolving_Bal: 0.26
  • Contacts_Count_12_mon: 0.20
  • Months_Inactive_12_mon: 0.16

Higher values here mean the feature is more predictive of churn.

🎯 Analogy

Boxplots are like comparing the height of basketball vs. chess club members. You immediately see which group is taller. Similarly, we see which features look different for churned customers — those are the ones SVM will rely on most.

Step 3: Feature Engineering

Raw data isn't ready for SVM. We need to convert text to numbers, create new useful features, remove redundant ones, and scale everything to the same range.

Encode the Target & Create Dummies

# Convert target: 1 = Attrited (churned), 0 = Existing (stayed)
df['Attrition_Flag'] = df['Attrition_Flag'].map({
    'Attrited Customer': 1,
    'Existing Customer': 0
})

# Drop the ID column — it's useless for prediction
df = df.drop(columns=['CLIENTNUM'])

# One-hot encode categorical columns
cat_cols = df.select_dtypes(include=['object']).columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

🧒 ELI5: Why dummy variables?

SVM only understands numbers, not words. So "Male"/"Female" becomes two columns: Gender_M = 1 means male, Gender_M = 0 means female. drop_first=True avoids redundancy — if it's not Male, it must be Female. One column is enough!
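A tiny sketch of what `get_dummies` with `drop_first=True` actually produces, using a toy Gender column:

```python
import pandas as pd

# One text column in, one 0/1 column out: Gender_F is dropped because
# it is fully implied by Gender_M (drop_first=True)
toy = pd.DataFrame({'Gender': ['M', 'F', 'M']})
encoded = pd.get_dummies(toy, columns=['Gender'], drop_first=True)
print(encoded.columns.tolist())  # ['Gender_M']
```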

Create New Feature & Drop Redundant Ones

# New feature: average value per transaction
df['Avg_Transaction_Value'] = df['Total_Trans_Amt'] / df['Total_Trans_Ct']

# Avg_Open_To_Buy is ~0.99 correlated with Credit_Limit — drop it
df = df.drop(columns=['Avg_Open_To_Buy'])
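The ~0.99 correlation is easy to verify before dropping a column. A sketch with stand-in data (in the real dataset, Avg_Open_To_Buy is essentially Credit_Limit minus the revolving balance, which is why the two track each other so closely):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins mimicking the relationship between the two columns
rng = np.random.default_rng(0)
credit_limit = rng.uniform(1_000, 35_000, size=500)
revolving = rng.uniform(0, 2_500, size=500)
toy = pd.DataFrame({
    'Credit_Limit': credit_limit,
    'Avg_Open_To_Buy': credit_limit - revolving,
})

# Near 1.0: keeping both columns adds almost no information
corr = toy['Credit_Limit'].corr(toy['Avg_Open_To_Buy'])
print(round(corr, 3))
```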

Train-Test Split & Scaling

X = df.drop(columns=['Attrition_Flag'])
y = df['Attrition_Flag']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

Line-by-line

  • stratify=y — Ensures both train and test sets have the same ratio of churned/existing customers (~16%/84%).
  • MinMaxScaler — Scales every feature to 0–1. Without this, Credit_Limit (thousands) would dominate Customer_Age (tens).
  • fit_transform on train, transform on test — We learn the min/max from training data only, then apply it to test. Never peek at test data!

Why Scaling Matters for SVM

With scaling, distance is measured fairly in all directions; unscaled features warp the SVM decision boundary toward whichever feature has the largest range.

⚠️ Common Pitfall

If you call scaler.fit_transform(X_test) instead of scaler.transform(X_test), you're "peeking" at test data. This causes data leakage — your accuracy looks great in training but falls apart in production.

🎯 Analogy

Scaling is like converting all currencies to USD before comparing prices. Without it, 10,000 Japanese Yen looks way bigger than $90, even though they're roughly equal. SVM computes distances, so every feature must be on the same scale.

Step 4: Train SVM Models

Now the fun part! We train two SVM models — one with a linear kernel and one with an RBF kernel — and compare their performance.

Linear Kernel SVM

svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_lin = svm_linear.predict(X_test)

print("Linear SVM Accuracy:", accuracy_score(y_test, y_pred_lin))
print(classification_report(y_test, y_pred_lin))
print("Support vectors:", svm_linear.n_support_)

RBF Kernel SVM

svm_rbf = SVC(kernel='rbf', C=1.0, gamma=0.1, random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)

print("RBF SVM Accuracy:", accuracy_score(y_test, y_pred_rbf))
print(classification_report(y_test, y_pred_rbf))
print("Support vectors:", svm_rbf.n_support_)

Compare Linear vs. RBF Results

Linear kernel results on the test set:

  • Accuracy: 93.2%
  • Precision (Churn): 0.88
  • Recall (Churn): 0.81
  • F1-Score (Churn): 0.84
  • Support Vectors: [1234, 456]

🧒 ELI5: Linear vs. RBF

Linear draws a straight line (or flat plane in many dimensions) between churned and loyal customers. RBF can draw curvy, wavy boundaries — it finds patterns even when the groups aren't neatly separated by a straight cut.

🎯 Analogy

Linear is like cutting a pizza with one straight slice. RBF is like using a cookie cutter — it can carve out any shape. RBF is more flexible but risks over-fitting if you're not careful.

Key Insight: Support Vectors

  • n_support_ reports, per class, how many training samples lie on or inside the margin; these support vectors alone define the decision boundary.
  • Fewer support vectors = simpler model, faster predictions.
  • If almost every sample is a support vector, the model may be over-fitting.
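A quick way to apply this check in practice is to compare the support-vector count to the training size. A sketch on synthetic stand-in data (`make_classification` here is just a placeholder for the scaled churn features):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

clf = SVC(kernel='rbf', C=1.0, gamma=0.1, random_state=42).fit(X, y)

# n_support_ is a per-class array; sum it and compare to the training size
sv_fraction = clf.n_support_.sum() / len(X)
print(f"{sv_fraction:.0%} of training samples are support vectors")
# A fraction near 100% is a red flag for over-fitting (or too-large gamma)
```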

Step 5: Hyperparameter Tuning (GridSearchCV)

Instead of guessing C, kernel, and gamma, we let the computer try every combination and pick the best one automatically.

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.1, 0.001, 0.0001]
}

grid = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    verbose=1,
    n_jobs=-1
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best F1 score:", round(grid.best_score_, 4))

Line-by-line

  • param_grid — A dictionary: 4 values of C × 2 kernels × 3 gamma values = 24 combinations.
  • cv=5 — 5-fold cross-validation: splits training data into 5 chunks, trains on 4, tests on 1, rotates.
  • scoring='f1' — We optimize for F1-score, which balances precision and recall (important for imbalanced data).
  • n_jobs=-1 — Use all CPU cores to run in parallel. Much faster!
  • best_params_ — The winning combination. Typically C=10, kernel='rbf', gamma=0.1 for this dataset.
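Beyond `best_params_`, the full scoreboard lives in `cv_results_`. A small self-contained sketch (synthetic data and a reduced grid so it runs in seconds; the pattern is identical on the real grid):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset standing in for the churn features
X, y = make_classification(n_samples=200, random_state=42)

grid = GridSearchCV(SVC(random_state=42),
                    {'C': [0.1, 1], 'kernel': ['linear', 'rbf']},
                    cv=3, scoring='f1')
grid.fit(X, y)

# Every combination with its mean cross-validated F1, best first
results = pd.DataFrame(grid.cv_results_)
print(results[['param_C', 'param_kernel', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))
```

One subtlety worth knowing: the linear kernel ignores gamma, so in the full 24-combination grid the linear fits are repeated across the three gamma values.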

GridSearch Heatmap

RBF Kernel — F1 Scores (★ = best):

             C=0.1   C=1    C=10    C=100
  γ=0.1      0.72    0.84   0.91 ★  0.90
  γ=0.001    0.60    0.71   0.82    0.83
  γ=0.0001   0.45    0.58   0.70    0.71

🧒 ELI5: What is GridSearchCV?

Imagine you're baking cookies. You try 4 oven temperatures × 3 baking times and taste each batch. The combo that tastes best is your "best params." GridSearch does the same thing — it trains an SVM for every combination and scores each one, then returns the winner!

🎯 Analogy

C is how strict the teacher is (high C = "zero tolerance for mistakes"). gamma is how closely the model looks at each data point (high gamma = "examines every detail with a magnifying glass"). We need the right balance — strict enough to be accurate, but not so strict that it memorizes the training data.

Step 6: Final Model & Evaluation

Now we train the final model using the best hyperparameters found by GridSearchCV, and thoroughly evaluate its performance.

# Train final model with best params
best_svm = grid.best_estimator_
y_final = best_svm.predict(X_test)

print("Final Accuracy:", round(accuracy_score(y_test, y_final), 4))
print(classification_report(y_test, y_final))

Confusion Matrix

cm = confusion_matrix(y_test, y_final)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples',
           xticklabels=['Stayed','Churned'],
           yticklabels=['Stayed','Churned'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix — Final SVM')
plt.show()

Confusion Matrix Cell Values

                     Predicted: Stayed   Predicted: Churned
  Actual: Stayed     TN = 1598           FP = 107
  Actual: Churned    FN = 42             TP = 279

ROC Curve

from sklearn.metrics import roc_curve, auc

# Need decision_function for ROC
y_scores = best_svm.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, color='#7c3aed', lw=2,
        label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0,1], [0,1], '--', color='#94a3b8')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve — Final SVM Model')
plt.legend()
plt.show()

What each line does

  • decision_function — Returns a continuous score (not just 0/1). Positive = more likely churned.
  • roc_curve — Computes the trade-off: as we lower the threshold, we catch more churners (higher TPR) but also make more false alarms (higher FPR).
  • AUC — Area Under the Curve. 1.0 = perfect, 0.5 = random guessing. We aim for ≥ 0.90.
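AUC can also be computed in one call with `roc_auc_score`, which takes the same continuous scores. A toy example to build intuition for what the number measures, namely how well the scores rank churners above stayers:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# decision_function-style scores: higher should mean 'more likely churned'
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([-1.2, 0.3, 0.8, 2.1])

# Every churner outscores every stayer, so the ranking is perfect
print(roc_auc_score(y_true, y_scores))  # 1.0
```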

🧒 ELI5: Confusion Matrix

Imagine a fire alarm. True Positive = alarm rings and there IS a fire (good!). False Positive = alarm rings but no fire (annoying). False Negative = fire but NO alarm (dangerous!). True Negative = no alarm, no fire (all good). We want lots of TP/TN and very few FP/FN.
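The four cell counts translate directly into the precision, recall, and F1 numbers in the classification report. A quick check using the cell values from the confusion matrix above as an example:

```python
# Cell counts from the confusion matrix
TN, FP, FN, TP = 1598, 107, 42, 279

precision = TP / (TP + FP)  # of predicted churners, how many really churned
recall = TP / (TP + FN)     # of real churners, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.72 0.87 0.79
```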

Feature Importance Discussion

Which features drove churn predictions?

  • Total_Trans_Ct — Customers with fewer transactions are far more likely to churn. This was the #1 predictor.
  • Total_Trans_Amt — Lower spending = higher churn risk.
  • Total_Revolving_Bal — Customers who carry $0 revolving balance aren't really using the card.
  • Contacts_Count_12_mon — More calls to the bank = frustrated customer.
  • Avg_Transaction_Value — Our engineered feature! Lower value per transaction signals disengagement.

🎯 Analogy

Think of churn prediction like a doctor's check-up. The confusion matrix is your test result sheet: are you catching real diseases (TP) without scaring healthy people (FP)? The ROC curve shows how well the "test" performs overall — a bigger area under the curve means a more reliable diagnostic tool.

⚠️ Imbalanced Data Warning

Only ~16% of customers churned. Accuracy alone can be misleading — a model that always says "Stayed" gets 84% accuracy! That's why we optimized for F1-score and checked recall specifically for the churned class. Consider SMOTE or class_weight='balanced' for even better results.
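Of the two options above, `class_weight='balanced'` is the one-line change. A sketch on synthetic imbalanced data (the `make_classification` call here is a stand-in, not the walkthrough's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced stand-in data: ~16% positives, like the churn ratio
X, y = make_classification(n_samples=400, weights=[0.84], random_state=42)

# class_weight='balanced' re-weights C by inverse class frequency,
# so errors on the rare (churned) class cost proportionally more
clf = SVC(kernel='rbf', class_weight='balanced', random_state=42).fit(X, y)
print(clf.score(X, y))
```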

Summary: What We Learned

Full Pipeline Recap

  • Step 1 — Loaded 10,127 bank customers with 23 columns. Explored with head/info/describe.
  • Step 2 — Visualized distributions with boxplots and found correlated features with a heatmap.
  • Step 3 — One-hot encoded categoricals, engineered Avg_Transaction_Value, dropped redundant columns, scaled with MinMaxScaler.
  • Step 4 — Trained Linear SVM (~93%) and RBF SVM (~95%). RBF found non-linear patterns.
  • Step 5 — GridSearchCV tested 24 combos, found best: C=10, kernel='rbf', gamma=0.1.
  • Step 6 — Final model: ~95.6% accuracy, 0.87 recall on churners, AUC ≈ 0.98.

🧒 One-sentence ELI5

We taught a computer to spot unhappy bank customers by looking at how often they use their credit card, how much they spend, and how often they call the bank — and it gets it right about 96 out of 100 times!