Master the algorithms winning Kaggle competitions! XGBoost, LightGBM, and friends - ensemble methods that combine weak learners into powerful predictors.
An ensemble method combines multiple models to create a stronger predictor. Just like a team of experts is better than one expert alone!
In a jury, 12 people vote on a verdict. One person might be biased or mistaken, but the group decision is usually more reliable.
Ensemble methods work the same way: Multiple models "vote" on predictions, reducing individual errors!
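The "voting" idea maps directly onto scikit-learn's `VotingClassifier`. Below is a minimal sketch of the jury analogy; the synthetic dataset and the three base models are illustrative choices, not part of the lesson's own examples.

```python
# Minimal "jury" sketch: three different models vote on each prediction.
# The dataset and base models are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

jury = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier(random_state=42)),
    ],
    voting='hard',  # majority vote, like a jury
)
jury.fit(X_train, y_train)
print(f"Voting ensemble accuracy: {jury.score(X_test, y_test):.2%}")
```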
**Bagging**
- Train in PARALLEL
- Each model sees a different random sample of the data
- Final prediction: average or vote of all predictions
- Example: Random Forest

**Boosting**
- Train SEQUENTIALLY
- Each model focuses on the previous model's mistakes
- Final prediction: weighted sum of all predictions
- Examples: XGBoost, LightGBM, AdaBoost
Bagging reduces variance by training multiple models on different random subsets of data.
```python
# Bagging Classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with Decision Trees as base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base model
    n_estimators=100,                    # Number of models
    max_samples=0.8,                     # 80% of data per model
    max_features=0.8,                    # 80% of features per model
    bootstrap=True,                      # Sample with replacement
    random_state=42,
    n_jobs=-1                            # Use all CPUs
)
bagging_model.fit(X_train, y_train)
print(f"Bagging Accuracy: {bagging_model.score(X_test, y_test):.2%}")

# Note: Random Forest IS a special case of Bagging
# where base estimators are Decision Trees with random feature selection
```
Bagging: Can use ANY base model (decision trees, SVMs, etc.)
Random Forest: Bagging specifically with Decision Trees + random feature selection at each split
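To make the distinction concrete, here is a hedged sketch of bagging wrapped around a non-tree base model (logistic regression chosen purely for illustration), something a Random Forest cannot do. It assumes `X_train`, `y_train`, `X_test`, `y_test` exist as in the examples above.

```python
# Bagging is not tied to trees: here the base model is logistic regression.
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

bagged_logreg = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000),  # any base model works
    n_estimators=50,
    max_samples=0.8,       # each model sees a random 80% of the rows
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
)
bagged_logreg.fit(X_train, y_train)
print(f"Bagged LogisticRegression accuracy: {bagged_logreg.score(X_test, y_test):.2%}")
```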
Boosting builds models sequentially, where each new model focuses on correcting the errors of previous models.
Boosting = train one weak model, then train the next one to fix the mistakes of the first, and keep adding models that correct the remaining errors; the final prediction is a weighted combination of all of them. Unlike bagging (e.g. Random Forest), models are built one after another, not in parallel.
Imagine a student taking practice tests: after each test, they review the questions they got wrong and focus on exactly those topics before the next test.
Boosting works the same way! Each model pays more attention to samples the previous models got wrong.
Idea: Increase weights on misclassified samples
```python
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier

adaboost = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)
adaboost.fit(X_train, y_train)
print(f"AdaBoost Accuracy: {adaboost.score(X_test, y_test):.2%}")
```
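To see what "increase weights on misclassified samples" means in practice, here is a hedged sketch of a single boosting round done by hand. The update formulas follow the classic AdaBoost recipe; the variable names are illustrative, and `X_train`, `y_train` are assumed to exist as above with a weak learner error strictly between 0 and 1.

```python
# One AdaBoost round by hand: misclassified samples get larger weights,
# so the next weak learner pays more attention to them.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

y_tr = np.asarray(y_train)
n = len(y_tr)
weights = np.full(n, 1 / n)                 # start with uniform weights

stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_tr, sample_weight=weights)
pred = stump.predict(X_train)

err = np.sum(weights[pred != y_tr])          # weighted error rate (assume 0 < err < 1)
alpha = 0.5 * np.log((1 - err) / err)        # model weight (classic AdaBoost)

# Increase weights where the stump was wrong, decrease where it was right
weights *= np.exp(alpha * (pred != y_tr) - alpha * (pred == y_tr))
weights /= weights.sum()                     # renormalize
print(f"Round 1 weighted error: {err:.3f}, model weight alpha: {alpha:.3f}")
```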
Idea: Each model predicts the RESIDUAL (error) of previous models
```python
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_model.fit(X_train, y_train)
print(f"Gradient Boosting Accuracy: {gb_model.score(X_test, y_test):.2%}")
```
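The "predict the residual" idea can be shown by hand with two small regression trees. This is a hedged sketch on a synthetic regression target (not part of the lesson's dataset): the second tree is trained on what the first tree got wrong, and adding its correction shrinks the error. Real libraries scale the correction by a learning rate and repeat for hundreds of trees.

```python
# Gradient boosting intuition by hand: tree 2 fits the residuals of tree 1.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

Xr, yr = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

tree1 = DecisionTreeRegressor(max_depth=3, random_state=42).fit(Xr, yr)
residuals = yr - tree1.predict(Xr)           # what tree 1 got wrong

tree2 = DecisionTreeRegressor(max_depth=3, random_state=42).fit(Xr, residuals)
combined = tree1.predict(Xr) + tree2.predict(Xr)   # add the correction

print("MSE with one tree:   ", mean_squared_error(yr, tree1.predict(Xr)))
print("MSE after correction:", mean_squared_error(yr, combined))
```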
These are optimized versions of Gradient Boosting that are faster, more accurate, and used in most Kaggle winning solutions!
Why it's better:
- Built-in L1 and L2 regularization to reduce overfitting
- Handles missing values natively
- Parallelized, cache-aware tree construction (much faster training)
- Built-in support for early stopping and cross-validation
```python
# XGBoost - The Kaggle King!
# pip install xgboost
import matplotlib.pyplot as plt
import xgboost as xgb

xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,          # 80% of rows per tree
    colsample_bytree=0.8,   # 80% of features per tree
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    eval_metric='logloss',
    random_state=42
)
xgb_model.fit(X_train, y_train)
print(f"XGBoost Accuracy: {xgb_model.score(X_test, y_test):.2%}")

# Feature importance
xgb.plot_importance(xgb_model, max_num_features=10)
plt.show()
```
Why it's faster:
- Histogram-based splitting: continuous features are bucketed into bins
- Leaf-wise (best-first) tree growth instead of level-wise
- Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) shrink the data seen at each step
```python
# LightGBM - The Speed Demon!
# pip install lightgbm
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(
    n_estimators=100,
    max_depth=-1,           # No limit (use num_leaves instead)
    num_leaves=31,          # Control tree complexity
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=42,
    verbose=-1              # Suppress warnings
)
lgb_model.fit(X_train, y_train)
print(f"LightGBM Accuracy: {lgb_model.score(X_test, y_test):.2%}")
```
| Algorithm | Speed | Accuracy | Best For |
|---|---|---|---|
| Random Forest | Fast (parallel) | Good | Quick baseline, interpretability |
| XGBoost | Medium | Excellent | Structured/tabular data, competitions |
| LightGBM | Very Fast | Excellent | Large datasets, speed-critical |
| CatBoost | Fast | Excellent | Lots of categorical features |
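CatBoost appears in the table but has no example above, so here is a hedged sketch of its basic usage. The tiny DataFrame, column names, and labels are made up purely for illustration; the point is that categorical columns are passed as-is via `cat_features`, with no manual encoding.

```python
# CatBoost sketch: categorical columns are handled natively via cat_features.
# pip install catboost
import pandas as pd
from catboost import CatBoostClassifier

# Tiny illustrative dataset with one categorical column (purely made up)
df = pd.DataFrame({
    'city':  ['Paris', 'Lyon', 'Paris', 'Nice', 'Lyon', 'Nice'] * 20,
    'age':   [25, 32, 47, 51, 38, 29] * 20,
    'spend': [120.0, 80.5, 210.0, 95.0, 60.0, 150.0] * 20,
})
y_cat = [0, 1, 0, 1, 1, 0] * 20

cat_model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=4,
    verbose=0,              # silence per-iteration logging
    random_seed=42,
)
cat_model.fit(df, y_cat, cat_features=['city'])   # no manual encoding needed
print(f"CatBoost training accuracy: {cat_model.score(df, y_cat):.2%}")
```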
| Parameter | Effect | Typical Range |
|---|---|---|
| n_estimators | Number of trees | 100-1000 (more trees usually help, but slower and with diminishing returns) |
| learning_rate | Contribution of each tree | 0.01-0.3 (lower = more trees needed) |
| max_depth | Tree depth | 3-10 (deeper = more complex) |
| subsample | Fraction of rows per tree | 0.6-1.0 |
| colsample_bytree | Fraction of features per tree | 0.6-1.0 |
```python
# Hyperparameter Tuning with Optuna (fast & modern)
# pip install optuna
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = xgb.XGBClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

# Run optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("Best parameters:", study.best_params)
print("Best CV score:", study.best_value)
```
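As the parameter table notes, a lower learning_rate needs more trees. In practice you can set n_estimators deliberately high and let early stopping pick the right number. Here is a hedged sketch with LightGBM; the validation split and the patience of 50 rounds are illustrative choices, and `X_train`, `y_train` are assumed to exist as above.

```python
# Early stopping sketch: set n_estimators high, stop when the validation
# score stops improving for 50 consecutive rounds.
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

es_model = lgb.LGBMClassifier(
    n_estimators=2000,        # deliberately large upper bound
    learning_rate=0.05,       # lower rate -> more trees needed
    num_leaves=31,
    random_state=42,
    verbose=-1,
)
es_model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("Best iteration found:", es_model.best_iteration_)
```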
In one sentence: how is boosting different from bagging (e.g. Random Forest) in the way models are trained and combined?
| Concept | Key Idea |
|---|---|
| Bagging | Train many models in parallel on random subsets, then vote/average |
| Boosting | Train models sequentially, each fixing previous mistakes |
| XGBoost | Optimized gradient boosting with regularization - Kaggle favorite |
| LightGBM | Fastest boosting, great for large data |
| learning_rate | Lower = slower learning, need more trees, but better generalization |