Master the algorithms winning Kaggle competitions! XGBoost, LightGBM, and friends - ensemble methods that combine weak learners into powerful predictors.
An ensemble method combines multiple models to create a stronger predictor. Just like a team of experts is better than one expert alone!
In a jury, 12 people vote on a verdict. One person might be biased or mistaken, but the group decision is usually more reliable.
Ensemble methods work the same way: Multiple models "vote" on predictions, reducing individual errors!
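The "voting" idea maps directly onto scikit-learn's `VotingClassifier`. Below is a minimal sketch of the jury analogy; the synthetic dataset and the three base models are illustrative choices, not part of the lesson's own examples.

```python
# Minimal "jury" sketch: three different models vote on each prediction.
# The dataset and base models are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

jury = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier(random_state=42)),
    ],
    voting='hard',  # majority vote, like a jury
)
jury.fit(X_train, y_train)
print(f"Voting ensemble accuracy: {jury.score(X_test, y_test):.2%}")
```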
**Bagging**
- Train in PARALLEL
- Each model sees a different random sample of the data
- Final prediction: average or vote of all predictions
- Example: Random Forest

**Boosting**
- Train SEQUENTIALLY
- Each model focuses on the previous model's mistakes
- Final prediction: weighted sum of all predictions
- Examples: XGBoost, LightGBM, AdaBoost
Bagging reduces variance by training multiple models on different random subsets of data.
```python
# Bagging Classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with Decision Trees as base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base model
    n_estimators=100,                    # Number of models
    max_samples=0.8,                     # 80% of data per model
    max_features=0.8,                    # 80% of features per model
    bootstrap=True,                      # Sample with replacement
    random_state=42,
    n_jobs=-1                            # Use all CPUs
)
bagging_model.fit(X_train, y_train)
print(f"Bagging Accuracy: {bagging_model.score(X_test, y_test):.2%}")

# Note: Random Forest IS a special case of Bagging
# where base estimators are Decision Trees with random feature selection
```
Bagging: Can use ANY base model (decision trees, SVMs, etc.)
Random Forest: Bagging specifically with Decision Trees + random feature selection at each split
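To make the distinction concrete, here is a hedged sketch of bagging wrapped around a non-tree base model (logistic regression chosen purely for illustration), something a Random Forest cannot do. It assumes `X_train`, `y_train`, `X_test`, `y_test` exist as in the examples above.

```python
# Bagging is not tied to trees: here the base model is logistic regression.
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

bagged_logreg = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000),  # any base model works
    n_estimators=50,
    max_samples=0.8,       # each model sees a random 80% of the rows
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
)
bagged_logreg.fit(X_train, y_train)
print(f"Bagged LogisticRegression accuracy: {bagged_logreg.score(X_test, y_test):.2%}")
```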
Boosting builds models sequentially, where each new model focuses on correcting the errors of previous models.
Boosting = train one weak model, then train the next one to fix the mistakes of the first, and keep adding models that correct the remaining errors; the final prediction is a weighted combination of all of them. Unlike bagging (e.g. Random Forest), models are built one after another, not in parallel.
Imagine a student taking practice tests: after each test, they review the questions they got wrong and focus on exactly those topics before the next test.
Boosting works the same way! Each model pays more attention to samples the previous models got wrong.
Idea: Increase weights on misclassified samples
```python
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier

adaboost = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)
adaboost.fit(X_train, y_train)
print(f"AdaBoost Accuracy: {adaboost.score(X_test, y_test):.2%}")
```
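To see what "increase weights on misclassified samples" means in practice, here is a hedged sketch of a single boosting round done by hand. The update formulas follow the classic AdaBoost recipe; the variable names are illustrative, and `X_train`, `y_train` are assumed to exist as above with a weak learner error strictly between 0 and 1.

```python
# One AdaBoost round by hand: misclassified samples get larger weights,
# so the next weak learner pays more attention to them.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

y_tr = np.asarray(y_train)
n = len(y_tr)
weights = np.full(n, 1 / n)                 # start with uniform weights

stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_tr, sample_weight=weights)
pred = stump.predict(X_train)

err = np.sum(weights[pred != y_tr])          # weighted error rate (assume 0 < err < 1)
alpha = 0.5 * np.log((1 - err) / err)        # model weight (classic AdaBoost)

# Increase weights where the stump was wrong, decrease where it was right
weights *= np.exp(alpha * (pred != y_tr) - alpha * (pred == y_tr))
weights /= weights.sum()                     # renormalize
print(f"Round 1 weighted error: {err:.3f}, model weight alpha: {alpha:.3f}")
```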
Idea: Each model predicts the RESIDUAL (error) of previous models
```python
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_model.fit(X_train, y_train)
print(f"Gradient Boosting Accuracy: {gb_model.score(X_test, y_test):.2%}")
```
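The "predict the residual" idea can be shown by hand with two small regression trees. This is a hedged sketch on a synthetic regression target (not part of the lesson's dataset): the second tree is trained on what the first tree got wrong, and adding its correction shrinks the error. Real libraries scale the correction by a learning rate and repeat for hundreds of trees.

```python
# Gradient boosting intuition by hand: tree 2 fits the residuals of tree 1.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

Xr, yr = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

tree1 = DecisionTreeRegressor(max_depth=3, random_state=42).fit(Xr, yr)
residuals = yr - tree1.predict(Xr)           # what tree 1 got wrong

tree2 = DecisionTreeRegressor(max_depth=3, random_state=42).fit(Xr, residuals)
combined = tree1.predict(Xr) + tree2.predict(Xr)   # add the correction

print("MSE with one tree:   ", mean_squared_error(yr, tree1.predict(Xr)))
print("MSE after correction:", mean_squared_error(yr, combined))
```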
These are optimized versions of Gradient Boosting that are faster, more accurate, and used in most Kaggle winning solutions!
Why it's better:
- Built-in L1 and L2 regularization to reduce overfitting
- Handles missing values natively
- Parallelized, cache-aware tree construction (much faster training)
- Built-in support for early stopping and cross-validation
```python
# XGBoost - The Kaggle King!
# pip install xgboost
import matplotlib.pyplot as plt
import xgboost as xgb

xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,          # 80% of rows per tree
    colsample_bytree=0.8,   # 80% of features per tree
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    eval_metric='logloss',
    random_state=42
)
xgb_model.fit(X_train, y_train)
print(f"XGBoost Accuracy: {xgb_model.score(X_test, y_test):.2%}")

# Feature importance
xgb.plot_importance(xgb_model, max_num_features=10)
plt.show()
```
Why it's faster:
- Histogram-based splitting: continuous features are bucketed into bins
- Leaf-wise (best-first) tree growth instead of level-wise
- Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) shrink the data seen at each step
```python
# LightGBM - The Speed Demon!
# pip install lightgbm
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(
    n_estimators=100,
    max_depth=-1,           # No limit (use num_leaves instead)
    num_leaves=31,          # Control tree complexity
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=42,
    verbose=-1              # Suppress warnings
)
lgb_model.fit(X_train, y_train)
print(f"LightGBM Accuracy: {lgb_model.score(X_test, y_test):.2%}")
```
| Algorithm | Speed | Accuracy | Best For |
|---|---|---|---|
| Random Forest | Fast (parallel) | Good | Quick baseline, interpretability |
| XGBoost | Medium | Excellent | Structured/tabular data, competitions |
| LightGBM | Very Fast | Excellent | Large datasets, speed-critical |
| CatBoost | Fast | Excellent | Lots of categorical features |
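CatBoost appears in the table but has no example above, so here is a hedged sketch of its basic usage. The tiny DataFrame, column names, and labels are made up purely for illustration; the point is that categorical columns are passed as-is via `cat_features`, with no manual encoding.

```python
# CatBoost sketch: categorical columns are handled natively via cat_features.
# pip install catboost
import pandas as pd
from catboost import CatBoostClassifier

# Tiny illustrative dataset with one categorical column (purely made up)
df = pd.DataFrame({
    'city':  ['Paris', 'Lyon', 'Paris', 'Nice', 'Lyon', 'Nice'] * 20,
    'age':   [25, 32, 47, 51, 38, 29] * 20,
    'spend': [120.0, 80.5, 210.0, 95.0, 60.0, 150.0] * 20,
})
y_cat = [0, 1, 0, 1, 1, 0] * 20

cat_model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=4,
    verbose=0,              # silence per-iteration logging
    random_seed=42,
)
cat_model.fit(df, y_cat, cat_features=['city'])   # no manual encoding needed
print(f"CatBoost training accuracy: {cat_model.score(df, y_cat):.2%}")
```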
| Parameter | Effect | Typical Range |
|---|---|---|
| n_estimators | Number of trees | 100-1000 (more trees usually help, but slower and with diminishing returns) |
| learning_rate | Contribution of each tree | 0.01-0.3 (lower = more trees needed) |
| max_depth | Tree depth | 3-10 (deeper = more complex) |
| subsample | Fraction of rows per tree | 0.6-1.0 |
| colsample_bytree | Fraction of features per tree | 0.6-1.0 |
```python
# Hyperparameter Tuning with Optuna (fast & modern)
# pip install optuna
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = xgb.XGBClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

# Run optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("Best parameters:", study.best_params)
print("Best CV score:", study.best_value)
```
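As the parameter table notes, a lower learning_rate needs more trees. In practice you can set n_estimators deliberately high and let early stopping pick the right number. Here is a hedged sketch with LightGBM; the validation split and the patience of 50 rounds are illustrative choices, and `X_train`, `y_train` are assumed to exist as above.

```python
# Early stopping sketch: set n_estimators high, stop when the validation
# score stops improving for 50 consecutive rounds.
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

es_model = lgb.LGBMClassifier(
    n_estimators=2000,        # deliberately large upper bound
    learning_rate=0.05,       # lower rate -> more trees needed
    num_leaves=31,
    random_state=42,
    verbose=-1,
)
es_model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("Best iteration found:", es_model.best_iteration_)
```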
In one sentence: how is boosting different from bagging (e.g. Random Forest) in the way models are trained and combined?
| Concept | Key Idea |
|---|---|
| Bagging | Train many models in parallel on random subsets, then vote/average |
| Boosting | Train models sequentially, each fixing previous mistakes |
| XGBoost | Optimized gradient boosting with regularization - Kaggle favorite |
| LightGBM | Fastest boosting, great for large data |
| learning_rate | Lower = slower learning, need more trees, but better generalization |