Gradient Boosting

AI Bootcamp


Mork Mongkul

Motivation & Introduction

From Random Forest to Gradient Boosting

  • Random Forest builds multiple trees independently and in parallel.

  • Each tree is trained on:
    • A bootstrap sample of the data

    • A random subset of features

  • Final prediction:
    • Regression: Average of all tree predictions

    • Classification: Majority vote

  • This is called Bagging (Bootstrap Aggregating)

Multiple trees vote/average their predictions

Motivation & Introduction

The Key Question

  • Random Forest: Trees are independent

  • Each tree makes predictions without knowing what other trees predicted

  • What if we could make trees learn from each other’s mistakes?

  • This leads us to Boosting!

Can subsequent trees fix previous errors?

Motivation & Introduction

Introduction to Boosting

  • Boosting: Build trees sequentially

  • Each new tree focuses on correcting errors made by previous trees

  • Key differences from Random Forest:
    • Trees are built one after another (not in parallel)

    • Each tree learns from previous mistakes

    • Typically uses shallow trees (weak learners)

Green lines show errors that the next tree will try to fix

Gradient Boosting: The Algorithm

Gradient Boosting

Sequential Error Correction

  • Step 1: Start with an initial prediction
    • Usually the mean (regression) or log-odds (classification)

  • Step 2: Calculate residuals (errors)
    • \(\text{residual}_i = y_i - \hat{y}_i\)

  • Step 3: Train a tree to predict these residuals

  • Step 4: Update predictions:
    • \(\hat{y}_{\text{new}} = \hat{y}_{\text{old}} + \color{red}{\alpha} \times \text{tree prediction}\)

    • \(\color{red}{\alpha}\) is the learning rate

  • Step 5: Repeat Steps 2-4 for \(M\) trees
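Steps 1-5 can be sketched from scratch for squared loss. This is a minimal illustration, not a production implementation: depth-1 "stumps" stand in for the regression trees, and the helper names (`fit_stump`, `gradient_boost`) are our own.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a depth-1 tree (stump): the threshold split minimizing
    squared error against the residuals; each leaf predicts its mean."""
    best = None
    for t in np.unique(x)[:-1]:                    # candidate thresholds
        left, right = residuals[x <= t], residuals[x > t]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, n_trees=100, alpha=0.1):
    """Steps 1-5 for squared loss."""
    pred = np.full_like(y, y.mean(), dtype=float)  # Step 1: initial prediction
    for _ in range(n_trees):                       # Step 5: repeat for M trees
        residuals = y - pred                       # Step 2: current errors
        stump = fit_stump(x, residuals)            # Step 3: tree on residuals
        pred = pred + alpha * stump(x)             # Step 4: shrunken update
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(gradient_boost(x, y).round(2))  # close to [2. 4. 6. 8.]
```

With enough small steps, the sum of stumps drives the residuals toward zero on the training data.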

Gradient Boosting

Mathematical Formulation

  • Goal: Find function \(F(x)\) that minimizes the loss \(L(y, F(x))\)

  • Initialize: \[F_0(x) = \arg\min_{\gamma}\sum_{i=1}^n L(y_i, \gamma)\]
    • For squared loss: \(F_0(x) = \bar{y}\) (mean of targets)

  • For \(m = 1\) to \(M\):

  1. Compute pseudo-residuals (negative gradient): \[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\]
     • For squared loss: \(r_{im} = y_i - F_{m-1}(x_i)\) (the actual residuals)
  2. Fit a regression tree \(h_m(x)\) to predict \(r_{im}\)
  3. Update the model: \[F_m(x) = F_{m-1}(x) + \color{red}{\alpha} \cdot h_m(x)\] where \(\color{red}{\alpha}\) is the learning rate
  • Final prediction: \(F(x) = F_M(x)\)

Gradient Boosting

Why “Gradient”?

  • We’re performing gradient descent in function space!

  • Instead of updating parameters \(\theta\): \[\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L\]

  • We update the function \(F(x)\): \[F_m(x) = F_{m-1}(x) - \alpha \nabla_F L\]

  • The tree \(h_m(x)\) approximates the negative gradient: \[h_m(x) \approx -\nabla_F L(y, F_{m-1}(x))\]

  • This is why we call it Gradient Boosting!

Each tree takes a step in the direction that reduces loss
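For squared loss, the link between residuals and gradients is a one-line computation. Writing the loss with the conventional \(\tfrac{1}{2}\) factor (which cancels the 2 from differentiation; without it the gradient simply picks up a factor of 2):

\[L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Longrightarrow\quad -\frac{\partial L}{\partial F} = y - F\]

So for squared loss, fitting a tree to the residuals is exactly fitting it to the negative gradient; for other losses (e.g. log loss), the pseudo-residuals are no longer the plain residuals.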

Gradient Boosting

A Simple Example (Regression)

Data:

| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |

Step 1: Initialize \[F_0(x) = \bar{y} = \frac{2+4+6+8}{4} = 5\]

Step 2: Calculate residuals (iteration 1)

| \(x\) | \(y\) | \(F_0(x)\) | \(r_1 = y - F_0(x)\) |
|---|---|---|---|
| 1 | 2 | 5 | -3 |
| 2 | 4 | 5 | -1 |
| 3 | 6 | 5 | 1 |
| 4 | 8 | 5 | 3 |

Step 3: Fit tree \(h_1(x)\) to predict residuals

Suppose the tree learns: \[h_1(x) = \begin{cases} -2 & \text{if } x \leq 2 \\ 2 & \text{if } x > 2 \end{cases}\]

Step 4: Update predictions (with \(\alpha=0.5\)) \[F_1(x) = F_0(x) + 0.5 \times h_1(x)\]

| \(x\) | \(F_0(x)\) | \(h_1(x)\) | \(F_1(x)\) |
|---|---|---|---|
| 1 | 5 | -2 | 4 |
| 2 | 5 | -2 | 4 |
| 3 | 5 | 2 | 6 |
| 4 | 5 | 2 | 6 |

New residuals:

| \(x\) | \(y\) | \(F_1(x)\) | \(r_2 = y - F_1(x)\) |
|---|---|---|---|
| 1 | 2 | 4 | -2 |
| 2 | 4 | 4 | 0 |
| 3 | 6 | 6 | 0 |
| 4 | 8 | 6 | 2 |

Continue for \(M\) iterations…
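The worked iteration above can be checked in a few lines of NumPy, with the stump \(h_1\) hard-coded exactly as given on the slide:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

F0 = np.full(4, y.mean())        # Step 1: F0(x) = mean of y = 5
r1 = y - F0                      # Step 2: residuals [-3, -1, 1, 3]
h1 = np.where(x <= 2, -2, 2)     # Step 3: the stump from the slide
F1 = F0 + 0.5 * h1               # Step 4: update with alpha = 0.5
r2 = y - F1                      # new residuals

print(F1)  # predictions move toward y: [4, 4, 6, 6]
print(r2)  # residuals shrink: [-2, 0, 0, 2]
```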

Key Hyperparameters

Controlling the Learning Process

  1. Number of trees (n_estimators)
    • More trees \(\Rightarrow\) better fit (but slower, risk of overfitting)

    • Typical values: 100-1000

  2. Learning rate (learning_rate)
    • Controls how much each tree contributes

    • Smaller \(\alpha\) \(\Rightarrow\) more trees needed, but better generalization

    • Typical values: 0.01-0.3

    • Trade-off: \(\downarrow \alpha\) requires \(\uparrow M\) trees

  3. Tree depth (max_depth)
    • Boosting works well with shallow trees

    • Typical values: 3-8 (much shallower than Random Forest!)

Key Hyperparameters

Additional Important Parameters

  1. Subsampling (subsample)
    • Fraction of samples used to fit each tree

    • Values < 1.0 introduce randomness (like Random Forest)

    • Typical values: 0.5-1.0

    • Called Stochastic Gradient Boosting when < 1.0

  2. Feature subsampling
    • max_features: Features considered per split

    • colsample_bytree: Features per tree (XGBoost)

    • Reduces correlation between trees

  3. Regularization
    • min_samples_leaf, min_samples_split

    • Prevents overfitting by limiting tree complexity

Typical hyperparameter combinations:

| Purpose | n_estimators | learning_rate | max_depth |
|---|---|---|---|
| Quick baseline | 100 | 0.1 | 3 |
| Better performance | 500 | 0.05 | 4-5 |
| Competition-grade | 1000+ | 0.01-0.03 | 5-8 |

Important relationships:

  • \(\downarrow\) learning_rate \(\Rightarrow\) \(\uparrow\) n_estimators needed

  • Smaller trees (max_depth) are usually better in boosting

  • Use cross-validation to find optimal values!
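The learning_rate/n_estimators trade-off can be made concrete with an idealized back-of-envelope model: if each tree fit the residuals perfectly, an update with learning rate \(\alpha\) would leave a \((1-\alpha)\) fraction of the error, so \((1-\alpha)^M\) remains after \(M\) trees. The helper below (`trees_needed`, our own name) counts trees needed to reach 1% of the initial error under this simplification:

```python
import math

def trees_needed(alpha, target=0.01):
    """Trees required for (1 - alpha)**M <= target, i.e. to shrink the
    residual to `target` of its starting size in the idealized model."""
    return math.ceil(math.log(target) / math.log(1 - alpha))

for alpha in (0.3, 0.1, 0.03, 0.01):
    print(f"alpha={alpha:<5} -> ~{trees_needed(alpha)} trees")
# alpha=0.3 -> ~13, alpha=0.1 -> ~44, alpha=0.03 -> ~152, alpha=0.01 -> ~459
```

Real trees only approximate the residuals, so actual counts are higher, but the inverse relationship (\(\downarrow \alpha \Rightarrow \uparrow M\), roughly \(M \propto 1/\alpha\)) holds.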

Gradient Boosting Variants

Three Main Implementations

1. Sklearn GradientBoosting

Pros:

  • Built into scikit-learn
  • Simple API, consistent with sklearn
  • Good for learning the algorithm

Cons:

  • Slower than XGBoost/LightGBM
  • Fewer features
  • Not ideal for large datasets

When to use:

  • Teaching/learning
  • Small to medium datasets
  • When you need sklearn integration

2. XGBoost

Pros:

  • Industry standard
  • Very fast (parallel processing)
  • Handles missing values
  • Advanced regularization
  • Dominant in Kaggle competitions

Cons:

  • Requires separate installation
  • More hyperparameters to tune

When to use:

  • Production systems
  • Medium to large datasets
  • When you need best performance

3. LightGBM

Pros:

  • Extremely fast
  • Memory efficient
  • Great for large datasets (>10K rows)
  • Handles categorical features natively
  • Leaf-wise tree growth (vs level-wise)

Cons:

  • Can overfit on small datasets
  • Requires careful tuning

When to use:

  • Large datasets
  • When speed is critical
  • Limited computational resources

Gradient Boosting Variants

Quick Comparison

| Feature | Sklearn GB | XGBoost | LightGBM |
|---|---|---|---|
| Speed | Slow | Fast | Very Fast |
| Memory | High | Medium | Low |
| Missing values | Manual | Automatic | Automatic |
| Categorical features | Manual encoding | Manual encoding | Native support |
| Tree growth | Level-wise | Level-wise | Leaf-wise |
| Installation | Built-in | pip install xgboost | pip install lightgbm |
| Best for | Learning | Production | Big data |
| Kaggle popularity | Low | Very High | High |

Gradient Boosting in Action

Implementation with XGBoost

Dataset: Heart Disease Dataset

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import xgboost as xgb

# Load and prepare data
path = "../Data"
data = pd.read_csv(path + "/heart.csv")
data_no_dup = data.drop_duplicates()

X = data_no_dup.iloc[:, :-1]
y = data_no_dup.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Gradient Boosting in Action

XGBoost with Hyperparameter Tuning

# Define XGBoost classifier
xgb_clf = xgb.XGBClassifier(
    random_state=42,
    eval_metric='logloss'  # Suppress warning
)

# Hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# GridSearch with cross-validation
grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    cv=10,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Best model
best_xgb = grid_search.best_estimator_
y_pred = best_xgb.predict(X_test)

print(f"Best hyperparameters: {grid_search.best_params_}")
Fitting 10 folds for each of 108 candidates, totalling 1080 fits
Best hyperparameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100, 'subsample': 1.0}

Gradient Boosting in Action

Results Comparison

# Create results dataframe
results_df = pd.DataFrame({
    'Accuracy': [
        accuracy_score(y_test, y_pred),
        0.885246,  # 16-NN (from Decision Tree slides)
        grid_search.best_score_  # CV score
    ],
    'Precision': [
        precision_score(y_test, y_pred),
        0.882353,
        np.nan
    ],
    'Recall': [
        recall_score(y_test, y_pred),
        0.909091,
        np.nan
    ],
    'F1-score': [
        f1_score(y_test, y_pred),
        0.909091,
        np.nan
    ]
}, index=['Gradient Boosting (Test)', '16-NN (Test)', 'Gradient Boosting (CV)'])

results_df
| | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Gradient Boosting (Test) | 0.737705 | 0.757576 | 0.757576 | 0.757576 |
| 16-NN (Test) | 0.885246 | 0.882353 | 0.909091 | 0.909091 |
| Gradient Boosting (CV) | 0.838167 | NaN | NaN | NaN |

Important insights from these results:

  • Small dataset effect: With only ~300 samples, simpler models (K-NN) can outperform complex ensembles

  • When Gradient Boosting excels: Typically on datasets with 1000+ samples and complex non-linear patterns

  • Always compare multiple models: No single algorithm is best for all problems

  • CV score is crucial: Use cross-validation to get reliable performance estimates

Gradient Boosting in Action

Feature Importance

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 5 most important features:")
print(feature_importance.head())
Top 5 most important features:
   feature  importance
2       cp    0.283735
12    thal    0.109298
11      ca    0.104121
8    exang    0.086912
1      sex    0.061423

Gradient Boosting in Action

Learning Curves

# Train models with different numbers of trees to see the learning curve
import plotly.graph_objects as go
from IPython.display import HTML

train_scores = []
test_scores = []
n_estimators_range = range(10, 301, 10)

for n in n_estimators_range:
    model = xgb.XGBClassifier(
        n_estimators=n,
        learning_rate=grid_search.best_params_['learning_rate'],
        max_depth=grid_search.best_params_['max_depth'],
        subsample=grid_search.best_params_['subsample'],
        colsample_bytree=grid_search.best_params_['colsample_bytree'],
        random_state=42,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, model.predict(X_train)))
    test_scores.append(accuracy_score(y_test, model.predict(X_test)))

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(n_estimators_range),
    y=train_scores,
    mode='lines',
    name='Train accuracy',
    line=dict(color='blue', width=2)
))
fig.add_trace(go.Scatter(
    x=list(n_estimators_range),
    y=test_scores,
    mode='lines',
    name='Test accuracy',
    line=dict(color='red', width=2)
))

fig.update_layout(
    title='Learning Curve: Performance vs Number of Trees',
    xaxis_title='Number of Trees',
    yaxis_title='Accuracy',
    width=700,
    height=400,
    margin=dict(l=0, r=0, t=40, b=0)
)

HTML(fig.to_html(include_plotlyjs='cdn'))
  • Note how accuracy improves with more trees, then plateaus

  • Watch for the gap between train and test (overfitting signal)

Practical Tips & Best Practices

When to Use Gradient Boosting

✓ Good use cases:

  • Tabular data (structured datasets)
    • Customer churn prediction
    • Credit risk modeling
    • House price prediction
    • Medical diagnosis
  • Medium to large datasets (1K+ rows)
  • When accuracy is critical
    • Kaggle competitions
    • Production ML systems
  • Mixed feature types
    • Numerical + categorical features
    • Missing data (XGBoost handles it well)
✗ Avoid when:

  • Very small datasets (<100 rows)
    • High risk of overfitting
    • Simpler models work better
  • Image/audio/video data
    • Use deep learning instead
    • CNN/RNN are more suitable
  • When interpretability is crucial
    • Ensemble of trees is hard to explain
    • Use linear models or single trees
  • Real-time inference with strict latency
    • 100s of trees can be slow
    • Consider simpler models

Practical Tips & Best Practices

Common Pitfalls and How to Avoid Them

1. Overfitting

Symptoms:

  • High train accuracy, low test accuracy
  • Large gap in learning curves

Solutions:

  • ↓ learning_rate + ↑ n_estimators
  • ↑ regularization (min_child_weight, gamma)
  • Use subsample < 1.0
  • Reduce max_depth
  • Early stopping with validation set

2. Too slow training

Solutions:

  • Use LightGBM instead
  • Reduce n_estimators
  • Use subsample and colsample_bytree
  • Parallel processing (n_jobs=-1)

3. Poor hyperparameter choices

Solutions:

  • Start with defaults, then tune
  • Use RandomizedSearchCV for large grids
  • Focus on: n_estimators, learning_rate, max_depth
  • Monitor validation performance

4. Forgetting to scale (XGBoost)

Good news:

  • Tree-based models don’t need feature scaling!
  • But categorical encoding is still needed
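Why trees don't need scaling: a split only compares a feature to a threshold, so any order-preserving rescaling of the feature yields the same candidate partitions with the same squared errors, and hence the same chosen split. A small sketch with a hand-rolled stump (the helper `best_split` is our own, for illustration only):

```python
import numpy as np

def best_split(x, y):
    """Return the left-side membership mask of the squared-error-optimal
    stump split on a single feature."""
    best_sse, best_mask = np.inf, None
    for t in np.unique(x)[:-1]:              # candidate thresholds
        mask = x <= t
        l, r = y[mask], y[~mask]
        sse = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
        if sse < best_sse:
            best_sse, best_mask = sse, mask
    return best_mask

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = (x > 0).astype(float) + rng.normal(scale=0.1, size=30)

# The chosen partition is identical for x and a monotonic rescaling of x:
print(np.array_equal(best_split(x, y), best_split(1000 * x + 7, y)))  # True
```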

5. Not using early stopping

# In recent XGBoost versions (2.x), early_stopping_rounds is set on the
# estimator rather than passed to fit()
xgb_clf = xgb.XGBClassifier(early_stopping_rounds=10, eval_metric='logloss')
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)]  # stops when validation loss stalls for 10 rounds
)

Practical Tips & Best Practices

Suggested Workflow

1. Start simple:

# Baseline XGBoost
model = xgb.XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

2. Evaluate and iterate:

# Check overfitting
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test, y_test):.3f}")

3. Tune key hyperparameters:

param_grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7]
}

4. Fine-tune with advanced parameters:

# After finding good base parameters, add:
param_grid = {
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'min_child_weight': [1, 3, 5]
}

5. Use cross-validation throughout!

Practical Tips & Best Practices

Gradient Boosting vs Other Methods

| Method | Pros | Cons | When to use |
|---|---|---|---|
| Decision Tree | Simple, interpretable, fast | Overfits easily, unstable | Quick baseline, interpretability needed |
| Random Forest | Robust, handles overfitting, parallel | Slower prediction, black box | General purpose, classification/regression |
| Gradient Boosting | Highest accuracy, handles complex patterns | Slower training, harder to tune, can overfit | When accuracy matters most |
| XGBoost | Fast, handles missing data, regularization | More complex, many hyperparameters | Production systems |
| LightGBM | Very fast, memory efficient | Can overfit small data | Large datasets |

Key Takeaways:

  • Start with Random Forest for robustness

  • Switch to XGBoost when you need maximum accuracy

  • Use LightGBM for very large datasets (>100K rows)

  • Always compare multiple methods on your specific problem!

Summary

Key Concepts

Core Ideas:

  • Gradient Boosting builds trees sequentially
  • Each tree corrects errors (residuals) of previous trees
  • Uses gradient descent in function space
  • Final prediction: \(F(x) = F_0(x) + \alpha \sum_{m=1}^M h_m(x)\)

Key Differences:

  • Random Forest: Trees independent, vote/average
  • Gradient Boosting: Trees sequential, additive
  • Random Forest: Deep trees
  • Gradient Boosting: Shallow trees (weak learners)

Critical Hyperparameters:

  1. n_estimators: Number of trees (100-1000)
  2. learning_rate: Step size (0.01-0.3)
  3. max_depth: Tree depth (3-8)
  4. subsample: Row sampling (0.5-1.0)
  5. colsample_bytree: Feature sampling

Implementation Choice:

  • Learning: sklearn GradientBoosting
  • Production: XGBoost (recommended)
  • Big Data: LightGBM

Summary

Remember

✓ Gradient Boosting is one of the most powerful ML algorithms for tabular data

✓ It wins many Kaggle competitions

✓ Trade-off: Performance vs Training time and Tuning complexity

✓ Always use cross-validation for hyperparameter tuning

✓ Start simple, then iterate and improve

✓ Watch for overfitting (monitor train vs test performance)

XGBoost is your best friend for most real-world problems!

“In the land of tabular data, Gradient Boosting is king”

— Every Kaggle Grandmaster

Questions?

Instinct Institute

Mork Mongkul