Gradient Boosting

AI Bootcamp


Mork Mongkul

Motivation & Introduction

From Random Forest to Gradient Boosting

  • Random Forest builds multiple trees independently and in parallel.

  • Each tree is trained on:
    • A bootstrap sample of the data

    • A random subset of features

  • Final prediction:
    • Regression: Average of all tree predictions

    • Classification: Majority vote

  • This is called Bagging (Bootstrap Aggregating)

Multiple trees vote/average their predictions

Motivation & Introduction

The Key Question

  • Random Forest: Trees are independent

  • Each tree makes predictions without knowing what other trees predicted

  • What if we could make trees learn from each other’s mistakes?

  • This leads us to Boosting!

Can subsequent trees fix previous errors?

Motivation & Introduction

Introduction to Boosting

  • Boosting: Build trees sequentially

  • Each new tree focuses on correcting errors made by previous trees

  • Key differences from Random Forest:
    • Trees are built one after another (not in parallel)

    • Each tree learns from previous mistakes

    • Typically uses shallow trees (weak learners)

Green lines show errors that the next tree will try to fix

Gradient Boosting: The Algorithm

Gradient Boosting

Sequential Error Correction

  • Step 1: Start with an initial prediction
    • Usually the mean (regression) or log-odds (classification)

  • Step 2: Calculate residuals (errors)
    • \(\text{residual}_i = y_i - \hat{y}_i\)

  • Step 3: Train a tree to predict these residuals

  • Step 4: Update predictions:
    • \(\hat{y}_{\text{new}} = \hat{y}_{\text{old}} + \color{red}{\alpha} \times \text{tree prediction}\)

    • \(\color{red}{\alpha}\) is the learning rate

  • Step 5: Repeat Steps 2-4 for \(M\) trees
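Steps 1-5 can be sketched from scratch for squared loss. This is a minimal illustration, not a production implementation: depth-1 "stumps" stand in for the regression trees, and the helper names (`fit_stump`, `gradient_boost`) are our own.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a depth-1 tree (stump): the threshold split minimizing
    squared error against the residuals; each leaf predicts its mean."""
    best = None
    for t in np.unique(x)[:-1]:                    # candidate thresholds
        left, right = residuals[x <= t], residuals[x > t]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, n_trees=100, alpha=0.1):
    """Steps 1-5 for squared loss."""
    pred = np.full_like(y, y.mean(), dtype=float)  # Step 1: initial prediction
    for _ in range(n_trees):                       # Step 5: repeat for M trees
        residuals = y - pred                       # Step 2: current errors
        stump = fit_stump(x, residuals)            # Step 3: tree on residuals
        pred = pred + alpha * stump(x)             # Step 4: shrunken update
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(gradient_boost(x, y).round(2))  # close to [2. 4. 6. 8.]
```

With enough small steps, the sum of stumps drives the residuals toward zero on the training data.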

Gradient Boosting

Mathematical Formulation

  • Goal: Find function \(F(x)\) that minimizes the loss \(L(y, F(x))\)

  • Initialize: \[F_0(x) = \arg\min_{\gamma}\sum_{i=1}^n L(y_i, \gamma)\]
    • For squared loss: \(F_0(x) = \bar{y}\) (mean of targets)

  • For \(m = 1\) to \(M\):

  1. Compute pseudo-residuals (negative gradient): \[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\]
     • For squared loss: \(r_{im} = y_i - F_{m-1}(x_i)\) (the actual residuals)
  2. Fit a regression tree \(h_m(x)\) to predict \(r_{im}\)
  3. Update the model: \[F_m(x) = F_{m-1}(x) + \color{red}{\alpha} \cdot h_m(x)\] where \(\color{red}{\alpha}\) is the learning rate
  • Final prediction: \(F(x) = F_M(x)\)

Gradient Boosting

Why “Gradient”?

  • We’re performing gradient descent in function space!

  • Instead of updating parameters \(\theta\): \[\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L\]

  • We update the function \(F(x)\): \[F_m(x) = F_{m-1}(x) - \alpha \nabla_F L\]

  • The tree \(h_m(x)\) approximates the negative gradient: \[h_m(x) \approx -\nabla_F L(y, F_{m-1}(x))\]

  • This is why we call it Gradient Boosting!

Each tree takes a step in the direction that reduces loss
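For squared loss, the link between residuals and gradients is a one-line computation. Writing the loss with the conventional \(\tfrac{1}{2}\) factor (which cancels the 2 from differentiation; without it the gradient simply picks up a factor of 2):

\[L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Longrightarrow\quad -\frac{\partial L}{\partial F} = y - F\]

So for squared loss, fitting a tree to the residuals is exactly fitting it to the negative gradient; for other losses (e.g. log loss), the pseudo-residuals are no longer the plain residuals.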

Gradient Boosting

A Simple Example (Regression)

Data:

| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |

Step 1: Initialize \[F_0(x) = \bar{y} = \frac{2+4+6+8}{4} = 5\]

Step 2: Calculate residuals (iteration 1)

| \(x\) | \(y\) | \(F_0(x)\) | \(r_1 = y - F_0(x)\) |
|---|---|---|---|
| 1 | 2 | 5 | -3 |
| 2 | 4 | 5 | -1 |
| 3 | 6 | 5 | 1 |
| 4 | 8 | 5 | 3 |

Step 3: Fit tree \(h_1(x)\) to predict residuals

Suppose the tree learns: \[h_1(x) = \begin{cases} -2 & \text{if } x \leq 2 \\ 2 & \text{if } x > 2 \end{cases}\]

Step 4: Update predictions (with \(\alpha=0.5\)) \[F_1(x) = F_0(x) + 0.5 \times h_1(x)\]

| \(x\) | \(F_0(x)\) | \(h_1(x)\) | \(F_1(x)\) |
|---|---|---|---|
| 1 | 5 | -2 | 4 |
| 2 | 5 | -2 | 4 |
| 3 | 5 | 2 | 6 |
| 4 | 5 | 2 | 6 |

New residuals:

| \(x\) | \(y\) | \(F_1(x)\) | \(r_2 = y - F_1(x)\) |
|---|---|---|---|
| 1 | 2 | 4 | -2 |
| 2 | 4 | 4 | 0 |
| 3 | 6 | 6 | 0 |
| 4 | 8 | 6 | 2 |

Continue for \(M\) iterations…
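The worked iteration above can be checked in a few lines of NumPy, with the stump \(h_1\) hard-coded exactly as given on the slide:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

F0 = np.full(4, y.mean())        # Step 1: F0(x) = mean of y = 5
r1 = y - F0                      # Step 2: residuals [-3, -1, 1, 3]
h1 = np.where(x <= 2, -2, 2)     # Step 3: the stump from the slide
F1 = F0 + 0.5 * h1               # Step 4: update with alpha = 0.5
r2 = y - F1                      # new residuals

print(F1)  # predictions move toward y: [4, 4, 6, 6]
print(r2)  # residuals shrink: [-2, 0, 0, 2]
```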

Key Hyperparameters

Controlling the Learning Process

  1. Number of trees (n_estimators)
    • More trees \(\Rightarrow\) better fit (but slower, risk of overfitting)

    • Typical values: 100-1000

  2. Learning rate (learning_rate)
    • Controls how much each tree contributes

    • Smaller \(\alpha\) \(\Rightarrow\) more trees needed, but better generalization

    • Typical values: 0.01-0.3

    • Trade-off: \(\downarrow \alpha\) requires \(\uparrow M\) trees

  3. Tree depth (max_depth)
    • Boosting works well with shallow trees

    • Typical values: 3-8 (much shallower than Random Forest!)

Key Hyperparameters

Additional Important Parameters

  1. Subsampling (subsample)
    • Fraction of samples used to fit each tree

    • Values < 1.0 introduce randomness (like Random Forest)

    • Typical values: 0.5-1.0

    • Called Stochastic Gradient Boosting when < 1.0

  2. Feature subsampling
    • max_features: Features considered per split

    • colsample_bytree: Features per tree (XGBoost)

    • Reduces correlation between trees

  3. Regularization
    • min_samples_leaf, min_samples_split

    • Prevents overfitting by limiting tree complexity

Typical hyperparameter combinations:

| Purpose | n_estimators | learning_rate | max_depth |
|---|---|---|---|
| Quick baseline | 100 | 0.1 | 3 |
| Better performance | 500 | 0.05 | 4-5 |
| Competition-grade | 1000+ | 0.01-0.03 | 5-8 |

Important relationships:

  • \(\downarrow\) learning_rate \(\Rightarrow\) \(\uparrow\) n_estimators needed

  • Smaller trees (max_depth) are usually better in boosting

  • Use cross-validation to find optimal values!
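The learning_rate/n_estimators trade-off can be made concrete with an idealized back-of-envelope model: if each tree fit the residuals perfectly, an update with learning rate \(\alpha\) would leave a \((1-\alpha)\) fraction of the error, so \((1-\alpha)^M\) remains after \(M\) trees. The helper below (`trees_needed`, our own name) counts trees needed to reach 1% of the initial error under this simplification:

```python
import math

def trees_needed(alpha, target=0.01):
    """Trees required for (1 - alpha)**M <= target, i.e. to shrink the
    residual to `target` of its starting size in the idealized model."""
    return math.ceil(math.log(target) / math.log(1 - alpha))

for alpha in (0.3, 0.1, 0.03, 0.01):
    print(f"alpha={alpha:<5} -> ~{trees_needed(alpha)} trees")
# alpha=0.3 -> ~13, alpha=0.1 -> ~44, alpha=0.03 -> ~152, alpha=0.01 -> ~459
```

Real trees only approximate the residuals, so actual counts are higher, but the inverse relationship (\(\downarrow \alpha \Rightarrow \uparrow M\), roughly \(M \propto 1/\alpha\)) holds.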

Gradient Boosting Variants

Three Main Implementations

1. Sklearn GradientBoosting

Pros:

  • Built into scikit-learn
  • Simple API, consistent with sklearn
  • Good for learning the algorithm

Cons:

  • Slower than XGBoost/LightGBM
  • Fewer features
  • Not ideal for large datasets

When to use:

  • Teaching/learning
  • Small to medium datasets
  • When you need sklearn integration

2. XGBoost

Pros:

  • Industry standard
  • Very fast (parallel processing)
  • Handles missing values
  • Advanced regularization
  • Dominant in Kaggle competitions

Cons:

  • Requires separate installation
  • More hyperparameters to tune

When to use:

  • Production systems
  • Medium to large datasets
  • When you need best performance

3. LightGBM

Pros:

  • Extremely fast
  • Memory efficient
  • Great for large datasets (>10K rows)
  • Handles categorical features natively
  • Leaf-wise tree growth (vs level-wise)

Cons:

  • Can overfit on small datasets
  • Requires careful tuning

When to use:

  • Large datasets
  • When speed is critical
  • Limited computational resources

Gradient Boosting Variants

Quick Comparison

| Feature | Sklearn GB | XGBoost | LightGBM |
|---|---|---|---|
| Speed | Slow | Fast | Very Fast |
| Memory | High | Medium | Low |
| Missing values | Manual | Automatic | Automatic |
| Categorical features | Manual encoding | Manual encoding | Native support |
| Tree growth | Level-wise | Level-wise | Leaf-wise |
| Installation | Built-in | pip install xgboost | pip install lightgbm |
| Best for | Learning | Production | Big data |
| Kaggle popularity | Low | Very High | High |

Gradient Boosting in Action

Implementation with XGBoost

Dataset: Heart Disease Dataset

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import xgboost as xgb

# Load and prepare data
path = "../Data"
data = pd.read_csv(path + "/heart.csv")
data_no_dup = data.drop_duplicates()

X = data_no_dup.iloc[:, :-1]
y = data_no_dup.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Gradient Boosting in Action

XGBoost with Hyperparameter Tuning

# Define XGBoost classifier
xgb_clf = xgb.XGBClassifier(
    random_state=42,
    eval_metric='logloss'  # Suppress warning
)

# Hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# GridSearch with cross-validation
grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    cv=10,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Best model
best_xgb = grid_search.best_estimator_
y_pred = best_xgb.predict(X_test)

print(f"Best hyperparameters: {grid_search.best_params_}")
Fitting 10 folds for each of 108 candidates, totalling 1080 fits
Best hyperparameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100, 'subsample': 1.0}

Gradient Boosting in Action

Results Comparison

# Create results dataframe
results_df = pd.DataFrame({
    'Accuracy': [
        accuracy_score(y_test, y_pred),
        0.885246,  # 16-NN (from Decision Tree slides)
        grid_search.best_score_  # CV score
    ],
    'Precision': [
        precision_score(y_test, y_pred),
        0.882353,
        np.nan
    ],
    'Recall': [
        recall_score(y_test, y_pred),
        0.909091,
        np.nan
    ],
    'F1-score': [
        f1_score(y_test, y_pred),
        0.909091,
        np.nan
    ]
}, index=['Gradient Boosting (Test)', '16-NN (Test)', 'Gradient Boosting (CV)'])

results_df
| | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Gradient Boosting (Test) | 0.737705 | 0.757576 | 0.757576 | 0.757576 |
| 16-NN (Test) | 0.885246 | 0.882353 | 0.909091 | 0.909091 |
| Gradient Boosting (CV) | 0.838167 | NaN | NaN | NaN |

Important insights from these results:

  • Small dataset effect: With only ~300 samples, simpler models (K-NN) can outperform complex ensembles

  • When Gradient Boosting excels: Typically on datasets with 1000+ samples and complex non-linear patterns

  • Always compare multiple models: No single algorithm is best for all problems

  • CV score is crucial: Use cross-validation to get reliable performance estimates

Gradient Boosting in Action

Feature Importance

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 5 most important features:")
print(feature_importance.head())
Top 5 most important features:
   feature  importance
2       cp    0.283735
12    thal    0.109298
11      ca    0.104121
8    exang    0.086912
1      sex    0.061423

Gradient Boosting in Action

Learning Curves

# Train models with different numbers of trees to see the learning curve
import plotly.graph_objects as go
from IPython.display import HTML

train_scores = []
test_scores = []
n_estimators_range = range(10, 301, 10)

for n in n_estimators_range:
    model = xgb.XGBClassifier(
        n_estimators=n,
        learning_rate=grid_search.best_params_['learning_rate'],
        max_depth=grid_search.best_params_['max_depth'],
        subsample=grid_search.best_params_['subsample'],
        colsample_bytree=grid_search.best_params_['colsample_bytree'],
        random_state=42,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, model.predict(X_train)))
    test_scores.append(accuracy_score(y_test, model.predict(X_test)))

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(n_estimators_range),
    y=train_scores,
    mode='lines',
    name='Train accuracy',
    line=dict(color='blue', width=2)
))
fig.add_trace(go.Scatter(
    x=list(n_estimators_range),
    y=test_scores,
    mode='lines',
    name='Test accuracy',
    line=dict(color='red', width=2)
))

fig.update_layout(
    title='Learning Curve: Performance vs Number of Trees',
    xaxis_title='Number of Trees',
    yaxis_title='Accuracy',
    width=700,
    height=400,
    margin=dict(l=0, r=0, t=40, b=0)
)

HTML(fig.to_html(include_plotlyjs='cdn'))
  • Note how accuracy improves with more trees, then plateaus

  • Watch for the gap between train and test (overfitting signal)

Practical Tips & Best Practices

When to Use Gradient Boosting

✓ Good use cases:

  • Tabular data (structured datasets)
    • Customer churn prediction
    • Credit risk modeling
    • House price prediction
    • Medical diagnosis
  • Medium to large datasets (1K+ rows)
  • When accuracy is critical
    • Kaggle competitions
    • Production ML systems
  • Mixed feature types
    • Numerical + categorical features
    • Missing data (XGBoost handles it well)
✗ Avoid when:

  • Very small datasets (<100 rows)
    • High risk of overfitting
    • Simpler models work better
  • Image/audio/video data
    • Use deep learning instead
    • CNN/RNN are more suitable
  • When interpretability is crucial
    • Ensemble of trees is hard to explain
    • Use linear models or single trees
  • Real-time inference with strict latency
    • 100s of trees can be slow
    • Consider simpler models

Practical Tips & Best Practices

Common Pitfalls and How to Avoid Them

1. Overfitting

Symptoms:

  • High train accuracy, low test accuracy
  • Large gap in learning curves

Solutions:

  • ↓ learning_rate + ↑ n_estimators
  • ↑ regularization (min_child_weight, gamma)
  • Use subsample < 1.0
  • Reduce max_depth
  • Early stopping with validation set

2. Too slow training

Solutions:

  • Use LightGBM instead
  • Reduce n_estimators
  • Use subsample and colsample_bytree
  • Parallel processing (n_jobs=-1)

3. Poor hyperparameter choices

Solutions:

  • Start with defaults, then tune
  • Use RandomizedSearchCV for large grids
  • Focus on: n_estimators, learning_rate, max_depth
  • Monitor validation performance

4. Forgetting to scale (XGBoost)

Good news:

  • Tree-based models don’t need feature scaling!
  • But categorical encoding is still needed
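Why trees don't need scaling: a split only compares a feature to a threshold, so any order-preserving rescaling of the feature yields the same candidate partitions with the same squared errors, and hence the same chosen split. A small sketch with a hand-rolled stump (the helper `best_split` is our own, for illustration only):

```python
import numpy as np

def best_split(x, y):
    """Return the left-side membership mask of the squared-error-optimal
    stump split on a single feature."""
    best_sse, best_mask = np.inf, None
    for t in np.unique(x)[:-1]:              # candidate thresholds
        mask = x <= t
        l, r = y[mask], y[~mask]
        sse = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
        if sse < best_sse:
            best_sse, best_mask = sse, mask
    return best_mask

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = (x > 0).astype(float) + rng.normal(scale=0.1, size=30)

# The chosen partition is identical for x and a monotonic rescaling of x:
print(np.array_equal(best_split(x, y), best_split(1000 * x + 7, y)))  # True
```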

5. Not using early stopping

# In recent XGBoost versions (2.x), early_stopping_rounds is set on the
# estimator rather than passed to fit()
xgb_clf = xgb.XGBClassifier(early_stopping_rounds=10, eval_metric='logloss')
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)]  # stops when validation loss stalls for 10 rounds
)

Practical Tips & Best Practices

Suggested Workflow

1. Start simple:

# Baseline XGBoost
model = xgb.XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

2. Evaluate and iterate:

# Check overfitting
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test, y_test):.3f}")

3. Tune key hyperparameters:

param_grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7]
}

4. Fine-tune with advanced parameters:

# After finding good base parameters, add:
param_grid = {
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'min_child_weight': [1, 3, 5]
}

5. Use cross-validation throughout!

Practical Tips & Best Practices

Gradient Boosting vs Other Methods

| Method | Pros | Cons | When to use |
|---|---|---|---|
| Decision Tree | Simple, interpretable, fast | Overfits easily, unstable | Quick baseline, interpretability needed |
| Random Forest | Robust, handles overfitting, parallel | Slower prediction, black box | General purpose, classification/regression |
| Gradient Boosting | Highest accuracy, handles complex patterns | Slower training, harder to tune, can overfit | When accuracy matters most |
| XGBoost | Fast, handles missing data, regularization | More complex, many hyperparameters | Production systems |
| LightGBM | Very fast, memory efficient | Can overfit small data | Large datasets |

Key Takeaways:

  • Start with Random Forest for robustness

  • Switch to XGBoost when you need maximum accuracy

  • Use LightGBM for very large datasets (>100K rows)

  • Always compare multiple methods on your specific problem!

Summary

Key Concepts

Core Ideas:

  • Gradient Boosting builds trees sequentially
  • Each tree corrects errors (residuals) of previous trees
  • Uses gradient descent in function space
  • Final prediction: \(F(x) = F_0(x) + \alpha \sum_{m=1}^M h_m(x)\)

Key Differences:

  • Random Forest: Trees independent, vote/average
  • Gradient Boosting: Trees sequential, additive
  • Random Forest: Deep trees
  • Gradient Boosting: Shallow trees (weak learners)

Critical Hyperparameters:

  1. n_estimators: Number of trees (100-1000)
  2. learning_rate: Step size (0.01-0.3)
  3. max_depth: Tree depth (3-8)
  4. subsample: Row sampling (0.5-1.0)
  5. colsample_bytree: Feature sampling

Implementation Choice:

  • Learning: sklearn GradientBoosting
  • Production: XGBoost (recommended)
  • Big Data: LightGBM

Summary

Remember

✓ Gradient Boosting is one of the most powerful ML algorithms for tabular data

✓ It wins many Kaggle competitions

✓ Trade-off: Performance vs Training time and Tuning complexity

✓ Always use cross-validation for hyperparameter tuning

✓ Start simple, then iterate and improve

✓ Watch for overfitting (monitor train vs test performance)

XGBoost is your best friend for most real-world problems!

“In the land of tabular data, Gradient Boosting is king”

— Every Kaggle Grandmaster

Questions?

Instinct Institute

Mork Mongkul