AI Bootcamp
Random Forest builds multiple trees independently and in parallel. Each tree is trained on:
- A bootstrap sample of the data
- A random subset of features

Predictions are combined across trees:
- Regression: Average of all tree predictions
- Classification: Majority vote
This is called Bagging (Bootstrap Aggregating)
Multiple trees vote/average their predictions
Random Forest: Trees are independent
Each tree makes predictions without knowing what other trees predicted
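The aggregation step of bagging can be sketched with plain NumPy. The tree predictions here are hypothetical toy values, just to show how averaging and majority voting work:

```python
import numpy as np

# Hypothetical predictions from 5 independently trained trees
tree_preds_reg = np.array([2.1, 1.9, 2.4, 2.0, 1.8])  # regression outputs
tree_preds_clf = np.array([1, 0, 1, 1, 0])            # class votes

# Regression: average of all tree predictions
avg_pred = tree_preds_reg.mean()

# Classification: majority vote (count votes per class, take the largest)
majority = np.bincount(tree_preds_clf).argmax()

print(avg_pred)   # average of the five outputs
print(majority)   # winning class
```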
This leads us to Boosting!
Can subsequent trees fix previous errors?
Boosting: Build trees sequentially
Each new tree focuses on correcting errors made by previous trees
Trees are built one after another (not in parallel)
Each tree learns from previous mistakes
Typically uses shallow trees (weak learners)
Green lines show errors that the next tree will try to fix
Step 1: Initialize with a constant prediction, usually the mean (regression) or log-odds (classification)
Step 2: Calculate the residuals
\(\text{residual}_i = y_i - \hat{y}_i\)
Step 3: Train a tree to predict these residuals
\(\hat{y}_{\text{new}} = \hat{y}_{\text{old}} + \color{red}{\alpha} \times \text{tree prediction}\)
\(\color{red}{\alpha}\) is the learning rate
Step 5: Repeat Steps 2-4 for \(M\) trees
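Steps 1-5 fit in a few lines of Python using scikit-learn's DecisionTreeRegressor as the weak learner. This is a minimal sketch for squared loss; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=50, alpha=0.1, max_depth=2):
    """Fit a gradient-boosted ensemble of M shallow trees (squared loss)."""
    F0 = y.mean()                          # Step 1: initialize with the mean
    pred = np.full_like(y, F0, dtype=float)
    trees = []
    for _ in range(M):                     # Step 5: repeat for M trees
        residuals = y - pred               # Step 2: compute residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # Step 3: fit a tree to the residuals
        pred += alpha * tree.predict(X)    # Step 4: update with learning rate alpha
        trees.append(tree)
    return F0, trees

def gradient_boost_predict(X, F0, trees, alpha=0.1):
    pred = np.full(X.shape[0], F0)
    for tree in trees:
        pred += alpha * tree.predict(X)
    return pred

# Tiny demo: learn y = 2x
X = np.arange(1, 5).reshape(-1, 1).astype(float)
y = np.array([2.0, 4.0, 6.0, 8.0])
F0, trees = gradient_boost_fit(X, y)
print(gradient_boost_predict(X, F0, trees))  # close to [2, 4, 6, 8]
```

Each pass shrinks the residuals by roughly a factor of \(1 - \alpha\), which is why small learning rates need more trees.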
Goal: Find function \(F(x)\) that minimizes the loss \(L(y, F(x))\)
For squared loss: \(F_0(x) = \bar{y}\) (mean of targets)
For \(m = 1\) to \(M\):
- Compute pseudo-residuals: \(r_{im} = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}\)
- Fit a tree \(h_m(x)\) to the pseudo-residuals
- Update: \(F_m(x) = F_{m-1}(x) + \alpha\, h_m(x)\)
Final prediction: \(F(x) = F_M(x)\)
We’re performing gradient descent in function space!
Instead of updating parameters \(\theta\): \[\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L\]
We update the function \(F(x)\): \[F_m(x) = F_{m-1}(x) - \alpha \nabla_F L\]
The tree \(h_m(x)\) approximates the negative gradient: \[h_m(x) \approx -\nabla_F L(y, F_{m-1}(x))\]
This is why we call it Gradient Boosting!
Each tree takes a step in the direction that reduces loss
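For squared loss this link is exact: the negative gradient of the loss with respect to the prediction is the residual itself.

\[L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = y - F = \text{residual}\]

So fitting each tree to the residuals is exactly one gradient-descent step under squared loss; other losses just swap in a different negative gradient.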
Data:
| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |
Step 1: Initialize \[F_0(x) = \bar{y} = \frac{2+4+6+8}{4} = 5\]
Step 2: Calculate residuals (iteration 1)
| \(x\) | \(y\) | \(F_0(x)\) | \(r_1 = y - F_0(x)\) |
|---|---|---|---|
| 1 | 2 | 5 | -3 |
| 2 | 4 | 5 | -1 |
| 3 | 6 | 5 | 1 |
| 4 | 8 | 5 | 3 |
Step 3: Fit tree \(h_1(x)\) to predict residuals
Suppose the tree learns: \[h_1(x) = \begin{cases} -2 & \text{if } x \leq 2 \\ 2 & \text{if } x > 2 \end{cases}\]
Step 4: Update predictions (with \(\alpha=0.5\)) \[F_1(x) = F_0(x) + 0.5 \times h_1(x)\]
| \(x\) | \(F_0(x)\) | \(h_1(x)\) | \(F_1(x)\) |
|---|---|---|---|
| 1 | 5 | -2 | 4 |
| 2 | 5 | -2 | 4 |
| 3 | 5 | 2 | 6 |
| 4 | 5 | 2 | 6 |
New residuals:
| \(x\) | \(y\) | \(F_1(x)\) | \(r_2 = y - F_1(x)\) |
|---|---|---|---|
| 1 | 2 | 4 | -2 |
| 2 | 4 | 4 | 0 |
| 3 | 6 | 6 | 0 |
| 4 | 8 | 6 | 2 |
Continue for \(M\) iterations…
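The worked example above can be checked numerically, reproducing \(F_0\), the residuals, and \(F_1\) with \(\alpha = 0.5\):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])

# Step 1: initialize with the mean
F0 = y.mean()                      # 5.0

# Step 2: residuals of iteration 1
r1 = y - F0                        # [-3, -1, 1, 3]

# Step 3: the stump from the slide: -2 if x <= 2, else 2
h1 = np.where(x <= 2, -2.0, 2.0)

# Step 4: update with learning rate 0.5
F1 = F0 + 0.5 * h1                 # [4, 4, 6, 6]
r2 = y - F1                        # [-2, 0, 0, 2]

print(F0, r1, F1, r2)
```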
Number of trees (n_estimators)
More trees \(\Rightarrow\) better fit (but slower, risk of overfitting)
Typical values: 100-1000
Learning rate (learning_rate)
Controls how much each tree contributes
Smaller \(\alpha\) \(\Rightarrow\) more trees needed, but better generalization
Typical values: 0.01-0.3
Trade-off: \(\downarrow \alpha\) requires \(\uparrow M\) trees
Maximum tree depth (max_depth)
Boosting works well with shallow trees
Typical values: 3-8 (much shallower than Random Forest!)
Row subsampling (subsample)
Fraction of samples used to fit each tree
Values < 1.0 introduce randomness (like Random Forest)
Typical values: 0.5-1.0
Called Stochastic Gradient Boosting when < 1.0
max_features: Features considered per split
colsample_bytree: Features per tree (XGBoost)
Reduces correlation between trees
min_samples_leaf, min_samples_split
Prevents overfitting by limiting tree complexity
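All of these knobs appear directly as constructor arguments. A sketch with scikit-learn's GradientBoostingClassifier on synthetic data; the values roughly follow the "better performance" range and are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,       # number of trees
    learning_rate=0.05,     # contribution of each tree
    max_depth=4,            # shallow trees (weak learners)
    subsample=0.8,          # < 1.0 => stochastic gradient boosting
    max_features="sqrt",    # features considered per split
    min_samples_leaf=5,     # limit tree complexity
    random_state=0,
).fit(X, y)

print(model.score(X, y))    # training accuracy
```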
Typical hyperparameter combinations:
| Purpose | n_estimators | learning_rate | max_depth |
|---|---|---|---|
| Quick baseline | 100 | 0.1 | 3 |
| Better performance | 500 | 0.05 | 4-5 |
| Competition-grade | 1000+ | 0.01-0.03 | 5-8 |
Important relationships:
\(\downarrow\) learning_rate \(\Rightarrow\) \(\uparrow\) n_estimators needed
Smaller trees (max_depth) are usually better in boosting
Use cross-validation to find optimal values!
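The learning-rate/tree-count trade-off shows up empirically: many small steps reach roughly the same fit as fewer large steps. A sketch on synthetic regression data (the dataset and exact values are illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Big steps, few trees vs small steps, many trees
fast = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100,
                                 max_depth=3, random_state=0).fit(X_tr, y_tr)
slow = GradientBoostingRegressor(learning_rate=0.01, n_estimators=1000,
                                 max_depth=3, random_state=0).fit(X_tr, y_tr)

# Both settings should give similar test R^2
print(round(fast.score(X_te, y_te), 3), round(slow.score(X_te, y_te), 3))
```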
| Feature | Sklearn GB | XGBoost | LightGBM |
|---|---|---|---|
| Speed | Slow | Fast | Very Fast |
| Memory | High | Medium | Low |
| Missing values | Manual | Automatic | Automatic |
| Categorical features | Manual encoding | Manual encoding | Native support |
| Tree growth | Level-wise | Level-wise | Leaf-wise |
| Installation | Built-in | pip install xgboost | pip install lightgbm |
| Best for | Learning | Production | Big data |
| Kaggle popularity | Low | Very High | High |
Dataset: Heart Disease Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import xgboost as xgb
# Load and prepare data
path = "../Data"
data = pd.read_csv(path + "/heart.csv")
data_no_dup = data.drop_duplicates()
X = data_no_dup.iloc[:, :-1]
y = data_no_dup.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)

# Define XGBoost classifier
xgb_clf = xgb.XGBClassifier(
random_state=42,
eval_metric='logloss' # Suppress warning
)
# Hyperparameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.05, 0.1],
'max_depth': [3, 4, 5],
'subsample': [0.8, 1.0],
'colsample_bytree': [0.8, 1.0]
}
# GridSearch with cross-validation
grid_search = GridSearchCV(
estimator=xgb_clf,
param_grid=param_grid,
cv=10,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
# Best model
best_xgb = grid_search.best_estimator_
y_pred = best_xgb.predict(X_test)
print(f"Best hyperparameters: {grid_search.best_params_}")

Fitting 10 folds for each of 108 candidates, totalling 1080 fits
Best hyperparameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100, 'subsample': 1.0}
# Create results dataframe
results_df = pd.DataFrame({
'Accuracy': [
accuracy_score(y_test, y_pred),
0.885246, # 16-NN (from Decision Tree slides)
grid_search.best_score_ # CV score
],
'Precision': [
precision_score(y_test, y_pred),
0.882353,
np.nan
],
'Recall': [
recall_score(y_test, y_pred),
0.909091,
np.nan
],
'F1-score': [
f1_score(y_test, y_pred),
0.909091,
np.nan
]
}, index=['Gradient Boosting (Test)', '16-NN (Test)', 'Gradient Boosting (CV)'])
results_df

| | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Gradient Boosting (Test) | 0.737705 | 0.757576 | 0.757576 | 0.757576 |
| 16-NN (Test) | 0.885246 | 0.882353 | 0.909091 | 0.909091 |
| Gradient Boosting (CV) | 0.838167 | NaN | NaN | NaN |
Important insights from these results:
Small dataset effect: With only ~300 samples, simpler models (K-NN) can outperform complex ensembles
When Gradient Boosting excels: Typically on datasets with 1000+ samples and complex non-linear patterns
Always compare multiple models: No single algorithm is best for all problems
CV score is crucial: Use cross-validation to get reliable performance estimates
Top 5 most important features:
feature importance
2 cp 0.283735
12 thal 0.109298
11 ca 0.104121
8 exang 0.086912
1 sex 0.061423
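The importance table above comes from the fitted model's feature_importances_ attribute. A self-contained sketch on synthetic data (the column names here are illustrative, not the heart-disease features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=42)
cols = [f"f{i}" for i in range(6)]   # hypothetical feature names

model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Importances sum to 1; sort to rank the features
importances = (pd.DataFrame({"feature": cols,
                             "importance": model.feature_importances_})
               .sort_values("importance", ascending=False))
print(importances.head(5))           # top 5 most important features
```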
import plotly.graph_objects as go
from IPython.display import HTML

# Train models with different numbers of trees to see learning curve
train_scores = []
test_scores = []
n_estimators_range = range(10, 301, 10)
for n in n_estimators_range:
model = xgb.XGBClassifier(
n_estimators=n,
learning_rate=grid_search.best_params_['learning_rate'],
max_depth=grid_search.best_params_['max_depth'],
subsample=grid_search.best_params_['subsample'],
colsample_bytree=grid_search.best_params_['colsample_bytree'],
random_state=42,
eval_metric='logloss'
)
model.fit(X_train, y_train)
train_scores.append(accuracy_score(y_train, model.predict(X_train)))
test_scores.append(accuracy_score(y_test, model.predict(X_test)))
fig = go.Figure()
fig.add_trace(go.Scatter(
x=list(n_estimators_range),
y=train_scores,
mode='lines',
name='Train accuracy',
line=dict(color='blue', width=2)
))
fig.add_trace(go.Scatter(
x=list(n_estimators_range),
y=test_scores,
mode='lines',
name='Test accuracy',
line=dict(color='red', width=2)
))
fig.update_layout(
title='Learning Curve: Performance vs Number of Trees',
xaxis_title='Number of Trees',
yaxis_title='Accuracy',
width=700,
height=400,
margin=dict(l=0, r=0, t=40, b=0)
)
HTML(fig.to_html(include_plotlyjs='cdn'))

Note how accuracy improves with more trees, then plateaus
Watch for the gap between train and test (overfitting signal)
Problem: Overfitting

Symptoms:
- High train accuracy, low test accuracy
- Large gap in learning curves

Solutions:
- ↓ learning_rate + ↑ n_estimators
- ↑ regularization (min_child_weight, gamma)
- Use subsample < 1.0
- Reduce max_depth
- Early stopping with validation set
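Early stopping with a validation set is built into scikit-learn's gradient boosting via n_iter_no_change and validation_fraction; XGBoost has an analogous mechanism. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Stop adding trees once the score on a 10% held-out validation split
# fails to improve for 10 consecutive iterations
model = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound, rarely reached
    learning_rate=0.1,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)

print(model.n_estimators_)      # trees actually fitted, usually far below 1000
```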
Problem: Slow training

Solutions:
- Use LightGBM instead
- Reduce n_estimators
- Use subsample and colsample_bytree
- Parallel processing (n_jobs=-1)
Problem: Too many hyperparameters to tune

Solutions:
- Start with defaults, then tune
- Use RandomizedSearchCV for large grids
- Focus on: n_estimators, learning_rate, max_depth
- Monitor validation performance
Good news:
- Tree-based models don't need feature scaling!
- But categorical encoding is still needed
1. Start simple:
2. Evaluate and iterate:
3. Tune key hyperparameters:
4. Fine-tune with advanced parameters:
5. Use cross-validation throughout!
| Method | Pros | Cons | When to use |
|---|---|---|---|
| Decision Tree | Simple, interpretable, fast | Overfits easily, unstable | Quick baseline, interpretability needed |
| Random Forest | Robust, handles overfitting, parallel | Slower prediction, black box | General purpose, classification/regression |
| Gradient Boosting | Highest accuracy, handles complex patterns | Slower training, harder to tune, can overfit | When accuracy matters most |
| XGBoost | Fast, handles missing data, regularization | More complex, many hyperparameters | Production systems |
| LightGBM | Very fast, memory efficient | Can overfit small data | Large datasets |
Start with Random Forest for robustness
Switch to XGBoost when you need maximum accuracy
Use LightGBM for very large datasets (>100K rows)
Always compare multiple methods on your specific problem!
Core Ideas:
Key Differences:
Critical Hyperparameters:
- n_estimators: Number of trees (100-1000)
- learning_rate: Step size (0.01-0.3)
- max_depth: Tree depth (3-8)
- subsample: Row sampling (0.5-1.0)
- colsample_bytree: Feature sampling

Implementation Choice:
✓ Gradient Boosting is one of the most powerful ML algorithms for tabular data
✓ It wins many Kaggle competitions
✓ Trade-off: Performance vs Training time and Tuning complexity
✓ Always use cross-validation for hyperparameter tuning
✓ Start simple, then iterate and improve
✓ Watch for overfitting (monitor train vs test performance)
✓ XGBoost is your best friend for most real-world problems!
“In the land of tabular data, Gradient Boosting is king”
— Every Kaggle Grandmaster
Instinct Institute
Mork Mongkul