AI Bootcamp
Decision Trees are flexible and interpretable
But they suffer from high variance
A small change in data can result in a very different tree
This leads to overfitting
Key question:
Can we stabilize trees without losing their power?
One tree = unstable learner
Random Forest is an ensemble method
It combines many Decision Trees
Final prediction is obtained by aggregation
Many decorrelated trees combined form a strong predictor
Let \(B\) be the number of trees
For \(b = 1, \dots, B\): draw a bootstrap sample from the training set and fit a tree \(\hat{f}_b\) on it
Aggregate predictions:
Classification: \[\hat{y} = \text{majority vote}\{\hat{f}_1(\mathbf{x}), \ldots, \hat{f}_B(\mathbf{x})\}\]
Regression: \[\hat{y} = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(\mathbf{x})\]
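The two aggregation rules can be sketched directly, assuming the \(B\) per-tree predictions are already available (the arrays below are made-up values):

```python
import numpy as np

# Hypothetical per-tree predictions for a single input x (B = 5 trees)
class_preds = np.array([1, 0, 1, 1, 0])          # classification: predicted labels
reg_preds = np.array([2.4, 2.9, 2.1, 2.6, 2.5])  # regression: predicted values

# Classification: majority vote across the B trees
y_hat_class = np.argmax(np.bincount(class_preds))

# Regression: simple average across the B trees
y_hat_reg = reg_preds.mean()

print(y_hat_class)           # 1 (three of the five trees vote for class 1)
print(round(y_hat_reg, 2))   # 2.5
```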
Bootstrap = sampling with replacement
Each tree sees a different version of the dataset
About 63% of the observations appear in each bootstrap sample, since \(1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632\)
The remaining ~37% are Out-of-Bag (OOB) samples
Reduces variance by averaging
Original Training Set (n=10):
| Index | Feature 1 | Feature 2 | Target |
|---|---|---|---|
| 1 | 2.3 | 1.5 | 0 |
| 2 | 1.8 | 2.1 | 1 |
| … | … | … | … |
| 10 | 3.1 | 0.8 | 1 |
Bootstrap Sample 1 (sampled with replacement):
- Indices: [1, 3, 3, 5, 7, 7, 7, 9, 10, 2]
- Samples 3 and 7 appear multiple times
- Samples 4, 6, 8 are OOB for Tree 1

Bootstrap Sample 2 (a different draw):
- Indices: [2, 2, 4, 5, 6, 8, 8, 9, 9, 10]
- Different observations are selected
- Samples 1, 3, 7 are OOB for Tree 2
Key Point: Each tree trains on ~63% unique samples, creating diversity!
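A minimal NumPy sketch of the same idea (the seed and resulting indices are illustrative, not the ones in the table above):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10
indices = np.arange(1, n + 1)

# One bootstrap sample: draw n indices with replacement
boot = rng.choice(indices, size=n, replace=True)
oob = np.setdiff1d(indices, boot)  # out-of-bag observations for this tree

print(sorted(boot.tolist()))
print(oob.tolist())

# The expected in-bag fraction is 1 - (1 - 1/n)^n, tending to 1 - 1/e ~ 0.632
fracs = [len(np.unique(rng.choice(indices, size=n, replace=True))) / n
         for _ in range(10_000)]
print(round(float(np.mean(fracs)), 3))
```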
At each split, only a subset of features is considered
This prevents strong predictors from dominating all trees
Trees become less correlated
Variance of the ensemble decreases
Typical choices:
- Classification: \(m = \sqrt{p}\)
- Regression: \(m = p/3\)
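A small sketch of these choices, assuming \(p = 16\) features; the random subset drawn at a split is illustrative:

```python
import numpy as np

p = 16  # total number of features (illustrative)

m_classification = int(np.sqrt(p))  # sqrt(p) = 4
m_regression = max(1, p // 3)       # p/3 -> 5

# At each split, the tree only evaluates a random subset of m features
rng = np.random.default_rng(0)
candidate_features = rng.choice(p, size=m_classification, replace=False)
print(m_classification, m_regression)
print(sorted(candidate_features.tolist()))
```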
Assume we have \(B\) independent models, each with variance \(\sigma^2\)
The variance of the average is: \[\text{Var}\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b\right) = \frac{\sigma^2}{B}\]
Averaging reduces variance!
In practice, trees are correlated with correlation \(\rho\): \[\text{Var}(\text{avg}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]
Bootstrap + Feature randomness \(\Rightarrow\) reduces \(\rho\) \(\Rightarrow\) reduces overall variance
As \(B \to \infty\): variance approaches \(\rho\sigma^2\) (can’t eliminate correlated part)
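Plugging illustrative values (assumed \(\sigma^2 = 1\), \(\rho = 0.3\)) into the formula shows the diminishing returns from adding trees:

```python
sigma2 = 1.0  # per-tree variance (assumed)
rho = 0.3     # average pairwise correlation between trees (assumed)

# Var(avg) = rho * sigma^2 + (1 - rho) / B * sigma^2
var_avg = {B: rho * sigma2 + (1 - rho) / B * sigma2
           for B in (1, 10, 100, 10_000)}

for B, v in var_avg.items():
    print(B, round(v, 4))
# The variance floor is rho * sigma^2 = 0.3: past a few hundred trees,
# only reducing the correlation rho helps further
```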
Random Forest improves generalization
Each tree uses ~63% of data for training
The remaining ~37% are Out-of-Bag (OOB) samples
OOB Error = Average prediction error on OOB samples
Free cross-validation! No need for separate validation set
OOB error closely approximates test error!
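A minimal sketch of OOB scoring in scikit-learn, using a synthetic dataset rather than the heart data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real dataset here
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True scores every observation using only the trees
# that did NOT see it during training
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
```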
n_estimators – number of trees (more is usually better, but diminishing returns)
max_features – features per split (\(\sqrt{p}\) for classification, \(p/3\) for regression)
max_depth – maximum tree depth (usually left unlimited)
min_samples_leaf – minimum samples required at leaf node
min_samples_split – minimum samples required to split
bootstrap – whether to use bootstrap sampling (default: True)
Trees are deep; overfitting is controlled by averaging
Random Forest provides feature importance scores
Formula: \[\text{Importance}(X_j) = \frac{1}{B}\sum_{b=1}^B \sum_{t \in T_b} \Delta i(t, X_j)\] where \(\Delta i(t, X_j)\) is impurity decrease at split \(t\) using feature \(X_j\)
Higher importance = more useful for prediction
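In scikit-learn this mean impurity decrease is exposed, normalized so the scores sum to 1, as `feature_importances_`; a small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a few informative features (illustrative)
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity decrease per feature, normalized to sum to 1
imp = rf.feature_importances_
print(imp.round(3))
print(round(float(imp.sum()), 6))
```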
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load data
df = pd.read_csv('../Data/heart.csv')
# Prepare features and target
X = df.drop('target', axis=1)
y = df['target']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier(random_state=42, oob_score=True)
param_grid = {
'n_estimators': [200, 500],
'max_features': ['sqrt', 'log2'],
'min_samples_leaf': [1, 2, 5]
}
grid = GridSearchCV(
rf,
param_grid=param_grid,
cv=10,
scoring='accuracy',
n_jobs=-1
)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_
print(f"Best hyperparameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"OOB Score: {best_rf.oob_score_:.4f}")
best_rf

Best hyperparameters: {'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV accuracy: 0.9902
OOB Score: 0.9951
RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
# Random Forest predictions
y_pred_rf = best_rf.predict(X_test)
# Train Decision Tree for comparison
dt = DecisionTreeClassifier(random_state=42, max_depth=5)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
# Create comparison table
results = pd.DataFrame({
'Model': ['Decision Tree', 'Random Forest'],
'Accuracy': [
accuracy_score(y_test, y_pred_dt),
accuracy_score(y_test, y_pred_rf)
],
'Precision': [
precision_score(y_test, y_pred_dt),
precision_score(y_test, y_pred_rf)
],
'Recall': [
recall_score(y_test, y_pred_dt),
recall_score(y_test, y_pred_rf)
],
'F1-Score': [
f1_score(y_test, y_pred_dt),
f1_score(y_test, y_pred_rf)
]
})
results.style.format({
'Accuracy': '{:.4f}',
'Precision': '{:.4f}',
'Recall': '{:.4f}',
'F1-Score': '{:.4f}'
}).background_gradient(cmap='Blues', subset=['Accuracy', 'Precision', 'Recall', 'F1-Score'])

| | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 0 | Decision Tree | 0.8439 | 0.7840 | 0.9515 | 0.8596 |
| 1 | Random Forest | 0.9854 | 1.0000 | 0.9709 | 0.9852 |
import plotly.graph_objects as go
from IPython.display import HTML
# Get feature importances
feature_names = df.drop('target', axis=1).columns.tolist()
importances = best_rf.feature_importances_
# Sort by importance
sorted_idx = np.argsort(importances)
features_sorted = [feature_names[i] for i in sorted_idx]
importances_sorted = [importances[i] for i in sorted_idx]
fig = go.Figure()
fig.add_trace(go.Bar(
x=importances_sorted,
y=features_sorted,
orientation='h',
marker=dict(
color=importances_sorted,
colorscale='Blues',
showscale=True,
colorbar=dict(title='Importance')
),
text=[f'{imp:.3f}' for imp in importances_sorted],
textposition='outside'
))
fig.update_layout(
title='Feature Importance - Heart Disease Prediction',
xaxis_title='Importance Score',
yaxis_title='Feature',
width=800, height=500,
margin=dict(l=0, r=0, t=40, b=0),
showlegend=False
)
HTML(fig.to_html(include_plotlyjs='cdn'))

Random Forest produces smoother, more stable decision boundaries
| Model | Accuracy | Precision | Recall | F1-Score | Interpretability | Training Time |
|---|---|---|---|---|---|---|
| Decision Tree | Medium | Medium | Medium | Medium | High | Fast |
| Random Forest | High | High | High | High | Low | Slower |
High accuracy on most problems
Reduces overfitting compared to single trees
Handles missing values well
No feature scaling required
Feature importance built-in
OOB error for free validation
Robust to outliers and noise
Easy to parallelize
Less interpretable than single trees
Slower training than single tree
Slower prediction than single tree
Large memory footprint (stores many trees)
Can still overfit very noisy data, though adding more trees does not by itself cause overfitting
Biased toward features with more categories
Not good for extrapolation (predicting outside training range)
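The extrapolation limitation is easy to demonstrate: a forest trained on \(y = 2x\) for \(x \in [0, 10]\) can only predict averages of leaf values it saw during training (synthetic data, illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Noiseless y = 2x on x in [0, 10] (synthetic, illustrative)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Inside the training range the fit is fine; outside it, predictions
# flatten out at the leaf values seen in training
print(rf.predict([[5.0]]))   # close to 10
print(rf.predict([[20.0]]))  # stuck near 20 (the max training y), not 40
```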
You need high accuracy without much tuning
You have tabular data with mixed types
You want feature importance insights
Interpretability is not critical
You have enough computational resources
Your data has complex interactions
You need full interpretability (use single tree)
You need fast predictions in production
You have very high-dimensional sparse data (use linear models or boosting)
Memory is severely constrained
You need to extrapolate beyond training data
You have very small datasets (< 100 samples)
Key hyperparameters: n_estimators (more is better), max_features (\(\sqrt{p}\) for classification)
Instinct Institute
Mork Mongkul