Random Forest

AI Bootcamp


Mork Mongkul

Motivation

  • Decision Trees are flexible and interpretable

  • But they suffer from high variance

  • A small change in data can result in a very different tree

  • This leads to overfitting

Key question:

  • Can we stabilize trees without losing their power?

One tree = unstable learner

Random Forest

What is Random Forest?

  • Random Forest is an ensemble method

  • It combines many Decision Trees

  • Each tree is trained on:
    • A different bootstrap sample of the data
    • A random subset of features at each split
  • Final prediction is obtained by aggregation

Random Forest Intuition

Single Decision Tree

  • Low bias
  • High variance
  • Sensitive to noise

Random Forest

  • Low bias
  • Lower variance
  • Stable predictions

Many deep, weakly correlated trees combine into a strong, stable predictor

How Random Forest Works

  • Let \(B\) be the number of trees

  • For \(b = 1, \dots, B\):

    1. Draw a bootstrap sample from the training data
    2. Grow a Decision Tree \(\hat{f}_b(\mathbf{x})\) on that sample:
      • At each split, randomly select \(m\) of the \(p\) features
    3. Grow the tree to full depth (or nearly full)
  • Aggregate predictions:

Classification: \[\hat{y} = \text{majority vote}\{\hat{f}_1(\mathbf{x}), \ldots, \hat{f}_B(\mathbf{x})\}\]

Regression: \[\hat{y} = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(\mathbf{x})\]
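The loop above can be sketched directly with scikit-learn trees as the base learners. This is a minimal from-scratch illustration on synthetic data, not the lecture's code; all names here are ours:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

B = 25                          # number of trees
m = int(np.sqrt(X.shape[1]))    # features per split: sqrt(p) for classification

trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # step 1: bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features=m, random_state=b)
    tree.fit(X[idx], y[idx])                     # steps 2-3: grow to full depth (default)
    trees.append(tree)

# Aggregate: majority vote across the B trees (labels are 0/1 here)
votes = np.stack([t.predict(X) for t in trees])  # shape (B, n)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (y_hat == y).mean())
```

For a regression target, the last step would simply average the \(B\) predictions instead of voting.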

Bootstrap Sampling (Bagging)

  • Bootstrap = sampling with replacement

  • Each tree sees a different version of the dataset

  • About 63% of the unique observations appear in each tree's sample

  • The remaining ~37% are Out-of-Bag (OOB) samples

  • Reduces variance by averaging

Bootstrap Sampling: Example

Original Training Set (n=10):

Index   Feature 1   Feature 2   Target
1       2.3         1.5         0
2       1.8         2.1         1
…       …           …           …
10      3.1         0.8         1

Bootstrap Sample 1 (sampled with replacement):

  • Indices: [1, 3, 3, 5, 7, 7, 7, 9, 10, 2]
  • Observations 3 and 7 appear multiple times
  • Observations 4, 6, 8 are OOB for Tree 1

Bootstrap Sample 2 (a different sample):

  • Indices: [2, 2, 4, 5, 6, 8, 8, 9, 9, 10]
  • Different observations selected
  • Observations 1, 3, 7 are OOB for Tree 2

Key Point: each tree trains on a different ~63% of the observations, creating diversity!
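The ~63% figure follows from the probability that a given observation lands in a bootstrap sample of size \(n\): \(1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632\). A quick simulation (illustrative, not part of the lecture code) confirms it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
fractions = []
for _ in range(200):
    sample = rng.integers(0, n, size=n)            # bootstrap: n draws with replacement
    fractions.append(np.unique(sample).size / n)   # fraction of distinct observations in-bag
print(f"mean in-bag fraction: {np.mean(fractions):.3f}")
print(f"theory 1 - 1/e:       {1 - 1/np.e:.3f}")
```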

Feature Randomness

  • At each split, only a subset of features is considered

  • This prevents strong predictors from dominating all trees

  • Trees become less correlated

  • Variance of the ensemble decreases

Typical choices:

  • Classification: \(m = \sqrt{p}\)
  • Regression: \(m = p/3\)
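In scikit-learn, these conventions map onto the max_features parameter: "sqrt" is the classifier default, and a float is interpreted as a fraction of \(p\). A small check on synthetic data (illustrative values):

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification with p = 16 features: m = sqrt(16) = 4 per split
Xc, yc = make_classification(n_samples=200, n_features=16, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0).fit(Xc, yc)

# Regression with p = 15 features: m = 15/3 = 5 per split
Xr, yr = make_regression(n_samples=200, n_features=15, random_state=0)
reg = RandomForestRegressor(n_estimators=10, max_features=1/3, random_state=0).fit(Xr, yr)

# Each fitted tree exposes the resolved value as max_features_
print(clf.estimators_[0].max_features_)  # 4
print(reg.estimators_[0].max_features_)  # 5
```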

Why Does Random Forest Work?

Variance Reduction Through Averaging

  • Assume we have \(B\) independent models, each with variance \(\sigma^2\)

  • The variance of the average is: \[\text{Var}\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b\right) = \frac{\sigma^2}{B}\]

  • Averaging reduces variance!

  • In practice, trees are correlated with correlation \(\rho\): \[\text{Var}(\text{avg}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]

  • Bootstrap + Feature randomness \(\Rightarrow\) reduces \(\rho\) \(\Rightarrow\) reduces overall variance

  • As \(B \to \infty\): variance approaches \(\rho\sigma^2\) (can’t eliminate correlated part)
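The variance formula can be checked numerically. A common-factor construction gives \(B\) unit-variance estimators with pairwise correlation \(\rho\) (a sketch with \(\sigma^2 = 1\), not part of the lecture code):

```python
import numpy as np

B, rho = 50, 0.3
n = 100_000
rng = np.random.default_rng(0)

# Common-factor construction: each column has variance 1 and
# pairwise correlation rho with every other column.
Z = rng.standard_normal((n, 1))              # shared component
E = rng.standard_normal((n, B))              # independent components
X = np.sqrt(rho) * Z + np.sqrt(1 - rho) * E

avg_var = X.mean(axis=1).var()               # empirical Var of the average of B estimators
theory = rho + (1 - rho) / B                 # rho*sigma^2 + (1 - rho)/B * sigma^2
print(f"simulated: {avg_var:.4f}   theory: {theory:.4f}")
```

With \(\rho = 0.3\) and \(B = 50\), both come out near \(0.314\): far below \(\sigma^2 = 1\), but bounded below by \(\rho\sigma^2 = 0.3\) no matter how large \(B\) grows.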

Bias–Variance Tradeoff

Decision Tree

  • Low bias
  • High variance

Random Forest

  • Low bias
  • Lower variance

Random Forest improves generalization

Out-of-Bag (OOB) Error Estimation

  • Each tree uses ~63% of data for training

  • The remaining ~37% are Out-of-Bag (OOB) samples

  • For each observation \(i\):
    • Find all trees where \(i\) was not used for training
    • Use those trees to predict \(i\)
  • OOB Error = Average prediction error on OOB samples

  • Free cross-validation! No need for separate validation set

OOB error closely approximates test error!
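A quick illustration on synthetic data, assuming scikit-learn: with oob_score=True, the forest scores each observation using only the trees that did not train on it, and the result typically tracks held-out accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True: each training observation is predicted by its OOB trees only
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```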

Hyperparameters of Random Forest

  • n_estimators – number of trees (more is usually better, but diminishing returns)

  • max_features – features per split (\(\sqrt{p}\) for classification, \(p/3\) for regression)

  • max_depth – maximum tree depth (usually left unlimited)

  • min_samples_leaf – minimum samples required at leaf node

  • min_samples_split – minimum samples required to split

  • bootstrap – whether to use bootstrap sampling (default: True)

Trees are deep; overfitting is controlled by averaging

Feature Importance in Random Forest

  • Random Forest provides feature importance scores

  • Measured by:
    • Total reduction in impurity (Gini/Entropy) from splits using that feature
    • Averaged across all trees
  • Formula: \[\text{Importance}(X_j) = \frac{1}{B}\sum_{b=1}^B \sum_{t \in T_b} \Delta i(t, X_j)\] where \(\Delta i(t, X_j)\) is impurity decrease at split \(t\) using feature \(X_j\)

  • Higher importance = more useful for prediction

Random Forest in Action: Heart Disease Dataset

  • Same dataset as Decision Tree
  • Same preprocessing
  • Same evaluation protocol
Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv('../Data/heart.csv')

# Prepare features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42, oob_score=True)

param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 5]
}

grid = GridSearchCV(
    rf,
    param_grid=param_grid,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)

grid.fit(X_train, y_train)
best_rf = grid.best_estimator_
print(f"Best hyperparameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"OOB Score: {best_rf.oob_score_:.4f}")
best_rf
Best hyperparameters: {'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV accuracy: 0.9902
OOB Score: 0.9951
RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

Model Evaluation Results

Code
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier

# Random Forest predictions
y_pred_rf = best_rf.predict(X_test)

# Train Decision Tree for comparison
dt = DecisionTreeClassifier(random_state=42, max_depth=5)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Create comparison table
results = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_rf)
    ],
    'Precision': [
        precision_score(y_test, y_pred_dt),
        precision_score(y_test, y_pred_rf)
    ],
    'Recall': [
        recall_score(y_test, y_pred_dt),
        recall_score(y_test, y_pred_rf)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_dt),
        f1_score(y_test, y_pred_rf)
    ]
})

results.style.format({
    'Accuracy': '{:.4f}',
    'Precision': '{:.4f}',
    'Recall': '{:.4f}',
    'F1-Score': '{:.4f}'
}).background_gradient(cmap='Blues', subset=['Accuracy', 'Precision', 'Recall', 'F1-Score'])
Model           Accuracy   Precision   Recall   F1-Score
Decision Tree   0.8439     0.7840      0.9515   0.8596
Random Forest   0.9854     1.0000      0.9709   0.9852

Feature Importance Analysis

Code
import plotly.graph_objects as go
from IPython.display import HTML

# Get feature importances
feature_names = df.drop('target', axis=1).columns.tolist()
importances = best_rf.feature_importances_

# Sort by importance
sorted_idx = np.argsort(importances)
features_sorted = [feature_names[i] for i in sorted_idx]
importances_sorted = [importances[i] for i in sorted_idx]

fig = go.Figure()

fig.add_trace(go.Bar(
    x=importances_sorted,
    y=features_sorted,
    orientation='h',
    marker=dict(
        color=importances_sorted,
        colorscale='Blues',
        showscale=True,
        colorbar=dict(title='Importance')
    ),
    text=[f'{imp:.3f}' for imp in importances_sorted],
    textposition='outside'
))

fig.update_layout(
    title='Feature Importance - Heart Disease Prediction',
    xaxis_title='Importance Score',
    yaxis_title='Feature',
    width=800, height=500,
    margin=dict(l=0, r=0, t=40, b=0),
    showlegend=False
)

HTML(fig.to_html(include_plotlyjs='cdn'))

Decision Tree vs Random Forest: Visual Comparison

Random Forest produces smoother, more stable decision boundaries

Performance Comparison

Model           Accuracy   Precision   Recall   F1-Score   Interpretability   Training Time
Decision Tree   Medium     Medium      Medium   Medium     High               Fast
Random Forest   High       High        High     High       Low                Slower

Random Forest: Pros and Cons

Advantages

  • High accuracy on most problems

  • Reduces overfitting compared to single trees

  • Handles missing values well (implementation-dependent; scikit-learn may require imputation)

  • No feature scaling required

  • Feature importance built-in

  • OOB error for free validation

  • Robust to outliers and noise

  • Easy to parallelize

Disadvantages

  • Less interpretable than single trees

  • Slower training than single tree

  • Slower prediction than single tree

  • Large memory footprint (stores many trees)

  • Can still overfit very noisy data (though adding more trees does not by itself cause overfitting)

  • Biased toward features with more categories

  • Not good for extrapolation (predicting outside training range)

When to Use Random Forest?

Use Random Forest when:

  • You need high accuracy without much tuning

  • You have tabular data with mixed types

  • You want feature importance insights

  • Interpretability is not critical

  • You have enough computational resources

  • Your data has complex interactions

Avoid Random Forest when:

  • You need full interpretability (use single tree)

  • You need fast predictions in production

  • You have very high-dimensional sparse data (use linear models or boosting)

  • Memory is severely constrained

  • You need to extrapolate beyond training data

  • You have very small datasets (< 100 samples)

Summary

  • Random Forest is an ensemble of CART models using bagging (bootstrap aggregating)
  • Uses bootstrap sampling + feature randomness to create diverse trees
  • Reduces variance without increasing bias through averaging
  • OOB error provides free cross-validation estimate
  • Feature importance helps understand which variables matter most
  • Excellent default choice for tabular data
  • Trade-off: sacrifices interpretability for superior performance
  • Key hyperparameters: n_estimators (more is usually better) and max_features (\(\sqrt{p}\) for classification)

Questions?

Instinct Institute

Mork Mongkul