Random Forest

AI Bootcamp


Mork Mongkul

Motivation

  • Decision Trees are flexible and interpretable

  • But they suffer from high variance

  • A small change in data can result in a very different tree

  • This leads to overfitting

Key question:

  • Can we stabilize trees without losing their power?

One tree = unstable learner

Random Forest

What is Random Forest?

  • Random Forest is an ensemble method

  • It combines many Decision Trees

  • Each tree is trained on:
    • A different bootstrap sample of the data
    • A random subset of features at each split
  • Final prediction is obtained by aggregation

Random Forest Intuition

Single Decision Tree

  • Low bias
  • High variance
  • Sensitive to noise

Random Forest

  • Low bias
  • Lower variance
  • Stable predictions

Many deep, weakly correlated trees combine into a strong, stable predictor

How Random Forest Works

  • Let \(B\) be the number of trees

  • For \(b = 1, \dots, B\):

    1. Draw a bootstrap sample from the training data
    2. Grow a Decision Tree \(\hat{f}_b(\mathbf{x})\) on that sample:
      • At each split, randomly select \(m\) of the \(p\) features
    3. Grow the tree to full depth (or nearly full)
  • Aggregate predictions:

Classification: \[\hat{y} = \text{majority vote}\{\hat{f}_1(\mathbf{x}), \ldots, \hat{f}_B(\mathbf{x})\}\]

Regression: \[\hat{y} = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(\mathbf{x})\]
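The loop above can be sketched directly with scikit-learn trees as the base learners. This is a minimal from-scratch illustration on synthetic data, not the lecture's code; all names here are ours:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

B = 25                          # number of trees
m = int(np.sqrt(X.shape[1]))    # features per split: sqrt(p) for classification

trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # step 1: bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features=m, random_state=b)
    tree.fit(X[idx], y[idx])                     # steps 2-3: grow to full depth (default)
    trees.append(tree)

# Aggregate: majority vote across the B trees (labels are 0/1 here)
votes = np.stack([t.predict(X) for t in trees])  # shape (B, n)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (y_hat == y).mean())
```

For a regression target, the last step would simply average the \(B\) predictions instead of voting.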

Bootstrap Sampling (Bagging)

  • Bootstrap = sampling with replacement

  • Each tree sees a different version of the dataset

  • About 63% of the unique observations appear in each tree's sample

  • The remaining ~37% are Out-of-Bag (OOB) samples

  • Reduces variance by averaging

Bootstrap Sampling: Example

Original Training Set (n=10):

Index   Feature 1   Feature 2   Target
1       2.3         1.5         0
2       1.8         2.1         1
…       …           …           …
10      3.1         0.8         1

Bootstrap Sample 1 (sampled with replacement):

  • Indices: [1, 3, 3, 5, 7, 7, 7, 9, 10, 2]
  • Observations 3 and 7 appear multiple times
  • Observations 4, 6, 8 are OOB for Tree 1

Bootstrap Sample 2 (a different sample):

  • Indices: [2, 2, 4, 5, 6, 8, 8, 9, 9, 10]
  • Different observations selected
  • Observations 1, 3, 7 are OOB for Tree 2

Key Point: each tree trains on a different ~63% of the observations, creating diversity!
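The ~63% figure follows from the probability that a given observation lands in a bootstrap sample of size \(n\): \(1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632\). A quick simulation (illustrative, not part of the lecture code) confirms it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
fractions = []
for _ in range(200):
    sample = rng.integers(0, n, size=n)            # bootstrap: n draws with replacement
    fractions.append(np.unique(sample).size / n)   # fraction of distinct observations in-bag
print(f"mean in-bag fraction: {np.mean(fractions):.3f}")
print(f"theory 1 - 1/e:       {1 - 1/np.e:.3f}")
```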

Feature Randomness

  • At each split, only a subset of features is considered

  • This prevents strong predictors from dominating all trees

  • Trees become less correlated

  • Variance of the ensemble decreases

Typical choices:

  • Classification: \(m = \sqrt{p}\)
  • Regression: \(m = p/3\)
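In scikit-learn, these conventions map onto the max_features parameter: "sqrt" is the classifier default, and a float is interpreted as a fraction of \(p\). A small check on synthetic data (illustrative values):

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification with p = 16 features: m = sqrt(16) = 4 per split
Xc, yc = make_classification(n_samples=200, n_features=16, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0).fit(Xc, yc)

# Regression with p = 15 features: m = 15/3 = 5 per split
Xr, yr = make_regression(n_samples=200, n_features=15, random_state=0)
reg = RandomForestRegressor(n_estimators=10, max_features=1/3, random_state=0).fit(Xr, yr)

# Each fitted tree exposes the resolved value as max_features_
print(clf.estimators_[0].max_features_)  # 4
print(reg.estimators_[0].max_features_)  # 5
```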

Why Does Random Forest Work?

Variance Reduction Through Averaging

  • Assume we have \(B\) independent models, each with variance \(\sigma^2\)

  • The variance of the average is: \[\text{Var}\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b\right) = \frac{\sigma^2}{B}\]

  • Averaging reduces variance!

  • In practice, trees are correlated with correlation \(\rho\): \[\text{Var}(\text{avg}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]

  • Bootstrap + Feature randomness \(\Rightarrow\) reduces \(\rho\) \(\Rightarrow\) reduces overall variance

  • As \(B \to \infty\): variance approaches \(\rho\sigma^2\) (can’t eliminate correlated part)
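The variance formula can be checked numerically. A common-factor construction gives \(B\) unit-variance estimators with pairwise correlation \(\rho\) (a sketch with \(\sigma^2 = 1\), not part of the lecture code):

```python
import numpy as np

B, rho = 50, 0.3
n = 100_000
rng = np.random.default_rng(0)

# Common-factor construction: each column has variance 1 and
# pairwise correlation rho with every other column.
Z = rng.standard_normal((n, 1))              # shared component
E = rng.standard_normal((n, B))              # independent components
X = np.sqrt(rho) * Z + np.sqrt(1 - rho) * E

avg_var = X.mean(axis=1).var()               # empirical Var of the average of B estimators
theory = rho + (1 - rho) / B                 # rho*sigma^2 + (1 - rho)/B * sigma^2
print(f"simulated: {avg_var:.4f}   theory: {theory:.4f}")
```

With \(\rho = 0.3\) and \(B = 50\), both come out near \(0.314\): far below \(\sigma^2 = 1\), but bounded below by \(\rho\sigma^2 = 0.3\) no matter how large \(B\) grows.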

Bias–Variance Tradeoff

Decision Tree

  • Low bias
  • High variance

Random Forest

  • Low bias
  • Lower variance

Random Forest improves generalization

Out-of-Bag (OOB) Error Estimation

  • Each tree uses ~63% of data for training

  • The remaining ~37% are Out-of-Bag (OOB) samples

  • For each observation \(i\):
    • Find all trees where \(i\) was not used for training
    • Use those trees to predict \(i\)
  • OOB Error = Average prediction error on OOB samples

  • Free cross-validation! No need for separate validation set

OOB error closely approximates test error!
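A quick illustration on synthetic data, assuming scikit-learn: with oob_score=True, the forest scores each observation using only the trees that did not train on it, and the result typically tracks held-out accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True: each training observation is predicted by its OOB trees only
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```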

Hyperparameters of Random Forest

  • n_estimators – number of trees (more is usually better, but diminishing returns)

  • max_features – features per split (\(\sqrt{p}\) for classification, \(p/3\) for regression)

  • max_depth – maximum tree depth (usually left unlimited)

  • min_samples_leaf – minimum samples required at leaf node

  • min_samples_split – minimum samples required to split

  • bootstrap – whether to use bootstrap sampling (default: True)

Trees are deep; overfitting is controlled by averaging

Feature Importance in Random Forest

  • Random Forest provides feature importance scores

  • Measured by:
    • Total reduction in impurity (Gini/Entropy) from splits using that feature
    • Averaged across all trees
  • Formula: \[\text{Importance}(X_j) = \frac{1}{B}\sum_{b=1}^B \sum_{t \in T_b} \Delta i(t, X_j)\] where \(\Delta i(t, X_j)\) is impurity decrease at split \(t\) using feature \(X_j\)

  • Higher importance = more useful for prediction

Random Forest in Action: Heart Disease Dataset

  • Same dataset as Decision Tree
  • Same preprocessing
  • Same evaluation protocol
Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv('../Data/heart.csv')

# Prepare features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42, oob_score=True)

param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 5]
}

grid = GridSearchCV(
    rf,
    param_grid=param_grid,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)

grid.fit(X_train, y_train)
best_rf = grid.best_estimator_
print(f"Best hyperparameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"OOB Score: {best_rf.oob_score_:.4f}")
best_rf
Best hyperparameters: {'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV accuracy: 0.9902
OOB Score: 0.9951
RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

Model Evaluation Results

Code
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier

# Random Forest predictions
y_pred_rf = best_rf.predict(X_test)

# Train Decision Tree for comparison
dt = DecisionTreeClassifier(random_state=42, max_depth=5)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Create comparison table
results = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_rf)
    ],
    'Precision': [
        precision_score(y_test, y_pred_dt),
        precision_score(y_test, y_pred_rf)
    ],
    'Recall': [
        recall_score(y_test, y_pred_dt),
        recall_score(y_test, y_pred_rf)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_dt),
        f1_score(y_test, y_pred_rf)
    ]
})

results.style.format({
    'Accuracy': '{:.4f}',
    'Precision': '{:.4f}',
    'Recall': '{:.4f}',
    'F1-Score': '{:.4f}'
}).background_gradient(cmap='Blues', subset=['Accuracy', 'Precision', 'Recall', 'F1-Score'])
Model           Accuracy   Precision   Recall   F1-Score
Decision Tree   0.8439     0.7840      0.9515   0.8596
Random Forest   0.9854     1.0000      0.9709   0.9852

Feature Importance Analysis

Code
import plotly.graph_objects as go
from IPython.display import HTML

# Get feature importances
feature_names = df.drop('target', axis=1).columns.tolist()
importances = best_rf.feature_importances_

# Sort by importance
sorted_idx = np.argsort(importances)
features_sorted = [feature_names[i] for i in sorted_idx]
importances_sorted = [importances[i] for i in sorted_idx]

fig = go.Figure()

fig.add_trace(go.Bar(
    x=importances_sorted,
    y=features_sorted,
    orientation='h',
    marker=dict(
        color=importances_sorted,
        colorscale='Blues',
        showscale=True,
        colorbar=dict(title='Importance')
    ),
    text=[f'{imp:.3f}' for imp in importances_sorted],
    textposition='outside'
))

fig.update_layout(
    title='Feature Importance - Heart Disease Prediction',
    xaxis_title='Importance Score',
    yaxis_title='Feature',
    width=800, height=500,
    margin=dict(l=0, r=0, t=40, b=0),
    showlegend=False
)

HTML(fig.to_html(include_plotlyjs='cdn'))

Decision Tree vs Random Forest: Visual Comparison

Random Forest produces smoother, more stable decision boundaries

Performance Comparison

Model           Accuracy   Precision   Recall   F1-Score   Interpretability   Training Time
Decision Tree   Medium     Medium      Medium   Medium     High               Fast
Random Forest   High       High        High     High       Low                Slower

Random Forest: Pros and Cons

Advantages

  • High accuracy on most problems

  • Reduces overfitting compared to single trees

  • Handles missing values well (implementation-dependent; scikit-learn may require imputation)

  • No feature scaling required

  • Feature importance built-in

  • OOB error for free validation

  • Robust to outliers and noise

  • Easy to parallelize

Disadvantages

  • Less interpretable than single trees

  • Slower training than single tree

  • Slower prediction than single tree

  • Large memory footprint (stores many trees)

  • Can still overfit very noisy data (though adding more trees does not by itself cause overfitting)

  • Biased toward features with more categories

  • Not good for extrapolation (predicting outside training range)

When to Use Random Forest?

Use Random Forest when:

  • You need high accuracy without much tuning

  • You have tabular data with mixed types

  • You want feature importance insights

  • Interpretability is not critical

  • You have enough computational resources

  • Your data has complex interactions

Avoid Random Forest when:

  • You need full interpretability (use single tree)

  • You need fast predictions in production

  • You have very high-dimensional sparse data (use linear models or boosting)

  • Memory is severely constrained

  • You need to extrapolate beyond training data

  • You have very small datasets (< 100 samples)

Summary

  • Random Forest is an ensemble of CART models using bagging (bootstrap aggregating)
  • Uses bootstrap sampling + feature randomness to create diverse trees
  • Reduces variance without increasing bias through averaging
  • OOB error provides free cross-validation estimate
  • Feature importance helps understand which variables matter most
  • Excellent default choice for tabular data
  • Trade-off: sacrifices interpretability for superior performance
  • Key hyperparameters: n_estimators (more is usually better) and max_features (\(\sqrt{p}\) for classification)

Questions?

Instinct Institute

Mork Mongkul