AI Bootcamp
Regression: \[\begin{align*}\color{blue}{\hat{y}}&=\frac{1}{k}\sum_{j=1}^k\color{red}{y_{(j)}}\\ &=\text{Average $\color{red}{y_{(j)}}$ among the $k$ neighbors}.\end{align*}\]
Classification with \(M\) classes: \[\begin{align*}\color{blue}{\hat{y}}&=\arg\max_{1\leq m\leq M}\frac{1}{k}\sum_{j=1}^k\mathbb{1}_{\{\color{red}{y_{(j)}}=m\}}\\ &=\text{Majority group among the $k$ neighbors.}\end{align*}\]
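As a quick sketch, the two prediction rules above can be written directly with NumPy. The training set, query point, and \(k\) below are made-up values for illustration only:

```python
import numpy as np

# Toy training set (assumed for illustration)
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([1.0, 2.0, 2.5, 4.0, 5.0])   # regression targets
y_clf = np.array([0, 0, 1, 1, 1])             # class labels

def knn_predict(x, k=3):
    # Euclidean distances to every training point
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]                   # indices of the k nearest neighbors
    y_hat_reg = y_reg[idx].mean()             # regression: average of neighbor targets
    # Classification: majority vote among neighbor labels
    vals, counts = np.unique(y_clf[idx], return_counts=True)
    y_hat_clf = vals[np.argmax(counts)]
    return y_hat_reg, y_hat_clf

print(knn_predict(np.array([1.8]), k=3))      # regression approx 2.83, class 1
```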
\(k\)-NN defines neighbors by the Euclidean distance between two points.
The leading question behind the development of decision tree methods is:
Is there another way to define a neighbor?
| | x1 | x2 | y |
|---|---|---|---|
| 0 | 1.18 | -1.28 | 0 |
| 1 | -1.64 | 0.31 | 0 |
| 2 | 1.32 | -0.46 | 1 |
| 3 | 2.88 | 1.11 | 1 |
| 4 | -0.11 | -0.65 | 0 |
| 5 | -0.94 | 1.37 | 1 |
| 6 | -0.37 | -2.64 | 0 |
| 7 | -0.61 | 1.43 | 1 |
| 8 | -1.91 | -1.95 | 0 |
| 9 | 0.19 | 0.19 | 1 |
| 10 | 0.81 | 2.10 | 1 |
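For illustration, the toy table above can be loaded into a pandas DataFrame and a shallow scikit-learn tree fit to it (`max_depth=2` is an arbitrary choice, not prescribed here):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy dataset from the table above
df = pd.DataFrame({
    'x1': [1.18, -1.64, 1.32, 2.88, -0.11, -0.94, -0.37, -0.61, -1.91, 0.19, 0.81],
    'x2': [-1.28, 0.31, -0.46, 1.11, -0.65, 1.37, -2.64, 1.43, -1.95, 0.19, 2.10],
    'y':  [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# A shallow tree: at most two levels of splits
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(df[['x1', 'x2']], df['y'])
print(tree.score(df[['x1', 'x2']], df['y']))   # training accuracy
```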
Start at root (no split yet).
Recursively split into smaller regions.
Stop when a stopping criterion is met.
Regions \(\color{blue}{\Rightarrow}\) neighbors \(\color{blue}{\Rightarrow}\) prediction.
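The steps above can be sketched as a recursive function. This is a minimal greedy CART-style splitter; the Gini criterion, the depth cap as a stopping criterion, and all names are illustrative assumptions:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def grow(X, y, depth=0, max_depth=2):
    # Stopping criterion: depth limit reached or region already pure
    if depth == max_depth or len(np.unique(y)) == 1:
        vals, counts = np.unique(y, return_counts=True)
        return {'leaf': vals[np.argmax(counts)]}   # predict the majority class
    best = None
    for j in range(X.shape[1]):                    # try each column X_j
        for a in np.unique(X[:, j]):               # candidate thresholds a
            left = X[:, j] <= a
            if left.all() or (~left).all():        # skip degenerate splits
                continue
            # Weighted impurity of the two subregions
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, a, left)
    if best is None:                               # no valid split: make a leaf
        vals, counts = np.unique(y, return_counts=True)
        return {'leaf': vals[np.argmax(counts)]}
    _, j, a, left = best
    return {'feature': j, 'threshold': a,
            'left': grow(X[left], y[left], depth + 1, max_depth),
            'right': grow(X[~left], y[~left], depth + 1, max_depth)}
```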
We try splitting on column \(\color{red}{X_j}\) at a threshold \(\color{red}{a}\in\mathbb{R}\), partitioning the region into two subregions \(R_1\) and \(R_2\).
We choose the split \(\color{red}{X_j}\) at \(\color{red}{a}\) so that \(R_1\) and \(R_2\) are as pure as possible.
Regression: Within-region variation \(\sum_{y\in R_1}(y-\overline{y}_1)^2+\sum_{y\in R_2}(y-\overline{y}_2)^2.\)
Misclassification error \(=1-\hat{p}_{k^*}\) where \(k^*\) is the majority class.
Gini impurity \(=\sum_{k=1}^M\hat{p}_{k}(1-\hat{p}_{k})\).
Entropy \(=-\sum_{k=1}^M\hat{p}_{k}\log(\hat{p}_{k})\) where \(\hat{p}_{k}\) is the proportion of class \(k\) in region \(R\).
The smaller the impurity \(\Leftrightarrow\) the purer the regions!
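A minimal sketch of the three impurity measures (natural log for the entropy, matching the worked example that follows):

```python
import numpy as np

def proportions(y):
    # Class proportions p_k within a region
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification(y):
    return 1 - proportions(y).max()           # 1 - p of the majority class

def gini(y):
    p = proportions(y)
    return float(np.sum(p * (1 - p)))

def entropy(y):
    p = proportions(y)
    return float(-np.sum(p * np.log(p)))      # natural log

labels = [0] * 16 + [1] * 3                   # 16 vs. 3, as in region R2 below
print(round(entropy(labels), 3))              # 0.436
```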
\(\text{En}(R_1)=-1\log(1)=0\)
\(\begin{align*}\text{En}(R_2)&=-\color{blue}{16/19\log(16/19)}-\color{red}{3/19\log(3/19)}\\ &=0.436.\end{align*}\)
\(\text{En}_1=\frac{11}{30}(0)+\frac{19}{30}(0.436)=0.276.\)
Information gain: \(\text{En}_0-\text{En}_1.\)
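The entropy numbers above can be checked numerically (region compositions 11 vs. 16+3 taken from the example):

```python
import numpy as np

def entropy(labels):
    # Entropy with natural log, as in the worked example
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

R1 = [1] * 11                  # pure region: entropy 0
R2 = [0] * 16 + [1] * 3        # 16 vs. 3: entropy approx 0.436
n = len(R1) + len(R2)
# Weighted entropy after the split
En_1 = (len(R1) * entropy(R1) + len(R2) * entropy(R2)) / n
print(round(En_1, 3))          # 0.276
```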
Prediction rule:
Regression: \(\color{blue}{\hat{y}}=\) average targets within the same block.
Classification: \(\color{blue}{\hat{y}}=\) majority vote among points within the same block.
Hyperparameters:
max_depth
max_features
min_samples_split
min_samples_leaf
criterion (the impurity measure used for splitting)
Deep trees \(\Leftrightarrow\) fewer neighbors per leaf \(\Rightarrow\) overfitting.
Smaller leaves are analogous to a smaller \(k\) in \(k\)-NN.
These hyperparameters should be tuned using cross-validation (CV) to optimize performance.
GridSearchCV with \(K=10\) folds to search over the hyperparameters criterion, min_samples_leaf, and max_features:
import pandas as pd
import numpy as np
path = "../Data"
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']
# Convert to correct types
for i in quan_vars:
data[i] = data[i].astype('float')
for i in qual_vars:
data[i] = data[i].astype('category')
# Train test split
from sklearn.model_selection import train_test_split
data_no_dup = data.drop_duplicates()
X, y = data_no_dup.iloc[:,:-1], data_no_dup.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
clf = DecisionTreeClassifier()
param_grid = {'criterion': ['gini', 'entropy'],
'min_samples_leaf': [2, 5, 10, 16, 20, 25, 30],
'max_features': ['sqrt', 'log2', 2, 5, 10, X_train.shape[1]] }
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_tr = pd.DataFrame(
data={'Accuracy': accuracy_score(y_test, y_pred),
'Precision': precision_score(y_test, y_pred),
'Recall': recall_score(y_test, y_pred),
'F1-score': f1_score(y_test, y_pred)},
columns=["Accuracy", "Precision", "Recall", "F1-score"],
index=["Tree"])
test_tr = pd.concat([test_tr, pd.DataFrame(
data={'Accuracy': 0.885246,
'Precision': 0.882353,
'Recall': 0.909091,
'F1-score': 0.909091},
columns=["Accuracy", "Precision", "Recall", "F1-score"],
index=["16-NN"])], axis=0)
print(f"Best hyperparameters: {grid_search.best_params_}")
test_tr
Best hyperparameters: {'criterion': 'gini', 'max_features': 13, 'min_samples_leaf': 5}
| | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Tree | 0.754098 | 0.800000 | 0.727273 | 0.761905 |
| 16-NN | 0.885246 | 0.882353 | 0.909091 | 0.909091 |
Instinct Institute
Mork Mongkul, Decision Trees | AI Bootcamp