K-Nearest Neighbors (K-NN)

AI Bootcamp


Mork Mongkul

K-Nearest Neighbors

K-Nearest Neighbors (KNN) Algorithm

Motivation and Introduction

Linear Regression is for regression problems.

Logistic Regression is for classification problems.

Input-Output Data in Supervised Learning

• Both models assume an explicit input–output formula (a functional form).
• Is there a model that

  • works for both classification & regression?
  • DOESN’T assume any input–output formula for prediction?


K-Nearest Neighbors (KNN) Algorithm

Introduction

  • Models that DO NOT assume any input–output form for prediction are known as Non-parametric models.
  • In this case, the prediction is based on two main ideas:
    • Measuring the similarity between the query input \(\mathbf{x}\) and the training inputs \(\mathbf{x}_i\).
    • Using the outputs \(y_i\) of those similar points to predict the output \(y\) of the query point.
  • We use this idea all the time:
    • “You are the average of the five people you spend the most time with.” (Jim Rohn)
    • “The sky is so dark, it’s going to be raining!”
       x1          x2   y
-0.752759    2.704286   1
 1.935603   -0.838856   0
-0.546282   -1.960234   0
 0.952162   -2.022393   0
-0.955179    2.584544   1
-2.458261    2.011815   1
 2.449595   -1.562629   0
 1.065386   -2.900473   0
-0.793301    0.793835   1
 2.015881    1.175845   0
-0.016509   -1.194730   0

K-Nearest Neighbors (KNN) Algorithm

Euclidean Distance

  • The core idea of some Non-parametric models is using the outputs of similar data points to predict any query point.
  • But what do similar and different actually mean?
  • In ML, we often use distances to measure how different data points are.
  • The most common distance is the Euclidean one:
    • For example: \(A = (1,\ 3,\ 4)\) and \(B = (-1,\ 2,\ 5)\) then

\[D(A, B) = \sqrt{(1-(-1))^2 + (3-2)^2 + (4-5)^2} = \sqrt{6} \approx 2.45 \text{ (units)}.\]

  • For two input vectors \(\mathbf{x} = (x_1, x_2, \ldots, x_d)\) and \(\mathbf{x}' = (x'_1, x'_2, \ldots, x'_d)\), the Euclidean distance between them is given by

\[D(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}.\]

Smaller distance ⇔ closer points ⇔ more similar data.
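The 3-D example above can be checked numerically; a minimal sketch using NumPy:

```python
import numpy as np

# The two example points A and B from the slide
A = np.array([1.0, 3.0, 4.0])
B = np.array([-1.0, 2.0, 5.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((A - B) ** 2))

# Equivalent one-liner using NumPy's built-in norm
assert np.isclose(distance, np.linalg.norm(A - B))
print(distance)  # sqrt(6) ≈ 2.449
```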

       x1          x2   y
-0.752759    2.704286   1
 1.935603   -0.838856   0
-0.546282   -1.960234   0
 0.952162   -2.022393   0
-0.955179    2.584544   1
  • Can you identify the most similar point to the first point based on its input?
  • What’s the label of that nearest point?
  • Assume that you know the labels of all the points except the first one.
  • How would you guess its label?
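The questions above can be answered directly in code; a minimal sketch using the five rows shown in the table:

```python
import numpy as np

# The five (x1, x2) inputs and labels y from the table above
X = np.array([[-0.752759,  2.704286],
              [ 1.935603, -0.838856],
              [-0.546282, -1.960234],
              [ 0.952162, -2.022393],
              [-0.955179,  2.584544]])
y = np.array([1, 0, 0, 0, 1])

# Euclidean distances from the first point to all the other points
dists = np.linalg.norm(X[1:] - X[0], axis=1)
nearest = 1 + np.argmin(dists)  # index of the closest point
print(nearest, y[nearest])      # the 5th row (index 4), label 1
```

The nearest point has label 1, which is a reasonable guess for the label of the first point.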

K-Nearest Neighbors (K-NN)

  • Given the training data: \(\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} \subset \mathbb{R}^d \times \mathcal{Y}\).
  • If \(D\) is a distance on \(\mathbb{R}^d\) (e.g. Euclidean distance), \(\mathbf{x}_{(k)}\) is called the \(k\)-th nearest neighbor of \(\mathbf{x} \in \mathbb{R}^d\) if its distance to \(\mathbf{x}\) ranks \(k\)-th among all the input points, i.e.,
  • \(D(\mathbf{x}, \mathbf{x}_{(1)}) \leq D(\mathbf{x}, \mathbf{x}_{(2)}) \leq \cdots \leq D(\mathbf{x}, \mathbf{x}_{(k-1)}) \leq D(\mathbf{x}, \mathbf{x}_{(k)}) \leq \cdots \leq D(\mathbf{x}, \mathbf{x}_{(n)})\).
  • Let \(y_{(1)}, \ldots, y_{(n)}\) be the target of \(\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(n)}\) respectively.
  • If \(k \geq 1\), then \(k\)-NN predicts the target of an input \(\mathbf{x}\) by

• Regression:

\(\hat{y} = \frac{1}{k} \sum_{j=1}^{k} y_{(j)}\)

\(= \text{Average } y_{(j)} \text{ among the } k \text{ neighbors.}\)

\(= \text{The predicted value.}\)

• Classification with M classes:

\(\hat{y} = \arg\max_{1 \leq m \leq M} \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}_{\{y_{(j)}=m\}}\)

\(= \text{Majority group among the } k \text{ neighbors.}\)

\(= \text{The predicted class.}\)

K-Nearest Neighbors (K-NN)

Example
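The two prediction rules above can be implemented from scratch in a few lines; a minimal sketch, with function and variable names of our own choosing:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k, task="classification"):
    """Predict the target of x_query from its k nearest neighbors."""
    # Euclidean distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nn_idx = np.argsort(dists)[:k]
    neighbors_y = y_train[nn_idx]
    if task == "regression":
        # Average of the neighbors' targets
        return neighbors_y.mean()
    # Majority vote among the neighbors
    values, counts = np.unique(neighbors_y, return_counts=True)
    return values[np.argmax(counts)]

# Toy check: a query point near the class-1 cluster
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5.2, 5.1]), k=3))  # majority label -> 1
```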

K-Nearest Neighbors (K-NN)

Influence of K

[Figure: k-NN fits for different values of k, regression vs classification]

  • Too large k ⇔ using many points ⇔ too inflexible ⇒ Underfitting.
  • Too small k ⇔ using few points ⇔ too flexible ⇒ Overfitting.
  • How to choose a good k?
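The overfitting end of this trade-off can be seen on the training score itself; a sketch on synthetic data, where `make_classification` and the seed are our own choices:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# k = 1: each training point is its own nearest neighbor -> perfect training score
knn_small = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn_small.score(X, y))  # 1.0 (the model memorizes the training set)

# Large k: the model averages over many points and becomes less flexible
knn_large = KNeighborsClassifier(n_neighbors=100).fit(X, y)
print(knn_large.score(X, y))  # below 1.0
```

A perfect training score for k = 1 says nothing about generalization, which is exactly why we need a proper data-splitting scheme to choose k.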

K-Nearest Neighbors (K-NN)

Fine-Tune K

Data Splitting: Train/Validation/Test

  • A good model is the one that can generalize/predict new unseen data.
  • The first attempt: splitting the data into 3 parts.
Set          Common %   Purpose
Train        60%-70%    Train the model
Validation   15%-20%    Tune hyperparameters (e.g. k)
Test         15%-20%    Evaluate final model performance

[Figure: train/validation/test split]

  • In this case, the best k is the one achieving the best performance on the Validation set.
  • The final performance is measured using the Test set.
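A three-way split can be obtained by calling `train_test_split` twice; a sketch with illustrative proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split off the test set (20% of the whole), then carve the
# validation set (20% of the whole = 25% of the remainder) out of the rest
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```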

K-Nearest Neighbors (K-NN)

Fine-Tune K

Data Splitting: K-Fold Cross-Validation

[Figure: K-fold cross-validation scheme]
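With K-fold cross-validation, every point is used for validation exactly once; `cross_val_score` automates the whole loop. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.shape)   # (5,) -- one validation score per fold
print(scores.mean())  # average cross-validation accuracy
```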

K-Nearest Neighbors (K-NN)

Performance Metrics

  • Selecting the best k depends not only on the splitting scheme, but also on the performance metric, which defines what “best” means.
  • What’s a performance metric?
  • It’s a value that measures the quality of a model when used to predict new unseen observations.
  • Metrics are divided into two main types:
    • Score: larger ⇒ better model.
    • Error: smaller ⇒ better model.
  • ⚠️ Do not confuse:
    • Metric: for fine-tuning the key hyperparameters of the model (measured on validation or test data).
      • Example: \(R^2\), Adjusted \(R^2\), Accuracy…
    • Loss: for training the model; computed on the training data.
      • Example: Mean Squared Error (MSE), Mean Absolute Error (MAE)…

K-Nearest Neighbors (K-NN)

Regression Metrics

  • These are some common metrics in regression problems.

• Mean Squared Error (MSE):

\(\text{MSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2.\)

• Mean Absolute Error (MAE):

\(\text{MAE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} |\color{blue}{y_i} - \color{red}{\hat{y}_i}|.\)

• Root Mean Squared Error (RMSE):

\(\text{RMSE} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2}.\)

• Mean Absolute Percentage Error (MAPE):

\(\text{MAPE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left|\frac{\color{blue}{y_i} - \color{red}{\hat{y}_i}}{\color{blue}{y_i}}\right|.\)

• Coefficient of Determination: \(R^2 = 1 - \frac{\sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2}{\sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \bar{y})^2}.\)
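These formulas map directly onto `sklearn.metrics` (MAPE is computed by hand here); a minimal sketch with toy values of our own:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy targets and predictions: every error is +/- 1
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 8.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # square root of the MSE
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # mean absolute % error
r2 = r2_score(y_true, y_pred)              # coefficient of determination

print(mse, mae, rmse)  # 1.0 1.0 1.0
print(r2)              # 0.8
```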

K-Nearest Neighbors (K-NN)

Classification Metrics

  • These are some common metrics in classification problems.

• Misclassification Error (ME):

\(\text{ME} = \frac{\#\{i : \color{blue}{y_i} \neq \color{red}{\hat{y}_i}\}}{n_{\text{test}}}.\)

• Precision:

\(\text{Precision} = \frac{\text{True Positive}}{\text{All Positive Prediction}}.\)

• Accuracy:

\(\text{Accuracy} = \frac{\#\{i : \color{blue}{y_i} = \color{red}{\hat{y}_i}\}}{n_{\text{test}}}.\)

• Recall:

\(\text{Recall} = \frac{\text{True Positive}}{\text{All Positive Labels}}.\)

• F1-score: \(\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.\)
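Likewise for the classification metrics; a sketch with toy labels chosen so the confusion counts are easy to read off:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
# Confusion counts: TP = 2, FP = 1, FN = 1, TN = 2

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / n        = 4/6
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)       = 2/3
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)       = 2/3
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```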

K-Nearest Neighbors (K-NN)

Classification Application

  • Let’s work with our Heart Disease Dataset (shape: \(1025 \times 14\)) and choose \(K = 5\).
  • Since K-NN is a distance-based method, it is essential to
    • scale the inputs,
    • watch out for outliers and missing values,
    • encode categorical inputs,
    • watch out for the effect of imbalanced classes.
▸ Code
import numpy as np
import pandas as pd
data = pd.read_csv("../Data/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# One-hot encoding and scaling
# (caveat: calling get_dummies separately on train and test can misalign
#  columns if a category is absent from one split; sklearn's OneHotEncoder
#  fitted on the training set avoids this)
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=["5NN"])
test_perf
     Accuracy  Precision    Recall  F1-score
5NN  0.868293   0.861111  0.885714  0.873239
  • Q: Can we do better?
  • A: Yes! \(K = 5\) is arbitrary. We should fine-tune it!

K-Nearest Neighbors (K-NN)

Fine-Tuning k: Cross-Validation

  • There are many ways to perform cross-validation in Python.
  • Let’s use GridSearchCV from sklearn.model_selection module.
▸ Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# Search k in {1, ..., 19} with 10-fold CV, scored by F1
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(1, 20))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf
     Accuracy  Precision    Recall  F1-score
5NN  0.868293   0.861111  0.885714  0.873239
1NN  1.000000   1.000000  1.000000  1.000000

K-Nearest Neighbors (K-NN)

What Can Go Wrong?

It doesn't work... why?

  • The perfect 1-NN score is an artifact of duplicated data: identical rows appear in both the training and test sets, so each test point finds an exact copy of itself as its nearest neighbor.
  • Let’s remove the duplicates and try again.
▸ Code
# Remove duplicates
data_no_dup = data.drop_duplicates()

# Re-split the data
X_no_dup, y_no_dup = data_no_dup.iloc[:,:-1], data_no_dup.iloc[:,-1]
X_train_nd, X_test_nd, y_train_nd, y_test_nd = train_test_split(X_no_dup, y_no_dup, test_size=0.2, stratify=y_no_dup, random_state=42)

# Re-encode and scale
X_train_cat_nd = pd.get_dummies(X_train_nd.select_dtypes(include="category"), drop_first=True)
X_train_encoded_nd = scaler.fit_transform(np.column_stack([X_train_nd.select_dtypes(include="number").to_numpy(), X_train_cat_nd]))
X_test_cat_nd = pd.get_dummies(X_test_nd.select_dtypes(include="category"), drop_first=True)
X_test_encoded_nd = scaler.transform(np.column_stack([X_test_nd.select_dtypes(include="number").to_numpy(), X_test_cat_nd]))

# GridSearch again
grid_search_nd = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search_nd.fit(X_train_encoded_nd, y_train_nd)

# Train and predict with optimal K
knn_nd = KNeighborsClassifier(n_neighbors=grid_search_nd.best_params_['n_neighbors'])
knn_nd = knn_nd.fit(X_train_encoded_nd, y_train_nd)
y_pred_nd = knn_nd.predict(X_test_encoded_nd)

# Add results to table
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test_nd, y_pred_nd),
          'Precision': precision_score(y_test_nd, y_pred_nd),
          'Recall': recall_score(y_test_nd, y_pred_nd),
          'F1-score': f1_score(y_test_nd, y_pred_nd)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search_nd.best_params_['n_neighbors']}NN_No_Dup"])], axis=0)
test_perf
             Accuracy  Precision    Recall  F1-score
5NN          0.868293   0.861111  0.885714  0.873239
1NN          1.000000   1.000000  1.000000  1.000000
16NN_No_Dup  0.885246   0.882353  0.909091  0.895522

K-Nearest Neighbors (\(k\)-NN)

Summary

  • \(k\)-NN predicts the label/value of a new point by looking at the \(k\) closest neighbors of the point.
  • Data preprocessing is essential: scaling, encoding categorical inputs, and handling outliers and duplicates.
  • The key parameter \(k\) can be tuned using cross-validation technique.
  • \(k\)-NN may not be suitable in high-dimensional settings due to the Curse of Dimensionality. However, we can try:
    • Feature selection
    • Dimensionality reduction
    • Alternative distance metrics
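For the last point, `KNeighborsClassifier` accepts a `metric` argument, so trying another distance (e.g. Manhattan) is a one-line change; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10 input features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Manhattan (L1) distance instead of the default Euclidean (L2);
# L1 is sometimes preferred in higher-dimensional spaces
knn_l1 = KNeighborsClassifier(n_neighbors=5, metric="manhattan").fit(X, y)
print(knn_l1.predict(X[:3]))  # class predictions for the first 3 points
```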

Questions?

Instinct Institute

Mork Mongkul