K-Nearest Neighbors (K-NN)

AI Bootcamp


Mork Mongkul

K-Nearest Neighbors

K-Nearest Neighbors (KNN) Algorithm

Motivation and Introduction

Linear Regression is for regression problems.

Logistic Regression is for classification problems.

Input-Output Data in Supervised Learning

• Both models assume an explicit input–output formula (a functional form).
• Is there a model that

  • works for both classification & regression?
  • DOESN’T assume any input–output formula for prediction?


K-Nearest Neighbors (KNN) Algorithm

Introduction

  • Models that DO NOT assume any input–output form for prediction are known as Non-parametric models.
  • In this case, the prediction is based on two main ideas:
    • Measuring the similarity between the query input \(\mathbf{x}\) and the training inputs \(\mathbf{x}_i\).
    • Using the outputs \(y_i\) of those similar points to predict the output \(y\) of the query point.
  • We use this idea all the time:
    • “You are the average of the five people you spend the most time with.” (Jim Rohn)
    • “The sky is so dark, it’s going to be raining!”
       x1          x2   y
-0.752759    2.704286   1
 1.935603   -0.838856   0
-0.546282   -1.960234   0
 0.952162   -2.022393   0
-0.955179    2.584544   1
-2.458261    2.011815   1
 2.449595   -1.562629   0
 1.065386   -2.900473   0
-0.793301    0.793835   1
 2.015881    1.175845   0
-0.016509   -1.194730   0

K-Nearest Neighbors (KNN) Algorithm

Euclidean Distance

  • The core idea of some Non-parametric models is using the outputs of similar data points to predict any query point.
  • But what do similar and different actually mean?
  • In ML, we often use distances to measure how different data points are.
  • The most common distance is the Euclidean one:
    • For example: \(A = (1,\ 3,\ 4)\) and \(B = (-1,\ 2,\ 5)\) then

\[D(A, B) = \sqrt{(1-(-1))^2 + (3-2)^2 + (4-5)^2} = \sqrt{6} \approx 2.45 \text{ (units)}.\]

  • For two input vectors \(\mathbf{x} = (x_1, x_2, \ldots, x_d)\) and \(\mathbf{x}' = (x'_1, x'_2, \ldots, x'_d)\), the Euclidean distance between them is given by

\[D(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}.\]

Smaller distance ⇔ closer points ⇔ more similar data.
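The 3-D example above can be checked numerically; a minimal sketch using NumPy:

```python
import numpy as np

# The two example points A and B from the slide
A = np.array([1.0, 3.0, 4.0])
B = np.array([-1.0, 2.0, 5.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((A - B) ** 2))

# Equivalent one-liner using NumPy's built-in norm
assert np.isclose(distance, np.linalg.norm(A - B))
print(distance)  # sqrt(6) ≈ 2.449
```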

       x1          x2   y
-0.752759    2.704286   1
 1.935603   -0.838856   0
-0.546282   -1.960234   0
 0.952162   -2.022393   0
-0.955179    2.584544   1
  • Can you identify the most similar point to the first point based on its input?
  • What’s the label of that nearest point?
  • Assume that you know the labels of all the points except the first one.
  • How would you guess its label?
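The questions above can be answered directly in code; a minimal sketch using the five rows shown in the table:

```python
import numpy as np

# The five (x1, x2) inputs and labels y from the table above
X = np.array([[-0.752759,  2.704286],
              [ 1.935603, -0.838856],
              [-0.546282, -1.960234],
              [ 0.952162, -2.022393],
              [-0.955179,  2.584544]])
y = np.array([1, 0, 0, 0, 1])

# Euclidean distances from the first point to all the other points
dists = np.linalg.norm(X[1:] - X[0], axis=1)
nearest = 1 + np.argmin(dists)  # index of the closest point
print(nearest, y[nearest])      # the 5th row (index 4), label 1
```

The nearest point has label 1, which is a reasonable guess for the label of the first point.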

K-Nearest Neighbors (K-NN)

  • Given the training data: \(\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} \subset \mathbb{R}^d \times \mathcal{Y}\).
  • If \(D\) is a distance on \(\mathbb{R}^d\) (e.g. Euclidean distance), \(\mathbf{x}_{(k)}\) is called the \(k\)-th nearest neighbor of \(\mathbf{x} \in \mathbb{R}^d\) if its distance to \(\mathbf{x}\) ranks \(k\)-th among all the input points, i.e.,
  • \(D(\mathbf{x}, \mathbf{x}_{(1)}) \leq D(\mathbf{x}, \mathbf{x}_{(2)}) \leq \cdots \leq D(\mathbf{x}, \mathbf{x}_{(k-1)}) \leq D(\mathbf{x}, \mathbf{x}_{(k)}) \leq \cdots \leq D(\mathbf{x}, \mathbf{x}_{(n)})\).
  • Let \(y_{(1)}, \ldots, y_{(n)}\) be the target of \(\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(n)}\) respectively.
  • If \(k \geq 1\), then \(k\)-NN predicts the target of an input \(\mathbf{x}\) by

• Regression:

\(\hat{y} = \frac{1}{k} \sum_{j=1}^{k} y_{(j)}\)

\(= \text{Average } y_{(j)} \text{ among the } k \text{ neighbors.}\)

\(= \text{The predicted value.}\)

• Classification with M classes:

\(\hat{y} = \arg\max_{1 \leq m \leq M} \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}_{\{y_{(j)}=m\}}\)

\(= \text{Majority group among the } k \text{ neighbors.}\)

\(= \text{The predicted class.}\)

K-Nearest Neighbors (K-NN)

Example
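The two prediction rules above can be implemented from scratch in a few lines; a minimal sketch, with function and variable names of our own choosing:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k, task="classification"):
    """Predict the target of x_query from its k nearest neighbors."""
    # Euclidean distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nn_idx = np.argsort(dists)[:k]
    neighbors_y = y_train[nn_idx]
    if task == "regression":
        # Average of the neighbors' targets
        return neighbors_y.mean()
    # Majority vote among the neighbors
    values, counts = np.unique(neighbors_y, return_counts=True)
    return values[np.argmax(counts)]

# Toy check: a query point near the class-1 cluster
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5.2, 5.1]), k=3))  # majority label -> 1
```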

K-Nearest Neighbors (K-NN)

Influence of K

[Figure: k-NN fits for different values of k, regression vs classification]

  • Too large k ⇔ using many points ⇔ too inflexible ⇒ Underfitting.
  • Too small k ⇔ using few points ⇔ too flexible ⇒ Overfitting.
  • How to choose a good k?
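The overfitting end of this trade-off can be seen on the training score itself; a sketch on synthetic data, where `make_classification` and the seed are our own choices:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# k = 1: each training point is its own nearest neighbor -> perfect training score
knn_small = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn_small.score(X, y))  # 1.0 (the model memorizes the training set)

# Large k: the model averages over many points and becomes less flexible
knn_large = KNeighborsClassifier(n_neighbors=100).fit(X, y)
print(knn_large.score(X, y))  # below 1.0
```

A perfect training score for k = 1 says nothing about generalization, which is exactly why we need a proper data-splitting scheme to choose k.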

K-Nearest Neighbors (K-NN)

Fine-Tune K

Data Splitting: Train/Validation/Test

  • A good model is the one that can generalize/predict new unseen data.
  • The first attempt: splitting the data into 3 parts.
Set          Common %   Purpose
Train        60%-70%    Train the model
Validation   15%-20%    Tune hyperparameters (e.g. k)
Test         15%-20%    Evaluate final model performance

[Figure: train/validation/test split]

  • In this case, the best k is the one achieving the best performance on the Validation set.
  • The final performance is measured using the Test set.
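A three-way split can be obtained by calling `train_test_split` twice; a sketch with illustrative proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split off the test set (20% of the whole), then carve the
# validation set (20% of the whole = 25% of the remainder) out of the rest
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```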

K-Nearest Neighbors (K-NN)

Fine-Tune K

Data Splitting: K-Fold Cross-Validation

[Figure: K-fold cross-validation scheme]
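With K-fold cross-validation, every point is used for validation exactly once; `cross_val_score` automates the whole loop. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.shape)   # (5,) -- one validation score per fold
print(scores.mean())  # average cross-validation accuracy
```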

K-Nearest Neighbors (K-NN)

Performance Metrics

  • Selecting the best k depends not only on the splitting scheme, but also on the performance metric, which defines what “best” means.
  • What’s a performance metric?
  • It’s a value that measures the quality of a model when used to predict new unseen observations.
  • Metrics are divided into two main types:
    • Score: larger ⇒ better model.
    • Error: smaller ⇒ better model.
  • ⚠️ Do not confuse:
    • Metric: for fine-tuning the key hyperparameters of the model (measured on validation or test data).
      • Example: \(R^2\), Adjusted \(R^2\), Accuracy…
    • Loss: for training the model; computed on the training data.
      • Example: Mean Squared Error (MSE), Mean Absolute Error (MAE)…

K-Nearest Neighbors (K-NN)

Regression Metrics

  • These are some common metrics in regression problems.

• Mean Squared Error (MSE):

\(\text{MSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2.\)

• Mean Absolute Error (MAE):

\(\text{MAE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} |\color{blue}{y_i} - \color{red}{\hat{y}_i}|.\)

• Root Mean Squared Error (RMSE):

\(\text{RMSE} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2}.\)

• Mean Absolute Percentage Error (MAPE):

\(\text{MAPE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left|\frac{\color{blue}{y_i} - \color{red}{\hat{y}_i}}{\color{blue}{y_i}}\right|.\)

• Coefficient of Determination: \(R^2 = 1 - \frac{\sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2}{\sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \bar{y})^2}.\)
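These formulas map directly onto `sklearn.metrics` (MAPE is computed by hand here); a minimal sketch with toy values of our own:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy targets and predictions: every error is +/- 1
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 8.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # square root of the MSE
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # mean absolute % error
r2 = r2_score(y_true, y_pred)              # coefficient of determination

print(mse, mae, rmse)  # 1.0 1.0 1.0
print(r2)              # 0.8
```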

K-Nearest Neighbors (K-NN)

Classification Metrics

  • These are some common metrics in classification problems.

• Misclassification Error (ME):

\(\text{ME} = \frac{\#\{i : \color{blue}{y_i} \neq \color{red}{\hat{y}_i}\}}{n_{\text{test}}}.\)

• Precision:

\(\text{Precision} = \frac{\text{True Positive}}{\text{All Positive Prediction}}.\)

• Accuracy:

\(\text{Accuracy} = \frac{\#\{i : \color{blue}{y_i} = \color{red}{\hat{y}_i}\}}{n_{\text{test}}}.\)

• Recall:

\(\text{Recall} = \frac{\text{True Positive}}{\text{All Positive Labels}}.\)

• F1-score: \(\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.\)
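Likewise for the classification metrics; a sketch with toy labels chosen so the confusion counts are easy to read off:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
# Confusion counts: TP = 2, FP = 1, FN = 1, TN = 2

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / n        = 4/6
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)       = 2/3
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)       = 2/3
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```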

K-Nearest Neighbors (K-NN)

Classification Application

  • Let’s work with our Heart Disease Dataset (shape: \(1025 \times 14\)) and choose \(K = 5\).
  • Since K-NN is a distance-based method, it is essential to
    • scale the inputs,
    • watch out for outliers and missing values,
    • encode categorical inputs,
    • watch out for the effect of imbalanced classes.
▸ Code
import numpy as np
import pandas as pd
data = pd.read_csv("../Data/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# One-hot encoding and scaling
# (caveat: calling get_dummies separately on train and test can misalign
#  columns if a category is absent from one split; sklearn's OneHotEncoder
#  fitted on the training set avoids this)
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=["5NN"])
test_perf
     Accuracy  Precision    Recall  F1-score
5NN  0.868293   0.861111  0.885714  0.873239
  • Q: Can we do better?
  • A: Yes! \(K = 5\) is arbitrary. We should fine-tune it!

K-Nearest Neighbors (K-NN)

Fine-Tuning k: Cross-Validation

  • There are many ways to perform cross-validation in Python.
  • Let’s use GridSearchCV from sklearn.model_selection module.
▸ Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# Search k in {1, ..., 19} with 10-fold CV, scored by F1
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(1, 20))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf
     Accuracy  Precision    Recall  F1-score
5NN  0.868293   0.861111  0.885714  0.873239
1NN  1.000000   1.000000  1.000000  1.000000

K-Nearest Neighbors (K-NN)

What Can Go Wrong?

It doesn't work... why?

  • The perfect 1-NN score is an artifact of duplicated data: identical rows appear in both the training and test sets, so each test point finds an exact copy of itself as its nearest neighbor.
  • Let’s remove the duplicates and try again.
▸ Code
# Remove duplicates
data_no_dup = data.drop_duplicates()

# Re-split the data
X_no_dup, y_no_dup = data_no_dup.iloc[:,:-1], data_no_dup.iloc[:,-1]
X_train_nd, X_test_nd, y_train_nd, y_test_nd = train_test_split(X_no_dup, y_no_dup, test_size=0.2, stratify=y_no_dup, random_state=42)

# Re-encode and scale
X_train_cat_nd = pd.get_dummies(X_train_nd.select_dtypes(include="category"), drop_first=True)
X_train_encoded_nd = scaler.fit_transform(np.column_stack([X_train_nd.select_dtypes(include="number").to_numpy(), X_train_cat_nd]))
X_test_cat_nd = pd.get_dummies(X_test_nd.select_dtypes(include="category"), drop_first=True)
X_test_encoded_nd = scaler.transform(np.column_stack([X_test_nd.select_dtypes(include="number").to_numpy(), X_test_cat_nd]))

# GridSearch again
grid_search_nd = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search_nd.fit(X_train_encoded_nd, y_train_nd)

# Train and predict with optimal K
knn_nd = KNeighborsClassifier(n_neighbors=grid_search_nd.best_params_['n_neighbors'])
knn_nd = knn_nd.fit(X_train_encoded_nd, y_train_nd)
y_pred_nd = knn_nd.predict(X_test_encoded_nd)

# Add results to table
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test_nd, y_pred_nd),
          'Precision': precision_score(y_test_nd, y_pred_nd),
          'Recall': recall_score(y_test_nd, y_pred_nd),
          'F1-score': f1_score(y_test_nd, y_pred_nd)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search_nd.best_params_['n_neighbors']}NN_No_Dup"])], axis=0)
test_perf
             Accuracy  Precision    Recall  F1-score
5NN          0.868293   0.861111  0.885714  0.873239
1NN          1.000000   1.000000  1.000000  1.000000
16NN_No_Dup  0.885246   0.882353  0.909091  0.895522

K-Nearest Neighbors (\(k\)-NN)

Summary

  • \(k\)-NN predicts the label/value of a new point by looking at the \(k\) closest neighbors of the point.
  • Data preprocessing is essential: scaling, encoding categorical inputs, and handling outliers and duplicates.
  • The key parameter \(k\) can be tuned using cross-validation technique.
  • \(k\)-NN may not be suitable in high-dimensional settings due to the Curse of Dimensionality. However, we can try:
    • Feature selection
    • Dimensionality reduction
    • Alternative distance metrics
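For the last point, `KNeighborsClassifier` accepts a `metric` argument, so trying another distance (e.g. Manhattan) is a one-line change; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10 input features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Manhattan (L1) distance instead of the default Euclidean (L2);
# L1 is sometimes preferred in higher-dimensional spaces
knn_l1 = KNeighborsClassifier(n_neighbors=5, metric="manhattan").fit(X, y)
print(knn_l1.predict(X[:3]))  # class predictions for the first 3 points
```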

Questions?

Instinct Institute

Mork Mongkul