AI Bootcamp
• Linear Regression is for regression problems.
• Logistic Regression is for classification problems.
• Both models assume an explicit input-output form (a formula mapping x to y).
• Do we have a method that predicts directly from data like the sample below, without assuming such a formula?
| x1 | x2 | y |
|---|---|---|
| -0.752759 | 2.704286 | 1 |
| 1.935603 | -0.838856 | 0 |
| -0.546282 | -1.960234 | 0 |
| 0.952162 | -2.022393 | 0 |
| -0.955179 | 2.584544 | 1 |
| -2.458261 | 2.011815 | 1 |
| 2.449595 | -1.562629 | 0 |
| 1.065386 | -2.900473 | 0 |
| -0.793301 | 0.793835 | 1 |
| 2.015881 | 1.175845 | 0 |
| -0.016509 | -1.194730 | 0 |
\[D(A, B) = \sqrt{(1-(-1))^2 + (3-2)^2 + (4-5)^2} = \sqrt{6} \approx 2.449 \text{ (units)}.\]
\[D(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{D} (x_i - x'_i)^2}.\]
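The Euclidean distance formula above can be checked with a short NumPy sketch, using the points A and B from the worked example:

```python
import numpy as np

# Euclidean distance between A = (1, 3, 4) and B = (-1, 2, 5)
A = np.array([1.0, 3.0, 4.0])
B = np.array([-1.0, 2.0, 5.0])

# Square the coordinate-wise differences, sum, and take the square root
dist = np.sqrt(np.sum((A - B) ** 2))
print(dist)  # sqrt(6) ≈ 2.449
```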
• Regression:
\(\hat{y} = \frac{1}{k} \sum_{j=1}^{k} y_{(j)}\)
\(= \text{Average } y_{(j)} \text{ among the } k \text{ neighbors.}\)
\(= \text{The predicted value.}\)
• Classification with M classes:
\(\hat{y} = \arg\max_{1 \leq m \leq M} \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}_{\{y_{(j)}=m\}}\)
\(= \text{Majority group among the } k \text{ neighbors.}\)
\(= \text{The predicted class.}\)
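Both prediction rules can be sketched from scratch in a few lines. The helper `knn_predict` below is hypothetical (not part of any library) and uses the first five rows of the sample data above:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Minimal k-NN: find the k nearest training points and aggregate their labels."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of the k closest points
    neighbors = y_train[nearest]
    if task == "regression":
        return neighbors.mean()                  # average y among the k neighbors
    values, counts = np.unique(neighbors, return_counts=True)
    return values[np.argmax(counts)]             # majority class among the k neighbors

# Sample data from the table above (x1, x2 -> y)
X = np.array([[-0.752759,  2.704286],
              [ 1.935603, -0.838856],
              [-0.546282, -1.960234],
              [ 0.952162, -2.022393],
              [-0.955179,  2.584544]])
y = np.array([1, 0, 0, 0, 1])

# A query near the top-left cluster: its nearest neighbors mostly have y = 1
print(knn_predict(X, y, np.array([-0.8, 2.5]), k=3))  # 1
```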
| Set | Common % | Purpose |
|---|---|---|
| Train | 60%-70% | For training the model |
| Validation | 15%-20% | Tune hyperparameters (e.g. the number of neighbors k) |
| Test | 15%-20% | Evaluate final model performance |
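One common way to produce the three sets is to call `train_test_split` twice; a minimal sketch on dummy data (the 60/20/20 proportions match the table above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out the test set (20%), then split the rest into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 0.25 of the remaining 80% gives a 20% validation set overall.
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```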
• Mean Squared Error (MSE):
\(\text{MSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2.\)
• Mean Absolute Error (MAE):
\(\text{MAE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} |\color{blue}{y_i} - \color{red}{\hat{y}_i}|.\)
• Root Mean Squared Error (RMSE):
\(\text{RMSE} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2}.\)
• Mean Absolute Percentage Error (MAPE):
\(\text{MAPE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left|\frac{\color{blue}{y_i} - \color{red}{\hat{y}_i}}{\color{blue}{y_i}}\right|.\)
• Coefficient of Determination: \(R^2 = 1 - \frac{\sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \color{red}{\hat{y}_i})^2}{\sum_{i=1}^{n_{\text{test}}} (\color{blue}{y_i} - \bar{y})^2}.\)
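The regression metrics above can be computed with `sklearn.metrics` (MAPE is also written out by hand here, matching the formula above); the toy arrays `y_true` and `y_pred` are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)            # 0.375
mae = mean_absolute_error(y_true, y_pred)           # 0.5
rmse = np.sqrt(mse)                                 # sqrt(0.375) ≈ 0.612
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # ≈ 0.1274
r2 = r2_score(y_true, y_pred)                       # ≈ 0.882
print(mse, mae, rmse, mape, r2)
```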
• Misclassification Error (ME):
\(\text{ME} = \frac{\#\{i : \color{blue}{y_i} \neq \color{red}{\hat{y}_i}\}}{n_{\text{test}}}.\)
• Precision:
\(\text{Precision} = \frac{\text{True Positives}}{\text{All Positive Predictions}}.\)
• Accuracy:
\(\text{Accuracy} = \frac{\#\{i : \color{blue}{y_i} = \color{red}{\hat{y}_i}\}}{n_{\text{test}}}.\)
• Recall:
\(\text{Recall} = \frac{\text{True Positives}}{\text{All Positive Labels}}.\)
• F1-score: \(\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.\)
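A tiny worked example of the classification metrics, with made-up labels chosen so the counts are easy to verify by hand (TP = 3, FP = 1, FN = 1, TN = 3):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```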
import numpy as np
import pandas as pd
data = pd.read_csv("../Data/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']
# Convert columns to the correct dtypes
for col in quan_vars:
    data[col] = data[col].astype('float')
for col in qual_vars:
    data[col] = data[col].astype('category')
# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
# One-hot encode the categorical features, then min-max scale all columns
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=["5NN"])
test_perf

|  | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| 5NN | 0.868293 | 0.861111 | 0.885714 | 0.873239 |
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(1,20))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)
knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf

|  | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| 5NN | 0.868293 | 0.861111 | 0.885714 | 0.873239 |
| 1NN | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

• A perfect 1NN score is a red flag: the dataset contains duplicate rows, so identical observations leak from the training set into the test set.
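Side note: a fitted `GridSearchCV` already exposes the winning configuration and its cross-validated score, and its `best_estimator_` is refit on the full training data, so manual retraining is optional. A self-contained sketch on synthetic data (`X_demo` and `y_demo` are made up here):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, linearly separable-ish demo data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': list(range(1, 20))},
                      cv=5, scoring='f1')
search.fit(X_demo, y_demo)

print(search.best_params_)  # the winning k (depends on the data)
print(search.best_score_)   # mean cross-validated F1 at that k
# best_estimator_ is already refit on all of X_demo and can predict directly.
print(search.best_estimator_.predict(X_demo[:5]))
```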
# Remove duplicates
data_no_dup = data.drop_duplicates()
# Re-split the data
X_no_dup, y_no_dup = data_no_dup.iloc[:,:-1], data_no_dup.iloc[:,-1]
X_train_nd, X_test_nd, y_train_nd, y_test_nd = train_test_split(X_no_dup, y_no_dup, test_size=0.2, stratify=y_no_dup, random_state=42)
# Re-encode and scale
X_train_cat_nd = pd.get_dummies(X_train_nd.select_dtypes(include="category"), drop_first=True)
X_train_encoded_nd = scaler.fit_transform(np.column_stack([X_train_nd.select_dtypes(include="number").to_numpy(), X_train_cat_nd]))
X_test_cat_nd = pd.get_dummies(X_test_nd.select_dtypes(include="category"), drop_first=True)
X_test_encoded_nd = scaler.transform(np.column_stack([X_test_nd.select_dtypes(include="number").to_numpy(), X_test_cat_nd]))
# GridSearch again
grid_search_nd = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search_nd.fit(X_train_encoded_nd, y_train_nd)
# Train and predict with optimal K
knn_nd = KNeighborsClassifier(n_neighbors=grid_search_nd.best_params_['n_neighbors'])
knn_nd = knn_nd.fit(X_train_encoded_nd, y_train_nd)
y_pred_nd = knn_nd.predict(X_test_encoded_nd)
# Add results to table
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test_nd, y_pred_nd),
          'Precision': precision_score(y_test_nd, y_pred_nd),
          'Recall': recall_score(y_test_nd, y_pred_nd),
          'F1-score': f1_score(y_test_nd, y_pred_nd)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search_nd.best_params_['n_neighbors']}NN_No_Dup"])], axis=0)
test_perf

|  | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| 5NN | 0.868293 | 0.861111 | 0.885714 | 0.873239 |
| 1NN | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 16NN_No_Dup | 0.885246 | 0.882353 | 0.909091 | 0.895522 |
Instinct Institute
Mork Mongkul, K-Nearest Neighbors | AI Bootcamp