Linear Regression

AI Bootcamp


Mork Mongkul


Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable and one or more independent variables. The algorithm finds the best-fit straight line (linear equation) relating the variables. This statistical method can then be used to predict the outcome of future events, which makes it useful for predictive analysis.

• Goal: We want to predict a continuous number (e.g., House Price, Temperature, Stock Value) based on input data.
• Input (\(X\)): the features (square footage, number of rooms, location)
• Output (\(y\)): the target (price)

Linear Regression: Example 1

Training Set

Notation:

\(m\) = number of training examples
\(x\) = input variable / feature
\(y\) = output/target variable
\((x, y)\) = one training example
\((x^{(i)}, y^{(i)})\) = the \(i\)th training example


Model Representation & Cost Function

Model Representation:

• Hypothesis: \(h(x) = ax+b\), where \(a\) and \(b\) are called parameters


Cost Function (Cont.)

Goal: Choose \(a\) and \(b\) so that \(h(x) = ax + b\) is close to \(y\) for each training example \((x, y)\)


For each \((x^{(i)}, y^{(i)})\): Minimize \(|h(x^{(i)}) - y^{(i)}|\)

\[\Rightarrow \text{Minimize } \frac{1}{m} \sum_{i=1}^m |h(x^{(i)}) - y^{(i)}|\]

Squared Error Function (squaring instead of taking absolute values penalizes large errors more heavily and gives a cost that is differentiable everywhere, which gradient descent requires):

\[J(a, b) = \frac{1}{m} \sum_{i=1}^m \left(h(x^{(i)}) - y^{(i)}\right)^2\] \[\Rightarrow \min_{a, b} J(a, b)\]
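As a quick sanity check of the formula, here is a small Python sketch that evaluates \(J(a, b)\) on made-up data (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Squared-error cost J(a, b) = (1/m) * sum((h(x_i) - y_i)^2), as defined above.
def cost(a, b, x, y):
    m = len(x)
    return np.sum(((a * x + b) - y) ** 2) / m

# Toy data lying exactly on y = 2x: the cost of a perfect fit is 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(2.0, 0.0, x, y))  # 0.0
print(cost(1.0, 0.0, x, y))  # (1 + 4 + 9) / 3 ≈ 4.67
```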


Gradient Descent

Suppose we have the cost function \(J(a, b)\); our goal is to find \(\min_{a, b} J(a, b)\)


Algorithm Outline (see the sketch below):

- Start with some initial values of \(a\) and \(b\)
- Repeatedly update \(a := a - \alpha \frac{\partial J}{\partial a}\) and \(b := b - \alpha \frac{\partial J}{\partial b}\) (both updated simultaneously) to reduce \(J(a, b)\), until hopefully we end up at a minimum

Things to consider:

- Choice of the learning rate \(\alpha\)
- Global minimum vs. local minimum
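Below is a minimal from-scratch sketch of this procedure for the single-variable hypothesis \(h(x) = ax + b\); the toy data, learning rate, and iteration count are illustrative, not from the slides:

```python
import numpy as np

# Gradient descent for h(x) = a*x + b, minimizing
# J(a, b) = (1/m) * sum((h(x_i) - y_i)^2).
def gradient_descent(x, y, alpha=0.05, num_iters=5000):
    m = len(x)
    a, b = 0.0, 0.0                            # start with some initial values
    for _ in range(num_iters):
        error = (a * x + b) - y                # h(x^(i)) - y^(i) for all i
        grad_a = (2.0 / m) * np.sum(error * x)  # dJ/da
        grad_b = (2.0 / m) * np.sum(error)      # dJ/db
        a -= alpha * grad_a                    # step opposite the gradient
        b -= alpha * grad_b
    return a, b

# Toy data near y = 3x + 1; the result should approach a ≈ 3, b ≈ 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.1, 6.9, 10.2, 12.8])
print(gradient_descent(x, y))
```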

Linear Regression: Exercise 1

Land Price Prediction

Given a dataset of land prices as illustrated in the table below, find a linear regression model which fits the data. Train the model using the gradient descent algorithm (implemented from scratch).

[Table: land price dataset]

Linear Regression: Example 2

LR with Multiple Variables


Notation

- \(n\) = number of features
- \(x^{(i)}\) = input features of the \(i\)th example
- \(x_j^{(i)}\) = value of feature \(j\) of the \(i\)th example

Model:

- Hypothesis: \(h(x) = a x_1 + b x_2 + c\),
  or \(h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\),
  or \(h(x) = \sum_{j=0}^{n} \theta_j x_j\), where \(x_0 = 1\) and \(n = 2\)
- Cost function: \(J(\theta_0, \theta_1, \ldots, \theta_n) = J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2\)
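A hedged sketch of the same hypothesis and cost in vectorized form, with a column of ones prepended for \(x_0 = 1\) (the data and \(\theta\) values are illustrative):

```python
import numpy as np

def predict(theta, X):
    # X has shape (m, n); prepend a column of ones for x_0 = 1,
    # so that h(x) = theta^T x.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ theta

def cost(theta, X, y):
    m = len(y)
    return np.sum((predict(theta, X) - y) ** 2) / m

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])  # two features, m = 3
y = np.array([6.0, 5.0, 10.0])
theta = np.array([1.0, 1.0, 2.0])                    # theta_0, theta_1, theta_2
print(cost(theta, X, y))  # 0.0: this theta fits the toy data exactly
```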


LR with Multiple Variables: Feature Scaling

Make sure all features are on a similar scale, e.g. by standardizing each feature: \(x_j := \frac{x_j - \mu_j}{\sigma_j}\), where \(\mu_j\) and \(\sigma_j\) are the mean and standard deviation of feature \(j\). This keeps gradient descent from oscillating along the feature with the largest range.
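A minimal sketch of this z-score standardization (the feature values are illustrative):

```python
import numpy as np

def standardize(X):
    # Subtract each feature's mean and divide by its standard deviation;
    # keep mu and sigma so new data can be scaled the same way.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[2000.0, 3.0], [1500.0, 2.0], [1000.0, 4.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 and ~1 per feature
```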


Linear Regression

Evaluation Approach

Train/Test Split (Data Generalization)

- The dataset is divided into two parts: training set and test set
- Training set: used to learn patterns and fit the model
- Test set: used to evaluate how well the model generalizes to unseen data
- Helps detect overfitting and guards against over-optimistic accuracy estimates
- Analogy: Training = practice questions, Test = real exam
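A minimal sketch of such a split using scikit-learn's train_test_split (assuming scikit-learn is available; the 80/20 ratio and toy arrays are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features, 10 examples
y = np.arange(10)                  # toy targets

# Hold out 20% of the data for the "real exam" (the test set).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 8 training examples, 2 test examples
```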


Hyperparameter

Learning Rate

The learning rate is a floating-point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. If it is too high, the model never converges; instead, it bounces around the weights and bias that minimize the loss. The goal is to pick a learning rate that is neither too high nor too low, so that the model converges within a reasonable number of iterations.

In the figure on the right, the loss curve shows the model improving significantly during the first 20 iterations before beginning to converge.

[Figure: loss curve with a well-chosen learning rate]

In contrast, a learning rate that's too small can take too many iterations to converge. In the second figure, the loss curve shows the model making only minor improvements after each iteration.

[Figure: loss curve with a learning rate that is too small]

A learning rate that's too large never converges, because each iteration either causes the loss to bounce around or to increase continually. In the third figure, the loss curve fluctuates wildly, going up and down as the iterations increase.

[Figure: loss curve with a learning rate that is too large, fluctuating wildly]

In the fourth figure, also trained with a learning rate that's too large, the loss decreases at first and then drastically increases in later iterations.

[Figure: loss curve with a learning rate that is too large, diverging in later iterations]
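To see these regimes concretely, here is a small self-contained experiment that trains \(h(x) = ax + b\) with three different learning rates on the same toy data (the values are illustrative, not from the slides):

```python
import numpy as np

# Toy data near y = 3x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.1, 6.9, 10.2, 12.8])
m = len(x)

# Try a too-small, a reasonable, and a too-large learning rate.
for alpha in [0.001, 0.05, 0.3]:
    a, b = 0.0, 0.0
    for _ in range(100):
        error = (a * x + b) - y
        a -= alpha * (2.0 / m) * np.sum(error * x)
        b -= alpha * (2.0 / m) * np.sum(error)
    loss = np.sum(((a * x + b) - y) ** 2) / m
    # Expect: barely-improved loss, small loss, and a huge (diverged) loss.
    print(f"alpha={alpha}: loss after 100 iterations = {loss:.4g}")
```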


Hyperparameter

Batch Size

Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn’t practical.

Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent:

Stochastic gradient descent (SGD):

Stochastic gradient descent uses only a single example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy. “Noise” refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term “stochastic” indicates that the one example comprising each batch is chosen at random. Notice in the image on the right how loss slightly fluctuates as the model updates its weights and bias using SGD, which can lead to noise in the loss graph:

[Figure: noisy loss curve under SGD]

Tip

Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.

Mini-batch stochastic gradient descent (mini-batch SGD):

Mini-batch stochastic gradient descent is a compromise between full-batch gradient descent and SGD. For \(N\) data points, the batch size can be any number greater than 1 and less than \(N\). The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration. Determining the number of examples for each batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent.
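A hedged sketch of mini-batch SGD for \(h(x) = ax + b\); setting batch_size to 1 reduces it to SGD, and setting it to the dataset size gives full-batch gradient descent (the data and hyperparameters are illustrative):

```python
import numpy as np

# Mini-batch SGD: shuffle the data each epoch, slice it into batches,
# and update the parameters once per batch.
def minibatch_sgd(x, y, alpha=0.05, batch_size=2, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    m = len(x)
    a, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(m)             # random batches each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = (a * x[idx] + b) - y[idx]  # residuals for this batch
            a -= alpha * (2.0 / len(idx)) * np.sum(error * x[idx])
            b -= alpha * (2.0 / len(idx)) * np.sum(error)
    return a, b

# Toy data near y = 3x + 1; the result should be close to a ≈ 3, b ≈ 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.1, 6.9, 10.2, 12.8])
print(minibatch_sgd(x, y))
```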


Tip

When training a model, you might think that noise is an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing: for example, it can help training escape shallow local minima and can improve generalization.


Hyperparameter

Epochs

During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch. Training typically requires many epochs; that is, the system needs to process every example in the training set multiple times. The number of epochs is a hyperparameter you set before the model begins training. In many cases, you'll need to experiment with how many epochs it takes for the model to converge. In general, more epochs produce a better model, but training also takes more time.
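The epoch/iteration arithmetic from this example, written out (the epoch count is an illustrative value you would tune in practice):

```python
# 1,000 examples with batches of 100 gives 10 weight updates per epoch.
num_examples = 1_000
batch_size = 100
iterations_per_epoch = num_examples // batch_size     # 10
num_epochs = 50                                       # illustrative value
total_iterations = iterations_per_epoch * num_epochs  # 500 weight updates
print(iterations_per_epoch, total_iterations)
```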


Evaluation Approach

Error-Based Metrics (Predictive Accuracy)

Mean Absolute Error (MAE)

- Measures the average absolute difference between predicted and actual values
- Uses the same unit as the target variable
- Less sensitive to outliers than MSE

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]

Mean Squared Error (MSE)

- Squares prediction errors
- Large errors are penalized heavily
- Sensitive to outliers
- MSE is commonly used as a loss function during training because it is differentiable.

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]

Root Mean Squared Error (RMSE)

- Represents the typical size of prediction error
- Same unit as the target variable
- Sensitive to outliers
\[\text{RMSE} = \sqrt{\text{MSE}}\]
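
Direct NumPy translations of the three formulas (the y arrays are toy values):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
y_pred = np.array([2.5, 5.0, 7.5, 10.0])  # model predictions

mae = np.mean(np.abs(y_true - y_pred))    # average absolute error
mse = np.mean((y_true - y_pred) ** 2)     # average squared error
rmse = np.sqrt(mse)                       # back in the target's units
print(mae, mse, rmse)                     # 0.5, 0.375, ~0.612
```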


Evaluation Approach

Goodness-of-Fit Metrics (Model Explanatory Power)

R-Squared (\(R^2\))

- Measures the proportion of variance in the target variable explained by the model
- Compares the model against a baseline that predicts the mean
- Values typically range from 0 to 1 (higher is better); \(R^2\) can be negative when the model fits worse than simply predicting the mean
- Does NOT measure prediction error

\[ R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 } \]

Adjusted R-Squared

- Adjusted version of \(R^2\) that accounts for the number of predictors
- Penalizes adding irrelevant features
- Essential for Multiple Linear Regression
- Increases only when a new feature improves the model

\[ \text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right) \]

Where: \(n\) = number of samples, \(p\) = number of predictors
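
A short sketch computing \(R^2\) and adjusted \(R^2\) from these formulas (the arrays and the predictor count \(p\) are illustrative):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])
n = len(y_true)   # number of samples
p = 2             # number of predictors (assumed for illustration)

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # penalizes extra predictors
print(r2, adj_r2)
```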

Linear Regression: Exercise 2

Multiple Linear Regression

Given a dataset of land price:

1. Build a linear regression model that predicts the land price using both the land_area and the distance_to_city features. (See the dataset in 'land_price_1.csv'.)
2. Using only the distance feature, build a model with hypothesis \(h(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}\) to predict the land price, as sketched below. (See the dataset in 'land_price_2.csv'.)
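A possible starting point for part 2 is to build the \(\sqrt{x}\) column explicitly, which turns the model into ordinary multivariable linear regression; the column names below are assumptions, so adjust them to match the actual CSV headers:

```python
import numpy as np
import pandas as pd

# Load the exercise data; 'distance_to_city' and 'price' are assumed names.
df = pd.read_csv("land_price_2.csv")
x = df["distance_to_city"].to_numpy(dtype=float)  # assumed column name
y = df["price"].to_numpy(dtype=float)             # assumed column name

# Feature matrix [x, sqrt(x)]: theta_0 + theta_1*x + theta_2*sqrt(x) is then
# just multivariable linear regression on these two features.
X = np.column_stack([x, np.sqrt(x)])
# ...fit theta with the gradient descent sketch from the earlier slides
# (feature scaling is advisable, since x and sqrt(x) have different ranges).
```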

Questions?

Instinct Institute

Mork Mongkul