Machine Learning

Konstantinos Perrakis

Introduction to Week 1: Beyond Least Squares Regression

The Linear Model

Recall the multiple linear model: \[y= \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_p x_p +\epsilon\] Here, \(y\) is the response variable, the \(x\)’s are the predictor variables and \(\epsilon\) is an error term.
For simplicity, we will refer to such models as linear models from now on.
Linear models are simple, interpretable and widely used in practice.
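As a concrete illustration, here is a minimal Python sketch of simulating data from such a linear model. The sample size, coefficient values and error scale below are hypothetical choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 100, 3                      # training sample size and number of predictors
beta0 = 1.0                        # intercept (hypothetical value)
beta = np.array([2.0, -1.5, 0.5])  # regression coefficients (hypothetical values)

X = rng.normal(size=(n, p))              # predictor variables x_1, ..., x_p
epsilon = rng.normal(scale=1.0, size=n)  # error term
y = beta0 + X @ beta + epsilon           # response variable
```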

The Least Squares fit

We “fit” or “train” linear models by minimising the residual sum of squares (RSS) \[RSS = {\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2},\] where \(n\) is the training sample size and the minimisation is with respect to the regression coefficients \(\beta_0,\, \beta_1,\, \beta_2,\, \ldots,\, \beta_p\).
Minimising the RSS gives us the best possible fit to the training data, and we can use the resulting model to perform inference.
Importantly, we can then use the estimated coefficients, which we denote as \(\hat{\beta}_j\) (or sometimes as \(b_j\)) for \(j=0,1,\ldots,p\), for prediction. Given a new set of predictor values \(x_1^*,\, x_2^*,\, \ldots,\, x_p^*\), our prediction is \[\hat{y}^* = \hat{\beta}_0+\hat{\beta}_1x_1^* + \hat{\beta}_2x_2^*+ \ldots + \hat{\beta}_p x_p^*.\]
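Below is a minimal sketch of computing the least squares estimates and using them for prediction, assuming the simulated X and y from the earlier sketch. NumPy's lstsq is just one of many ways to obtain the LS solution, and the new predictor values x* are hypothetical.

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept beta_0
X1 = np.column_stack([np.ones(len(y)), X])

# Least squares estimates: minimise the RSS ||y - X1 @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Prediction for a new set of predictor values x_1*, ..., x_p* (hypothetical values)
x_star = np.array([0.3, -1.2, 0.8])
y_hat_star = beta_hat[0] + x_star @ beta_hat[1:]

print("estimated coefficients:", beta_hat)
print("prediction:", y_hat_star)
```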

Why depart from the least squares fit?

The model that considers all \(p\) available predictor variables is commonly referred to as the “full model”.
In many situations it is preferable to select a subset of the predictor variables, or to consider extensions of the least squares (LS) solution of the full model, because of two important issues:

Predictive accuracy

Model interpretability

Often, many predictors are not (effectively) associated with the response. For instance, we may have a model with 5 predictors \[y= \beta_0 + \underset{\color{green} \mbox{relevant}}{{\color{green}\beta_1x_1} + {\color{green}\beta_2x_2}} + \underset{\color{red} \mbox{irrelevant}}{{\color{red}\beta_3x_3} +{\color{red}\beta_4x_4} +{\color{red}\beta_5x_5}} +\epsilon,\] where only the first 2 variables are relevant to (i.e., have an influence on) the response.
Including irrelevant variables leads to unnecessary complexity. Ideally, we would like to keep only the relevant variables \(\rightarrow\) simpler, more interpretable model!
This is an important topic in stats/ML known as feature selection or variable selection.
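To make the issue concrete, the following sketch simulates the five-predictor example above (with hypothetical coefficient values): only the first two predictors influence the response, yet the least squares fit of the full model still returns non-zero estimates for the three irrelevant ones.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

X = rng.normal(size=(n, 5))                       # five candidate predictors
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])  # only x_1 and x_2 are relevant (hypothetical values)
y = 1.0 + X @ beta_true + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Estimates for the irrelevant predictors x_3, x_4, x_5:
# small, but not exactly zero, so the full model retains needless complexity
print(np.round(beta_hat[3:], 3))
```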

How can we tackle these challenges?

We will look into two (fundamentally) different strategies:

Model-search algorithms

Penalised regression

A few general remarks before going into details

Next topic

Before proceeding to model-search algorithms and penalised regression, we will discuss the important distinction between training error and prediction error, and also review the available model selection criteria and techniques.