Machine Learning

Konstantinos Perrakis

Introduction to Week 1: Beyond Least Squares Regression

The Linear Model

Recall the multiple linear model: \[y= \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_p x_p +\epsilon\] Here, \(y\) is the response variable, the \(x\)’s are the predictor variables and \(\epsilon\) is an error term.
For simplicity, we will refer to such models as linear models from now on.
Linear models are simple, interpretable and widely used in practice.
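As a concrete illustration, here is a minimal Python sketch of simulating data from such a linear model. The sample size, coefficient values and error scale below are hypothetical choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 100, 3                      # training sample size and number of predictors
beta0 = 1.0                        # intercept (hypothetical value)
beta = np.array([2.0, -1.5, 0.5])  # regression coefficients (hypothetical values)

X = rng.normal(size=(n, p))              # predictor variables x_1, ..., x_p
epsilon = rng.normal(scale=1.0, size=n)  # error term
y = beta0 + X @ beta + epsilon           # response variable
```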

The Least Squares fit

We “fit” or “train” linear models by minimising the residual sum of squares (RSS) \[RSS = {\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2},\] where \(n\) is the training sample size and the minimisation is with respect to the regression coefficients \(\beta_0,\, \beta_1,\, \beta_2,\, \ldots,\, \beta_p\).
Minimising the RSS gives us the best possible fit to the training data, and we can use the resulting model to perform inference.
Importantly, we can then use the estimated coefficients, which we denote as \(\hat{\beta}_j\) (or sometimes as \(b_j\)) for \(j=0,1,\ldots,p\), for prediction. Given a new set of predictor values \(x_1^*,\, x_2^*,\, \ldots,\, x_p^*\), our prediction is \[\hat{y}^* = \hat{\beta}_0+\hat{\beta}_1x_1^* + \hat{\beta}_2x_2^*+ \ldots + \hat{\beta}_p x_p^*.\]
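Below is a minimal sketch of computing the least squares estimates and using them for prediction, assuming the simulated X and y from the earlier sketch. NumPy's lstsq is just one of many ways to obtain the LS solution, and the new predictor values x* are hypothetical.

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept beta_0
X1 = np.column_stack([np.ones(len(y)), X])

# Least squares estimates: minimise the RSS ||y - X1 @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Prediction for a new set of predictor values x_1*, ..., x_p* (hypothetical values)
x_star = np.array([0.3, -1.2, 0.8])
y_hat_star = beta_hat[0] + x_star @ beta_hat[1:]

print("estimated coefficients:", beta_hat)
print("prediction:", y_hat_star)
```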

Why depart from the least squares fit?

The model that considers all \(p\) available predictor variables is commonly referred to as the “full model”.
In many situations it is preferable to select a subset of the predictor variables, or to consider extensions of the least squares (LS) solution of the full model, because of two important issues:

Predictive accuracy

Model interpretability

Often, many predictors are not (effectively) associated with the response. For instance, we may have a model with 5 predictors \[y= \beta_0 + \underset{\color{green} \mbox{relevant}}{{\color{green}\beta_1x_1} + {\color{green}\beta_2x_2}} + \underset{\color{red} \mbox{irrelevant}}{{\color{red}\beta_3x_3} +{\color{red}\beta_4x_4} +{\color{red}\beta_5x_5}} +\epsilon,\] where only the first 2 variables are relevant to (i.e., have an influence on) the response.
Including irrelevant variables leads to unnecessary complexity. Ideally, we would like to keep only the relevant variables \(\rightarrow\) simpler, more interpretable model!
This is an important topic in stats/ML known as feature selection or variable selection.
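To make the issue concrete, the following sketch simulates the five-predictor example above (with hypothetical coefficient values): only the first two predictors influence the response, yet the least squares fit of the full model still returns non-zero estimates for the three irrelevant ones.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

X = rng.normal(size=(n, 5))                       # five candidate predictors
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])  # only x_1 and x_2 are relevant (hypothetical values)
y = 1.0 + X @ beta_true + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Estimates for the irrelevant predictors x_3, x_4, x_5:
# small, but not exactly zero, so the full model retains needless complexity
print(np.round(beta_hat[3:], 3))
```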

How can we tackle these challenges?

We will look into two (fundamentally) different strategies:

Model-search algorithms

Penalised regression

A few general remarks before going into details

Next topic

Before proceeding to model-search algorithms and penalised regression, we will discuss the important distinction between training error and prediction error, and also review the available model selection criteria and techniques.