One disadvantage of ridge regression
Unlike subset-selection methods, which produce models involving only subsets of the predictors, ridge will include all \(p\) predictors in the final model.
In the plot below the grey lines are the coefficient paths of irrelevant variables: always close to zero but never set exactly equal to zero!
We could perform a post-hoc analysis to discard them, but ideally we would like a method that sets these coefficients exactly to zero automatically.
Lasso Regression (Least Absolute Shrinkage and Selection Operator)
First of all: there are different ways to pronounce it; pick your favourite!
Lasso is a more recent shrinkage method than ridge.
Like ridge, it penalises the size of the coefficients, but instead of their squares it penalises their absolute values.
The lasso regression estimates minimise the quantity \[{\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2} + {\lambda\sum_{j=1}^p|\beta_j|} = RSS + \lambda\sum_{j=1}^p|\beta_j|.\]
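For concreteness, here is a minimal sketch of this objective in Python (a plain numpy implementation; the function name and arguments are our own, not from any particular library):

```python
import numpy as np

def lasso_objective(beta0, beta, X, y, lam):
    """RSS plus the l1 penalty, exactly as in the display above."""
    residuals = y - beta0 - X @ beta       # y_i - beta_0 - sum_j beta_j * x_ij
    rss = np.sum(residuals ** 2)           # residual sum of squares
    penalty = lam * np.sum(np.abs(beta))   # lambda * sum_j |beta_j|
    return rss + penalty
```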
Small change but huge advantage!
Lasso offers the same benefits as ridge:
It introduces some bias but decreases the variance, so it improves predictive performance.
It is extremely scalable and can cope with \(n<p\) problems.
At the same time, lasso sets some coefficients exactly equal to zero \(\implies\) feature selection.
Lasso delivers sparse models \(\rightarrow\) models involving only a subset of the predictors (see the sketch after this list).
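To illustrate, a minimal sketch on simulated data (scikit-learn assumed available; its alpha parameter scales the penalty slightly differently from the \(\lambda\) above, but plays the same role):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
# Only the first three predictors are relevant; the rest are noise.
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # many coefficients exactly zero
print(np.sum(ridge.coef_ == 0))  # typically 0: ridge only shrinks
```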
As with ridge, proper tuning of the penalty parameter \(\lambda\) is critical.
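In practice \(\lambda\) is usually chosen by cross-validation; a sketch using scikit-learn's LassoCV on simulated data (self-contained here for convenience):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

# 5-fold cross-validation over an automatically generated grid of penalties
cv_lasso = LassoCV(cv=5).fit(X, y)
print(cv_lasso.alpha_)               # the selected penalty value
print(np.sum(cv_lasso.coef_ != 0))   # number of predictors retained
```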
Lasso paths for Credit data based on \(\lambda\)
Like ridge, lasso shrinks the coefficients as \(\lambda\) increases.
However, now, once \(\lambda\) exceeds certain values, individual coefficients become exactly zero.
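A coefficient-path plot of this kind can be sketched as follows (on simulated data, since the Credit data is not loaded here; scikit-learn's lasso_path returns the solutions over a decreasing grid of penalties):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

# coefs has shape (p, n_alphas): one path per predictor
alphas, coefs, _ = lasso_path(X, y)
for path in coefs:
    plt.plot(alphas, path)
plt.xscale("log")
plt.xlabel(r"$\lambda$ (alpha)")
plt.ylabel("coefficient value")
plt.show()
```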
Lasso paths for Credit data based on \(\lVert\hat{\beta}^L_{\lambda}\rVert_1/\lVert\hat{\beta}\rVert_1\)
Similarly for the “flipped” regularisation diagram, based on the ratio of the \(\ell_1\) norm of the lasso estimate to the \(\ell_1\) norm of the full least squares estimate.
How does this “magic” happen?
Mathematically:
The ridge penalty is a smooth differentiable function of the coefficients.
The lasso penalty is continuous but not differentiable at 0 (the absolute value has a kink there), which results in the desired property of sparsity...
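For intuition, a classical special case (assuming an orthonormal design and no intercept): writing \(\hat{\beta}_j\) for the least squares estimates, the ridge and lasso solutions have the closed forms \[\hat{\beta}^R_{j,\lambda} = \frac{\hat{\beta}_j}{1+\lambda} \qquad \text{and} \qquad \hat{\beta}^L_{j,\lambda} = \operatorname{sign}(\hat{\beta}_j)\,\big(|\hat{\beta}_j| - \lambda/2\big)_+ .\] Ridge shrinks every coefficient by the same proportion and so never reaches zero, whereas lasso translates each coefficient towards zero by \(\lambda/2\) and truncates: any \(\hat{\beta}_j\) with \(|\hat{\beta}_j| \le \lambda/2\) is set exactly to zero.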
We will see a geometric interpretation of ridge and lasso that makes this much more intuitive.
Next topic
We will discuss ridge and lasso from the perspective of constrained optimisation, which provides useful insight into the geometry of their corresponding solutions.