Machine Learning

Konstantinos Perrakis

Shrinkage Methods: Ridge vs. Lasso

Ridge and Lasso with \(\ell\)-norms

Let's denote \(\beta=(\beta_1,\ldots,\beta_p)\). Then a more compact representation of the ridge and lasso minimisation problems is the following:

Ridge: minimise \(\mathrm{RSS} + \lambda\|\beta\|_2^2\), where \(\|\beta\|_2^2=\sum_{j=1}^p\beta_j^2\).

Lasso: minimise \(\mathrm{RSS} + \lambda\|\beta\|_1\), where \(\|\beta\|_1=\sum_{j=1}^p|\beta_j|\).

That is why we say that ridge uses \(\ell_2\)-penalisation, while lasso uses \(\ell_1\)-penalisation.
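To make the two penalties concrete, here is a minimal sketch (not part of the original slides) computing them for a hypothetical coefficient vector and penalty parameter:

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.0, 2.0])  # hypothetical coefficient vector
lam = 0.1                               # hypothetical penalty parameter lambda

l2_penalty = lam * np.sum(beta ** 2)     # ridge: lambda * ||beta||_2^2
l1_penalty = lam * np.sum(np.abs(beta))  # lasso: lambda * ||beta||_1

print(l2_penalty, l1_penalty)
```

Note that the \(\ell_2\) penalty is differentiable everywhere, while the \(\ell_1\) penalty has a kink at zero; this kink is what allows lasso to set coefficients exactly to zero.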

Ridge and Lasso as constrained optimisation problems

There is an equivalent representation for both methods:

Ridge: minimise \({\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}\) subject to \(\sum_{j=1}^p\beta_j^2 \le t\).

Lasso: minimise \({\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}\) subject to \(\sum_{j=1}^p|\beta_j| \le t\).

“subject to” above essentially imposes the corresponding constraints.

Threshold \(t\) may be considered as a “budget”: large \(t\) allows coefficients to be large.

There is a one-to-one correspondence between the penalty parameter \(\lambda\) and the threshold \(t\): for every set of coefficients obtained with a specific \(\lambda\) there always exists a corresponding \(t\).

Just keep in mind that the relationship is inverse: when \(\lambda\) is large, \(t\) is small, and vice versa!
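We can see this inverse relationship empirically. Below is a sketch (using scikit-learn on synthetic data, so all names and values are illustrative assumptions) that fits the lasso for increasing \(\lambda\) (called `alpha` in scikit-learn) and records the implied budget \(t = \|\hat\beta\|_1\):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

# As lambda (scikit-learn's `alpha`) grows, the implied budget
# t = ||beta_hat||_1 of the fitted coefficients shrinks.
budgets = []
for lam in [0.01, 0.1, 1.0]:
    fit = Lasso(alpha=lam).fit(X, y)
    budgets.append(np.abs(fit.coef_).sum())

print(budgets)  # a decreasing sequence
```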

The ridge constraint is “smooth”, while the lasso constraint is “edgy”.

Diamonds and Circles in 2-dimensions


Solutions: RSS + Constraints


Image from https://medium.com/@davidsotunbo/ridge-and-lasso-regression-an-illustration-and-explanation-using-sklearn-in-python-4853cd543898

How do we find ridge/lasso solutions for different \(\lambda\)’s?

The last point is very useful: it implies that we can obtain the solutions over a whole grid of \(\lambda\) values very quickly, and then pick the specific \(\lambda\) that is optimal with respect to some objective function.
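As a sketch of this idea (on synthetic data, with illustrative values), scikit-learn's `lasso_path` computes the entire solution path over a grid of \(\lambda\)'s in one call, reusing computations via warm starts rather than fitting each \(\lambda\) from scratch:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
# Sparse truth: only the first 2 of 10 coefficients are non-zero.
y = X @ np.concatenate([[2.0, -1.5], np.zeros(8)]) + rng.normal(size=80)

# One call returns coefficients for a whole grid of 50 lambda values.
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)  # (n_features, n_alphas) = (10, 50)
```

Each column of `coefs` is the fitted coefficient vector at one grid value of \(\lambda\); we can then choose among the columns using a criterion such as validation error.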

Ridge vs. Lasso: Predictive performance?

So, lasso has a clear advantage over ridge: it can set coefficients exactly to zero!
This must mean that it is also better for prediction, right?

Well, not necessarily; things are a bit more complicated than that... it all comes down to the nature of the actual underlying mechanism.

Let's unpack this rather cryptic answer...

Sparse vs. Dense Mechanisms

Generally...

Sparse mechanisms: Few predictors are relevant to the response \(\rightarrow\) good setting for lasso regression.
Dense mechanisms: A lot of predictors are relevant to the response \(\rightarrow\) good setting for ridge regression.
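The two settings above can be simulated directly. The sketch below (synthetic data; the coefficient values and penalty strengths are illustrative assumptions, not a definitive benchmark) compares ridge and lasso test error under a sparse and a dense truth:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))

# Sparse mechanism: only 2 of 20 predictors matter.
beta_sparse = np.zeros(p)
beta_sparse[:2] = 3.0
# Dense mechanism: all 20 predictors contribute a little.
beta_dense = np.full(p, 0.5)

for name, beta in [("sparse", beta_sparse), ("dense", beta_dense)]:
    y = X @ beta + rng.normal(size=n)
    X_test = rng.normal(size=(n, p))
    y_test = X_test @ beta + rng.normal(size=n)
    for model in [Ridge(alpha=1.0), Lasso(alpha=0.1)]:
        mse = np.mean((y_test - model.fit(X, y).predict(X_test)) ** 2)
        print(name, type(model).__name__, round(mse, 2))
```

Typically lasso does relatively better under the sparse truth and ridge under the dense one, though the outcome for any single simulated dataset also depends on noise level and sample size.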

Things may look like that in extreme cases...

\(\leftarrow\) few predictors vs. many predictors \(\rightarrow\)

This is a rough, "extreme" illustration; in practice the actual performance will also depend upon several other factors.

Takeaway message

Question: How do we choose between ridge and lasso?
Answer: Keep part of the data aside for testing purposes only and compare the predictive performance of the two methods.
(We did the same at the end of the Workshop 1 tutorial, when we wanted to decide which model to choose among the best \(C_p\), BIC and adjusted \(R^2\) models.)
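A minimal sketch of this hold-out comparison (synthetic data; the true coefficients and `alpha` values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 8))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0]) + rng.normal(size=150)

# Hold out 30% of the data for testing only, then compare test MSE.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for model in [Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    mse = np.mean((y_te - model.fit(X_tr, y_tr).predict(X_te)) ** 2)
    print(type(model).__name__, round(mse, 3))
```

Whichever method achieves the lower test error is the one we would carry forward for prediction on new data.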

Next topic

We will see how we tune penalty parameter \(\lambda\) for ridge and lasso regression in practice.