Ridge and Lasso with \(\ell\)-norms
Let's denote \(\beta=(\beta_1,\ldots,\beta_p)\). Then a more compact representation of the ridge and lasso minimisation problems is the following:
Ridge: \({\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2} + {\lambda\Vert\beta\Vert_2^2}\), where \(\Vert\beta\Vert_2=\sqrt{\beta_1^2 + \ldots + \beta_p^2}\) is the \(\ell_2\)-norm of \(\beta\).
Lasso: \({\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2} + {\lambda\Vert\beta\Vert_1}\), where \(\Vert\beta\Vert_1={|\beta_1| + \ldots + |\beta_p|}\) is the \(\ell_1\)-norm of \(\beta\).
That is why we say that ridge uses \(\ell_2\)-penalisation, while lasso uses \(\ell_1\)-penalisation.
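For concreteness, here is a tiny R sketch of the two norms for a hypothetical coefficient vector (the numbers below are arbitrary):

```r
# Hypothetical coefficient vector beta = (beta_1, ..., beta_p) with p = 4
beta <- c(0.5, -1.2, 0, 3.0)

sqrt(sum(beta^2))   # ||beta||_2, the l2-norm (ridge penalises its square)
sum(abs(beta))      # ||beta||_1, the l1-norm (lasso penalises this directly)
```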
Ridge and Lasso as constrained optimisation problems
There is an equivalent representation for both methods:
Ridge: minimise \({\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}\) subject to \(\sum_{j=1}^p\beta_j^2 \le t\).
Lasso: minimise \({\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}\) subject to \(\sum_{j=1}^p|\beta_j| \le t\).
The “subject to” clause imposes the corresponding constraint on the coefficients.
The threshold \(t\) may be regarded as a “budget”: a large \(t\) allows the coefficients to be large.
There is a one-to-one correspondence between the penalty parameter \(\lambda\) and the threshold \(t\): for every set of coefficients obtained with a specific \(\lambda\), there is always a corresponding \(t\) that yields the same solution.
Just keep in mind that the relationship is inverse: when \(\lambda\) is large, \(t\) is small, and vice versa!
The ridge constraint is “smooth”, while the lasso constraint is “edgy”.
Diamonds and Circles in Two Dimensions
Left: the constraint \(|\beta_1|+|\beta_2|\le t\)
Right: the constraint \(\beta_1^2+\beta_2^2\le t\)
Our friends \(\beta_1\) and \(\beta_2\) are not allowed to go outside the red areas!
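A minimal R sketch that shades the two constraint regions for an arbitrary budget \(t=1\) (base graphics only; the plotting ranges are arbitrary choices):

```r
# Shade the lasso (diamond) and ridge (disc) constraint regions for t = 1
b1 <- seq(-1.5, 1.5, length.out = 300)
b2 <- seq(-1.5, 1.5, length.out = 300)
t  <- 1  # the "budget"

lasso_region <- outer(b1, b2, function(a, b) (abs(a) + abs(b) <= t) + 0)
ridge_region <- outer(b1, b2, function(a, b) (a^2 + b^2 <= t) + 0)

par(mfrow = c(1, 2), pty = "s")
image(b1, b2, lasso_region, col = c("white", "red"), main = "Lasso: diamond",
      xlab = expression(beta[1]), ylab = expression(beta[2]))
image(b1, b2, ridge_region, col = c("white", "red"), main = "Ridge: circle",
      xlab = expression(beta[1]), ylab = expression(beta[2]))
```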
Solutions: RSS + Constraints
(Figure: the ridge and lasso solutions shown in relation to the RSS contours and the constraint regions. Image from https://medium.com/@davidsotunbo/ridge-and-lasso-regression-an-illustration-and-explanation-using-sklearn-in-python-4853cd543898)
How do we find ridge/lasso solutions for different \(\lambda\)’s?
Both ridge and lasso are convex optimisation problems.
In optimisation, convexity is a very desirable property.
In fact the ridge solution exists in closed form (meaning we have a formula for it).
Lasso does not have a closed form solution, but nowadays there exist very efficient optimisation algorithms for the lasso solution.
The glmnet package in R implements coordinate descent (very fast); it takes the same time to solve ridge and lasso.
Due to clever optimisation tricks, finding the solutions for multiple \(\lambda\)'s takes about the same time as finding the solution for a single \(\lambda\) value.
The last point is very useful: it implies that we can obtain the solutions over a grid of \(\lambda\) values very quickly and then pick the \(\lambda\) that is optimal with respect to some criterion.
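A minimal glmnet sketch along these lines, using simulated toy data (the dimensions, coefficients and \(\lambda\)-grid below are arbitrary choices):

```r
library(glmnet)
set.seed(1)

# Toy data just to make the sketch runnable (dimensions are arbitrary)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

lambda_grid <- 10^seq(3, -3, length.out = 100)   # arbitrary grid, large to small

ridge_fit <- glmnet(x, y, alpha = 0, lambda = lambda_grid)   # alpha = 0 -> ridge
lasso_fit <- glmnet(x, y, alpha = 1, lambda = lambda_grid)   # alpha = 1 -> lasso

coef(lasso_fit, s = 1)             # coefficients at one particular lambda (here 1)
plot(lasso_fit, xvar = "lambda")   # coefficient paths across the whole grid
```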
Ridge vs. Lasso: Predictive performance?
So, lasso has a clear advantage over ridge: it can set coefficients exactly equal to zero!
This must mean that it is also better for prediction, right?
Well, not necessarily; things are a bit more complicated than that... it all comes down to the nature of the actual underlying data-generating mechanism.
Let's explain this rather cryptic answer...
Sparse vs. Dense Mechanisms
Generally...
When the actual data-generating mechanism is sparse lasso has the advantage.
When the actual data-generating mechanism is dense ridge has the advantage.
Sparse mechanisms: Few predictors are relevant to the response \(\rightarrow\) good setting for lasso regression.
Dense mechanisms: A lot of predictors are relevant to the response \(\rightarrow\) good setting for ridge regression.
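A small simulated illustration of the two mechanisms (all dimensions and coefficient values below are arbitrary choices):

```r
set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)

# Sparse mechanism: only the first 3 predictors affect the response
beta_sparse <- c(3, -2, 1.5, rep(0, p - 3))
y_sparse <- x %*% beta_sparse + rnorm(n)

# Dense mechanism: every predictor contributes a small amount
beta_dense <- rnorm(p, mean = 0, sd = 0.3)
y_dense <- x %*% beta_dense + rnorm(n)
```

With data like y_sparse, lasso is typically the better choice; with data like y_dense, ridge typically has the edge.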
Things may look like that in extreme cases...
(Figure: \(\leftarrow\) few predictors vs. many predictors \(\rightarrow\))
This is a rough “extreme” illustration; the actual performance will also depend upon:
Signal strength (the magnitude of the effects of the relevant variables).
The correlation structure among predictors.
Sample size vs. number of predictors.
Takeaway message
In general, neither of the two shrinkage methods will dominate in terms of predictive performance across all settings.
Lasso performs better when few predictors have a substantial effect on the response variable.
Ridge performs better when a lot of predictors have a substantial effect on the response.
Keep in mind: “a few” and “a lot” are always relative to the total number of available predictors.
Overall: in most applications lasso is more robust.
Question: What if we have to choose between ridge and lasso?
Answer: Keep a part of the data for testing purposes only and compare predictive performance of the two methods.
(We did the same at the end of the Workshop 1 tutorial, when we wanted to decide which model to choose among the best \(C_p\), BIC, and adjusted \(R^2\) models.)
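A sketch of such a hold-out comparison with glmnet (assuming a numeric predictor matrix x and response vector y are available, e.g. from the simulation above; the 70/30 split and the use of the cross-validated lambda.min are illustrative choices):

```r
library(glmnet)
set.seed(1)

train <- sample(seq_len(nrow(x)), size = floor(0.7 * nrow(x)))  # 70% training split
test  <- setdiff(seq_len(nrow(x)), train)

# Tune lambda by cross-validation on the training data only
cv_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0)
cv_lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)

# Predict on the held-out test set at the CV-selected lambda
pred_ridge <- predict(cv_ridge, newx = x[test, ], s = "lambda.min")
pred_lasso <- predict(cv_lasso, newx = x[test, ], s = "lambda.min")

# Compare test mean squared errors: the smaller one wins
mean((y[test] - pred_ridge)^2)
mean((y[test] - pred_lasso)^2)
```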
Next topic
We will see how to tune the penalty parameter \(\lambda\) for ridge and lasso regression in practice.