Machine Learning

Konstantinos Perrakis

Shrinkage Methods: Ridge Regression

First some intuition on interesting connections...

Previously, we have seen discrete model search methods, where one option is to pick the “best” model based on information criteria (\(C_p,\) BIC). In this case we generally seek the minimum of \[\underbrace{\mbox{model fit}}_{\mbox{RSS}}~ + ~{\color{red} \mbox{penalty on model dimensionality}}\] across models of differing dimensionality (number of predictors).
Interestingly, as we will see, shrinkage methods offer a continuous “analogue” (which is much faster). In this case we train only the full model, but the estimated regression coefficients minimise a general criterion of the form \[\underbrace{\mbox{model fit}}_{\mbox{RSS}}~ + ~{\color{blue} \mbox{penalty on size of coefficients}}.\] This approach is called penalised or regularised regression.
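To make the parallel concrete, here is a sketch of the two criteria side by side (using the standard form of \(C_p\), where \(d_{\mathcal{M}}\) denotes the number of predictors in a candidate model \(\mathcal{M}\) and \(\hat{\sigma}^2\) is an estimate of the error variance): \[\mbox{discrete search (e.g. } C_p\mbox{)}: \ \min_{\mathcal{M}} \ \mbox{RSS}_{\mathcal{M}} + {\color{red} 2\,d_{\mathcal{M}}\,\hat{\sigma}^2}, \qquad \mbox{ridge}: \ \min_{\beta} \ \mbox{RSS}(\beta) + {\color{blue} \lambda\sum_{j=1}^{p}\beta_j^2}.\] In the first case the penalty changes only when a predictor enters or leaves the model; in the second it varies continuously with the size of the coefficients.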

Ridge Regression

Trade-off and shrinkage \[\underbrace{\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}_{\mbox{model fit}} + \underbrace{\lambda\sum_{j=1}^p\beta_j^2}_{\mbox{penalty}}\]
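As a concrete illustration, the following minimal sketch computes the ridge coefficients from the closed-form solution \(\hat{\beta}^R_{\lambda} = (X^\top X + \lambda I)^{-1} X^\top y\). The helper name `ridge_coefficients` and the simulated data are purely illustrative; the predictors are standardised and the response centred, so the intercept is estimated separately as \(\bar{y}\) and is not penalised.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution for a given penalty lam (a sketch).

    Assumes the columns of X have been standardised and y has been centred,
    so the intercept is estimated separately as mean(y) and is not penalised.
    """
    n, p = X.shape
    # Solve (X'X + lam*I) beta = X'y rather than inverting the matrix explicitly.
    A = X.T @ X + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y)

# Toy usage with simulated data (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise predictors
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + rng.standard_normal(100)
y = y - y.mean()                              # centre the response

for lam in [0.0, 1.0, 100.0]:                 # lam = 0 recovers least squares
    print(lam, np.round(ridge_coefficients(X, y, lam), 3))
```

As \(\lambda\) grows, the printed coefficients shrink towards zero, which is exactly the trade-off encoded in the penalty term above.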

Regularisation paths for Credit data based on \(\lambda\)

[Figure: ridge coefficient paths for the Credit data plotted against \(\lambda\)]

Regularisation paths for increasing values of \(\lambda\): at the extreme left of the x-axis, \(\lambda\) is very close to zero and the coefficients of income, limit, rating and student essentially correspond to the LS estimates; at the extreme right, \(\lambda\) is very large and all coefficients are shrunk almost to zero.
Note: Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Regularisation paths for Credit data based on \(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2\)

[Figure: ridge coefficient paths for the Credit data plotted against \(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2\)]

Another way to represent this is in terms of the ratio \(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2\), where \(\hat{\beta}^R_{\lambda}\) are the ridge coefficients for a given value of \(\lambda\) and \(\hat{\beta}\) are the LS coefficients. The symbol \(\lVert\cdot\rVert_2\) denotes the \(\ell_2\)-norm; for \(\hat{\beta}\) this is \(\lVert\hat{\beta}\rVert_2 =\sqrt{\hat{\beta}_1^2 + \ldots +\hat{\beta}_p^2}\) (similarly for \(\lVert\hat{\beta}^R_{\lambda}\rVert_2\)). This plot is essentially a mirror image of the previous one: at the extreme left of the x-axis \(\lambda\) is very large (\(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2=0\)); at the extreme right \(\lambda=0\) (\(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2=1\)).
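Continuing the earlier sketch (reusing the hypothetical `ridge_coefficients`, `X` and `y` defined there), the ratio can be traced over a grid of \(\lambda\) values; it equals 1 at \(\lambda=0\) and decreases towards 0 as \(\lambda\) grows.

```python
import numpy as np

# Reusing X, y and ridge_coefficients() from the earlier sketch.
beta_ls = ridge_coefficients(X, y, 0.0)      # lambda = 0 gives the LS fit
ls_norm = np.linalg.norm(beta_ls)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    beta_ridge = ridge_coefficients(X, y, lam)
    ratio = np.linalg.norm(beta_ridge) / ls_norm
    print(f"lambda = {lam:8.1f}   ||beta_R||_2 / ||beta_LS||_2 = {ratio:.3f}")
```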

Ridge vs. Least-Squares: variance reduction

For a carefully chosen \(\lambda\), ridge can significantly outperform least squares in terms of estimation (and thus prediction) in the following cases: when the number of predictors \(p\) is large relative to the sample size \(n\), so that the LS estimates have high variance (and do not even exist when \(p>n\)), and when the predictors are strongly correlated with one another.

In modern ML applications \(p\) is large and it is very likely that there are strong correlations among the many predictors!

Illustration from a toy simulation

The bias-variance trade-off in prediction error

MSE = Bias\(^2\) + Variance
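Written out for a single test point \(x_0\), with \(\hat{f}(x_0)\) denoting the model's prediction, this is the standard decomposition (stated here for completeness, with the expectation taken over repeated training samples): \[\mbox{E}\Big[\big(\hat{f}(x_0)-f(x_0)\big)^2\Big] = \underbrace{\Big(\mbox{E}\big[\hat{f}(x_0)\big]-f(x_0)\Big)^2}_{\mbox{Bias}^2} + \underbrace{\mbox{E}\Big[\big(\hat{f}(x_0)-\mbox{E}\big[\hat{f}(x_0)\big]\big)^2\Big]}_{\mbox{Variance}}.\]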

[Figure: bias-variance trade-off for ridge regression as \(\lambda\) varies]

The influence of \(\lambda\): as \(\lambda\) increases, the variance of the ridge estimates decreases while the (squared) bias increases, so the MSE is typically minimised at some intermediate value of \(\lambda\). At \(\lambda=0\) we recover the unbiased but high-variance LS solution; as \(\lambda\rightarrow\infty\) the coefficients are shrunk towards zero and the bias dominates.
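A minimal sketch of the kind of toy simulation meant here (all names and settings are illustrative, not the simulation used for the figure): repeatedly simulate data with correlated predictors, fit ridge for a grid of \(\lambda\) values via the closed-form solution, and record the squared bias, variance and MSE of the coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, n_sims = 50, 10, 500
beta_true = np.concatenate([np.array([2.0, -1.5, 1.0]), np.zeros(p - 3)])

def simulate_X():
    # Correlated predictors: a common latent factor plus independent noise.
    z = rng.standard_normal((n, 1))
    X = 0.9 * z + 0.4 * rng.standard_normal((n, p))
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 5.0, 20.0, 100.0]:
    estimates = np.empty((n_sims, p))
    for s in range(n_sims):
        X = simulate_X()
        y = X @ beta_true + rng.standard_normal(n)
        estimates[s] = ridge(X, y - y.mean(), lam)
    bias2 = np.sum((estimates.mean(axis=0) - beta_true) ** 2)
    variance = np.sum(estimates.var(axis=0))
    print(f"lambda = {lam:6.1f}  bias^2 = {bias2:6.3f}  "
          f"variance = {variance:6.3f}  MSE = {bias2 + variance:6.3f}")
```

The printed table shows the trade-off directly: the variance term falls and the bias term rises as \(\lambda\) increases.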

Ridge vs. Least-Squares: scalability & feature selection

In terms of scalability ridge is attractive: for any given \(\lambda\) the solution is available in closed form, so fitting ridge over a grid of \(\lambda\) values is computationally far cheaper than searching over all \(2^p\) candidate models. The drawback is that ridge does not perform feature selection: the penalty shrinks the coefficients towards zero but (unless \(\lambda=\infty\)) never sets any of them exactly to zero, so the final model always contains all \(p\) predictors.

The importance of scaling

Unlike least squares, ridge is not scale-equivariant: the ridge estimates can change substantially when a predictor is multiplied by a constant, because the penalty \(\lambda\sum_j\beta_j^2\) depends on the scale of the coefficients. For this reason the predictors are usually standardised before applying ridge, e.g. \(\tilde{x}_{ij} = x_{ij}\big/\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\), so that each has standard deviation one.
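A minimal, self-contained sketch of this effect (all data simulated, purely illustrative): the least-squares fitted values are unchanged when a predictor is rescaled, whereas the ridge fitted values are not.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 100, 3, 10.0
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
y = y - y.mean()

def fit(X, lam):
    """Ridge fitted values for penalty lam (lam = 0 gives least squares)."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ beta

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0      # e.g. measure the first predictor in different units

# Least squares: fitted values are invariant to rescaling a column.
print(np.allclose(fit(X, 0.0), fit(X_rescaled, 0.0)))    # True
# Ridge: fitted values change, because the penalty depends on the scale.
print(np.allclose(fit(X, lam), fit(X_rescaled, lam)))    # False
```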

Next topic

We will revisit ridge, but first the lasso.