First, some intuition on an interesting connection...
Previously, we have seen discrete model search methods, where one option is to pick the “best” model based on information criteria (\(C_p,\) BIC). In this case we generally seek the minimum of \[\underbrace{\mbox{model fit}}_{\mbox{RSS}}~ + ~{\color{red} \mbox{penalty on model dimensionality}}\] across models of differing dimensionality (number of predictors).
Interestingly, as we will see, shrinkage methods offer a continuous “analogue” (which is much faster). In this case we train only the full model, but the estimated regression coefficients minimise an objective of the general form \[\underbrace{\mbox{model fit}}_{\mbox{RSS}}~ + ~{\color{blue} \mbox{penalty on size of coefficients}}.\] This approach is called penalised or regularised regression.
Ridge Regression
Recall LS produces estimates of \(\beta_0,\, \beta_1,\, \beta_2,\, \ldots,\, \beta_p\) by minimising \[RSS = \sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2.\]
Ridge regression on the other hand minimises \[\underbrace{\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}_{\mbox{model fit}} + \underbrace{\lambda\sum_{j=1}^p\beta_j^2}_{\mbox{penalty}} = RSS + \lambda\sum_{j=1}^p\beta_j^2,\] where \(\lambda\ge 0\) is a tuning parameter. The role of this parameter is crucial; it controls the trade-off between model fit and the size of the coefficients. We will see later how we tune \(\lambda\)...
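To make this concrete, here is a minimal sketch (not part of the original slides) of fitting ridge regression in R with the glmnet package, assuming the Credit data from the ISLR package. Note that glmnet uses alpha = 0 for the ridge penalty and internally scales the RSS term by \(1/(2n)\), so its \(\lambda\) values are not directly comparable to the formula above.

```r
library(glmnet)
library(ISLR)                                 # assumed source of the Credit data

# Design matrix (model.matrix dummy-codes the factors; drop the intercept column)
x <- model.matrix(Balance ~ . - ID, data = Credit)[, -1]
y <- Credit$Balance

grid <- 10^seq(4, -2, length = 100)           # a grid of lambda values, large to small
ridge_fit <- glmnet(x, y, alpha = 0, lambda = grid)   # alpha = 0 gives the ridge penalty

coef(ridge_fit, s = 50)                       # coefficients for (an arbitrary) lambda = 50
```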
Trade-off and shrinkage \[\underbrace{\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2}_{\mbox{model fit}} + \underbrace{\lambda\sum_{j=1}^p\beta_j^2}_{\mbox{penalty}}\]
When \(\lambda=0\) the penalty term has no effect and the ridge solution is the same as the LS solution.
But as \(\lambda\) gets larger the penalty term naturally carries more weight; so, in order to minimise the entire objective (model fit + penalty), the regression coefficients will necessarily get smaller!
So, unlike LS, ridge regression does not produce one set of coefficients; it produces a different set of coefficients for each value of \(\lambda\)!
As \(\lambda\) increases the coefficients are shrunk towards 0 (for really large \(\lambda\) the coefficients are almost 0).
Regularisation paths for Credit data based on \(\lambda\)
Regularisation paths for increasing values of \(\lambda\): at the extreme left of the x-axis \(\lambda\) is very close to zero and the coefficients for income, limit, rating and student are large (corresponding to the LS coefficients); at the extreme right of the x-axis \(\lambda\) is very large and all coefficients are almost zero.
Note: Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
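A plot similar to the regularisation-path figure can be produced directly from a fitted glmnet object; this reuses the ridge_fit sketch from earlier and is only meant to be indicative.

```r
# Coefficient paths against log(lambda); label = TRUE prints variable indices
plot(ridge_fit, xvar = "lambda", label = TRUE)
```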
Regularisation paths for Credit data based on \(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2\)
Another way to represent this is in terms of the metric \(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2\), where \(\hat{\beta}^R_{\lambda}\) are the ridge coefficients for a given value of \(\lambda\) and \(\hat{\beta}\) are the LS coefficients. The symbol \(\lVert\cdot\rVert_2\) is the \(\ell_2\)-norm; for \(\hat{\beta}\): \(\lVert\hat{\beta}\rVert_2 =\sqrt{\hat{\beta}_1^2 + \ldots +\hat{\beta}_p^2}\) (similarly for \(\lVert\hat{\beta}^R_{\lambda}\rVert_2\)). This is a mirror image of the previous plot: at the extreme left of the x-axis \(\lambda\) is very large (\(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2=0\)); at the extreme right of the x-axis \(\lambda=0\) (\(\lVert\hat{\beta}^R_{\lambda}\rVert_2/\lVert\hat{\beta}\rVert_2=1\)).
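As a rough sketch, reusing x, y and ridge_fit from the earlier glmnet example, the norm ratio can be traced across the lambda grid as follows:

```r
beta_ls    <- coef(lm(y ~ x))[-1]               # LS coefficients, intercept removed
beta_ridge <- as.matrix(coef(ridge_fit))[-1, ]  # one column per lambda (largest lambda first)

ratio <- sqrt(colSums(beta_ridge^2)) / sqrt(sum(beta_ls^2))
plot(ratio, type = "l",
     xlab = "lambda index (lambda decreasing left to right)", ylab = "norm ratio")
```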
Ridge vs. Least-Squares: variance reduction
For a carefully chosen \(\lambda\) ridge can significantly outperform least squares in terms of estimation (and thus prediction) in the following cases:
Multicollinearity. This occurs when there are strong correlations among the predictor variables. The LS coefficients are still unbiased but become very unstable (high variance) in the presence of multicollinearity, which hurts prediction error. Ridge introduces some bias but decreases the variance, which results in better predictions.
No correlations but \(p\) close to \(n\). LS estimates suffer again from high variance. Ridge estimates are again more stable.
In modern ML applications \(p\) is large and it is very likely to have strong correlations among the many predictors!
Illustration from a toy simulation
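The exact simulation behind the figure is not reproduced here; the following is a hypothetical sketch of a comparable experiment, with two nearly collinear predictors, illustrating the variance reduction described above.

```r
library(glmnet)
set.seed(1)

n <- 50; reps <- 500; lambda <- 1
beta_ls <- beta_ridge <- matrix(NA, reps, 2)

for (r in 1:reps) {
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.05)        # x2 is almost a copy of x1 (multicollinearity)
  y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
  X  <- cbind(x1, x2)
  beta_ls[r, ]    <- coef(lm(y ~ X))[-1]
  beta_ridge[r, ] <- as.numeric(coef(glmnet(X, y, alpha = 0, lambda = lambda)))[-1]
}

apply(beta_ls, 2, var)      # LS coefficient estimates: very unstable across samples
apply(beta_ridge, 2, var)   # ridge estimates: biased, but far less variable
```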
The bias-variance trade-off in prediction error
MSE = Bias\(^2\) + Variance
The influence of \(\lambda\):
As we move away from \(\lambda=0\) the variance decreases quickly, while the bias increases only slowly, so we can obtain significant reductions in MSE (the circle indicates the least-squares MSE; the cross indicates the minimum MSE obtained from ridge).
After a point, predictive performance starts to deteriorate, since for large \(\lambda\) all coefficients are shrunk towards zero.
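To see this pattern empirically, one could trace test MSE over a lambda grid on simulated data; the sketch below is a hypothetical setup (the exact shape of the curve depends on the simulation settings).

```r
library(glmnet)
set.seed(2)

n <- 100; p <- 40
X <- matrix(rnorm(n * p), n, p)
beta_true <- rnorm(p, sd = 0.1)             # weak signal, so shrinkage helps
y <- X %*% beta_true + rnorm(n)

X_new <- matrix(rnorm(n * p), n, p)         # an independent test set
y_new <- X_new %*% beta_true + rnorm(n)

fit  <- glmnet(X, y, alpha = 0, lambda = 10^seq(2, -3, length = 100))
pred <- predict(fit, newx = X_new)          # one column of predictions per lambda
test_mse <- colMeans((as.numeric(y_new) - pred)^2)

plot(log(fit$lambda), test_mse, type = "l", xlab = "log(lambda)", ylab = "test MSE")
```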
Ridge vs. Least-Squares: scalability & feature selection
High-dimensionality. Ridge is applicable for any number of predictors, even when \(n < p\) – remember that a unique LS solution does not exist in this case. It is also much faster than model-search based methods since we train only one model.
Feature selection. Strictly speaking ridge does not perform feature selection (since the coefficients are almost never exactly 0), but it can give us a good idea of which predictors are not influential and can be combined with post-hoc analysis based on ranking the absolute values of the coefficients.
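A crude version of such a post-hoc ranking, reusing x and y from the earlier Credit sketch, could look like this; the predictors are standardised first so that coefficient magnitudes are comparable (more on scaling below), and lambda = 50 is an arbitrary illustrative value.

```r
x_std   <- scale(x)                                  # put all predictors on the same scale
fit_std <- glmnet(x_std, y, alpha = 0, lambda = 50)  # lambda = 50 chosen only for illustration
beta    <- as.numeric(coef(fit_std))[-1]
names(beta) <- colnames(x_std)

sort(abs(beta), decreasing = TRUE)   # larger |coefficient| suggests a more influential predictor
```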
The importance of scaling
The LS estimates are scale equivariant: multiplying a feature \(X_j\) by some constant \(c\) simply scales \(\hat{\beta}_j\) by \(1/c\), so the product \(X_j\hat{\beta}_j\) remains unaffected.
In ridge regression (and any shrinkage method) the scaling of the features matters! If a relevant feature is measured on a much larger scale than the other features, its coefficient will be small and will be shrunk towards zero quickly.
So, we want the features to be on the same scale.
In practice, we standardise \[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{ij}-\bar{x}_j)^2}},\] so that all predictors have variance equal to one.
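In R this can be done with scale(), as in the sketch below (note that scale() divides by the sample standard deviation, with denominator \(n-1\), a minor difference from the \(1/n\) version above; glmnet also standardises internally by default via standardize = TRUE and reports coefficients on the original scale).

```r
x_std <- scale(x, center = TRUE, scale = TRUE)   # centre and scale each column of x
apply(x_std, 2, sd)                              # every predictor now has sample sd equal to 1
```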
Next topic
We will revisit ridge, but first the lasso.