Importance of \(\lambda\)
As we have seen, the penalty parameter \(\lambda\) is of crucial importance in penalised regression.
For \(\lambda=0\): we essentially just get the LS estimates of the full model.
For very large \(\lambda\): all ridge estimates become extremely small, while all lasso estimates are exactly zero!
We require a principled way to fine-tune \(\lambda\) in order to get optimal results.
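To see these two extremes in practice, below is a minimal sketch in R with simulated data (not the lecture's dataset); the variable names and the specific values of \(\lambda\) are purely illustrative.

```r
## Minimal sketch with simulated data: compare ridge and lasso coefficients
## at a tiny and at a very large value of lambda.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% c(2, -1.5, rep(0, p - 2)) + rnorm(n))

ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty

coef(ridge, s = c(0.01, 1000))     # tiny lambda: close to LS; huge lambda: all very small, none exactly 0
coef(lasso, s = c(0.01, 10))       # tiny lambda: close to LS; huge lambda: all exactly 0
```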
What about using \(C_p\), BIC, adjusted-\(R^2\)?
We used these techniques for model search methods, so perhaps we can use them here as well...
For instance, define a grid of values for \(\lambda\), calculate the corresponding \(C_p\) under each resulting model, and then select the \(\lambda\) whose model yields the lowest \(C_p\) value.
The problem is that all these techniques depend on model dimensionality (number of predictors).
With model-search methods, when we have a model with \(d\) predictors it is clear that the model's dimensionality is \(d\).
However, with shrinkage methods the very notion of model dimensionality becomes blurred.
Back to the Credit data lasso paths
The brown line highlights a specific model in the path. This model contains predictors income, limit, rating and student, so \(d=4\).
The pink line highlights another model which contains the same predictors.
So, in both models \(d=4\) but they are not the same models, because their regression coefficients are different!
Solution: cross-validation
In this case the only viable strategy is \(K\)-fold cross-validation. The steps are the following:
Choose the number of folds \(K\).
Split the data into \(K\) folds; in turn, each fold acts as the validation set while the remaining folds form the training set.
Define a grid of values for \(\lambda\).
For each \(\lambda\) calculate the validation MSE within each fold.
For each \(\lambda\) calculate the overall cross-validation MSE by averaging the fold MSEs.
Locate the \(\lambda\) under which the cross-validation MSE is minimised.
Seems difficult, but fortunately glmnet in R will do all of these things for us automatically (see the sketch below)!
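A hedged sketch of this workflow with cv.glmnet (10-fold CV by default); the data and object names here are illustrative, not taken from the lecture's code.

```r
## Sketch of the cross-validation workflow with cv.glmnet.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% c(2, -1.5, rep(0, p - 2)) + rnorm(n))

cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # fits the whole lambda grid and runs CV

cv_lasso$lambda.min   # lambda minimising the CV MSE
cv_lasso$lambda.1se   # the 1-standard-error lambda
plot(cv_lasso)        # CV curve; both lambdas are marked by vertical dotted lines
```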
Some glmnet output
The dotted line to the left of 0 indicates the value of \(\lambda\) which minimises CV error.
The dotted line to the right of 0 indicates the 1-standard-error \(\lambda\); that is, the largest value of \(\lambda\) whose CV error is still within one standard error of the minimum CV error.
The 1-standard-error \(\lambda\) is always at least as large as the minimum-CV \(\lambda\), and so it results in sparser models.
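For example, continuing from the hypothetical cv_lasso object fitted in the sketch above, we can compare the two choices directly:

```r
## Continuing from the cv_lasso fit in the previous sketch.
cv_lasso$lambda.1se >= cv_lasso$lambda.min  # TRUE: the 1-SE rule never picks a smaller lambda
coef(cv_lasso, s = "lambda.min")            # typically more non-zero coefficients
coef(cv_lasso, s = "lambda.1se")            # sparser model: more coefficients are exactly zero
```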
Some simulation-based comparisons
We will close this lecture with a couple of comparisons between ridge and lasso based on two simulations. We consider:
A sparse scenario where \(n=100\), \(p=20\) with only the first 2 predictors being relevant: \(\beta_1 = 1\), \(\beta_2 = -0.5\) and all other \(\beta\)’s equalling 0.
A dense scenario where \(n=100\), \(p=20\) with the first 18 predictors being relevant: here \(\beta_1, \ldots, \beta_{18}\) are assigned random values within the interval \([-1, 1]\).
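The sketch below is an illustrative reconstruction of these two scenarios; the noise level, random seed and use of a separate test set are assumptions, not the lecture's exact settings.

```r
## Illustrative reconstruction of the sparse and dense scenarios.
library(glmnet)

set.seed(42)
n <- 100; p <- 20
beta_sparse <- c(1, -0.5, rep(0, p - 2))        # only the first 2 predictors are relevant
beta_dense  <- c(runif(18, -1, 1), rep(0, 2))   # first 18 coefficients drawn from [-1, 1]

x      <- matrix(rnorm(n * p), n, p)
x_test <- matrix(rnorm(n * p), n, p)

for (scen in c("sparse", "dense")) {
  beta   <- if (scen == "sparse") beta_sparse else beta_dense
  y      <- drop(x %*% beta + rnorm(n))
  y_test <- drop(x_test %*% beta + rnorm(n))
  for (a in c(0, 1)) {                                    # alpha = 0: ridge, alpha = 1: lasso
    cvfit <- cv.glmnet(x, y, alpha = a)
    mse   <- mean((predict(cvfit, x_test, s = "lambda.min") - y_test)^2)
    cat(scen, c("ridge", "lasso")[a + 1], "test MSE:", round(mse, 3), "\n")
  }
}
```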
Sparse scenario
Dense scenario
Next week's lectures
Next week we will discuss principal component regression, which operates in a completely different way! We will also learn about flexible regression methods that allow for non-linear relationships between the response and the predictors.