Important: In the RStudio menu go to Tools->Global Options. Under Workspace, untick the box Restore .RData into workspace at startup (if it is ticked). The reason will be explained during the session.
We will continue analysing the Hitters dataset included in package ISLR, this time applying the method of ridge regression.
We perform the usual preliminary steps, as in the previous two workshops.
library(ISLR)
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
The first thing to do is to install the package glmnet by running the command install.packages('glmnet') (not needed if you have already done this).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1
Initially, we will make use of the function glmnet(), which implements ridge regression without cross-validation but does provide a range of solutions over a grid of \(\lambda\) values. The grid is produced automatically; we do not need to specify anything. Of course, if we want, we can define the grid manually. Let's have a look first at the help document of glmnet().
?glmnet
A first important thing to observe is that glmnet() requires a different syntax than regsubsets() for declaring the response variable and the predictors. Here the response should be a vector, and the predictor variables must be stacked column-wise as a matrix. This is easy to do for the response. For the predictors we have to make use of the model.matrix() command. Let us define the response as y and the matrix of predictors as x in R.
y = Hitters$Salary
x = model.matrix(Salary~., Hitters)[,-1] # Here we exclude the first column
                                         # because it is the intercept (a column of ones)
head(x)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby 321 414 375 1 1 632 43 10
## -Alvin Davis 224 266 263 0 1 880 82 14
## -Andre Dawson 828 838 354 1 0 200 11 3
## -Andres Galarraga 48 46 33 1 0 805 40 4
## -Alfredo Griffin 501 336 194 0 1 282 421 25
## -Al Newman 30 9 24 1 0 76 127 7
## NewLeagueN
## -Alan Ashby 1
## -Alvin Davis 0
## -Andre Dawson 1
## -Andres Galarraga 1
## -Alfredo Griffin 0
## -Al Newman 0
As we can see, the command model.matrix() automatically transformed the categorical variables in Hitters with names League, Division and NewLeague into dummy variables with names LeagueN, DivisionW and NewLeagueN. Other important things that we see in the help document are that glmnet() uses a grid of 100 values (nlambda = 100) for \(\lambda\), and that the parameter alpha is used to select between the ridge and lasso implementations. The default option is alpha = 1, which runs the lasso; for ridge we have to set alpha = 0. Finally, we see the default option standardize = TRUE, which means we do not need to standardize the predictor variables manually.
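As a side note, if we really did want to define the \(\lambda\) grid manually (as mentioned earlier), it can be supplied through the lambda argument of glmnet(). The following is only a sketch: the grid values and the object name ridge.manual are purely illustrative and are not used below.
grid = 10^seq(5, -2, length.out = 100)  # a decreasing grid from 100000 down to 0.01
ridge.manual = glmnet(x, y, alpha = 0, lambda = grid)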
Now we are ready to run our first ridge regression! The command line is extremely simple.
ridge = glmnet(x, y, alpha = 0)
names(ridge)
## [1] "a0" "beta" "df" "dim" "lambda" "dev.ratio"
## [7] "nulldev" "npasses" "jerr" "offset" "call" "nobs"
ridge$lambda
## [1] 255282.09651 232603.53866 211939.68139 193111.54424 175956.04690
## [6] 160324.59666 146081.80138 133104.29678 121279.67791 110505.52560
## [11] 100688.51928 91743.62874 83593.37763 76167.17236 69400.69070
## [16] 63235.32462 57617.67267 52499.07743 47835.20409 43585.65640
## [21] 39713.62682 36185.57767 32970.95069 30041.90230 27373.06250
## [26] 24941.31507 22725.59739 20706.71795 18867.19020 17191.08102
## [31] 15663.87277 14272.33748 13004.42236 11849.14532 10796.49991
## [36] 9837.36861 8963.44390 8167.15625 7441.60860 6780.51660
## [41] 6178.15419 5629.30400 5129.21215 4673.54708 4258.36204
## [46] 3880.06089 3535.36698 3221.29472 2935.12377 2674.37547
## [51] 2436.79132 2220.31350 2023.06697 1843.34327 1679.58574
## [56] 1530.37597 1394.42159 1270.54502 1157.67330 1054.82879
## [61] 961.12071 875.73740 797.93930 727.05257 662.46322
## [66] 603.61182 549.98861 501.12914 456.61020 416.04621
## [71] 379.08581 345.40887 314.72370 286.76452 261.28915
## [76] 238.07694 216.92684 197.65566 180.09647 164.09720
## [81] 149.51926 136.23638 124.13351 113.10583 103.05782
## [86] 93.90245 85.56042 77.95946 71.03376 64.72332
## [91] 58.97348 53.73443 48.96082 44.61127 40.64813
## [96] 37.03706 33.74679 30.74882 28.01718 25.52821
dim(ridge$beta)
## [1] 19 100
ridge$beta[,1:3]
## 19 x 3 sparse Matrix of class "dgCMatrix"
## s0 s1 s2
## AtBat 1.221172e-36 0.0023084432 0.0025301407
## Hits 4.429736e-36 0.0083864077 0.0091931957
## HmRun 1.784944e-35 0.0336900933 0.0369201974
## Runs 7.491019e-36 0.0141728323 0.0155353349
## RBI 7.912870e-36 0.0149583386 0.0163950163
## Walks 9.312961e-36 0.0176281785 0.0193237654
## Years 3.808598e-35 0.0718368521 0.0787191799
## CAtBat 1.048494e-37 0.0001980310 0.0002170325
## CHits 3.858759e-37 0.0007292122 0.0007992257
## CHmRun 2.910036e-36 0.0054982153 0.0060260057
## CRuns 7.741531e-37 0.0014629668 0.0016034328
## CRBI 7.989430e-37 0.0015098418 0.0016548119
## CWalks 8.452752e-37 0.0015957724 0.0017488195
## LeagueN -1.301217e-35 -0.0230232091 -0.0250669217
## DivisionW -1.751460e-34 -0.3339842326 -0.3663713385
## PutOuts 4.891197e-37 0.0009299203 0.0010198028
## Assists 7.989093e-38 0.0001517222 0.0001663687
## Errors -3.725027e-37 -0.0007293365 -0.0008021006
## NewLeagueN -2.585026e-36 -0.0036169801 -0.0038293583
Using the command names() we can see what is returned by the function. The most important components are "beta" and "lambda", and we can easily extract each one using the $ symbol. As we see above, glmnet() defines the grid starting with large values of \(\lambda\); this helps speed up the optimisation procedure. Also, glmnet() performs some internal checks on the predictor variables in order to define the grid values. Therefore, it is generally better to let glmnet() define the grid automatically, rather than defining one ourselves. We further see that, as expected, ridge$beta has 19 rows (the number of predictors) and 100 columns (the length of the grid); the first 3 columns are displayed above. Notice that ridge$beta does not include the intercept estimates. We can use the command coef() for this, which returns all coefficients including the intercept.
coef(ridge)[,1:3]
## 20 x 3 sparse Matrix of class "dgCMatrix"
## s0 s1 s2
## (Intercept) 5.359259e+02 5.279266e+02 5.271582e+02
## AtBat 1.221172e-36 2.308443e-03 2.530141e-03
## Hits 4.429736e-36 8.386408e-03 9.193196e-03
## HmRun 1.784944e-35 3.369009e-02 3.692020e-02
## Runs 7.491019e-36 1.417283e-02 1.553533e-02
## RBI 7.912870e-36 1.495834e-02 1.639502e-02
## Walks 9.312961e-36 1.762818e-02 1.932377e-02
## Years 3.808598e-35 7.183685e-02 7.871918e-02
## CAtBat 1.048494e-37 1.980310e-04 2.170325e-04
## CHits 3.858759e-37 7.292122e-04 7.992257e-04
## CHmRun 2.910036e-36 5.498215e-03 6.026006e-03
## CRuns 7.741531e-37 1.462967e-03 1.603433e-03
## CRBI 7.989430e-37 1.509842e-03 1.654812e-03
## CWalks 8.452752e-37 1.595772e-03 1.748819e-03
## LeagueN -1.301217e-35 -2.302321e-02 -2.506692e-02
## DivisionW -1.751460e-34 -3.339842e-01 -3.663713e-01
## PutOuts 4.891197e-37 9.299203e-04 1.019803e-03
## Assists 7.989093e-38 1.517222e-04 1.663687e-04
## Errors -3.725027e-37 -7.293365e-04 -8.021006e-04
## NewLeagueN -2.585026e-36 -3.616980e-03 -3.829358e-03
We can produce a 1 by 2 plot similar to the ones we have seen in Lecture 2 via the following simple code. On the left we have the regularisation paths based on \(\log\lambda\); on the right we have the "flipped" version based on the \(\ell_1\)-norm.
par(mfrow=c(1, 2))
plot(ridge, xvar = 'lambda')
plot(ridge, xvar = 'norm') # or simply plot(ridge) because xvar = 'norm' is the default
For implementing cross-validation (CV) in order to locate the min-CV \(\lambda\) or the 1-standard-error (1-se) \(\lambda\) we have to use the function cv.glmnet(). Let's have a look first at the help document by typing ?cv.glmnet. We see the argument nfolds = 10, so the default option is 10-fold CV (this is something we can change if we want to, as sketched below).
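For example, if we preferred 5-fold CV we could change this argument as follows (a sketch only; the object name ridge.cv5 is illustrative and not used below).
ridge.cv5 = cv.glmnet(x, y, alpha = 0, nfolds = 5)  # 5-fold CV instead of the default 10-fold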
In order to get the results we simply run the following lines of code.
set.seed(1)
ridge.cv = cv.glmnet(x, y, alpha=0)
ridge.cv$lambda.min
## [1] 25.52821
ridge.cv$lambda.1se
## [1] 1843.343
We see that in this case the 1-se \(\lambda\) is much larger than the min-CV \(\lambda\). This means that we expect the coefficients under the 1-se \(\lambda\) to be much smaller. Let's have a look at the corresponding coefficients from the two models, rounding them to 3 decimal places.
round(cbind(
coef(ridge.cv, s = 'lambda.min'),
coef(ridge.cv, s = 'lambda.1se')), # here we can also use coef(ridge.cv)
# because 'lambda.1se' is the default
3)
## 20 x 2 sparse Matrix of class "dgCMatrix"
## 1 1
## (Intercept) 81.127 159.797
## AtBat -0.682 0.102
## Hits 2.772 0.447
## HmRun -1.366 1.289
## Runs 1.015 0.703
## RBI 0.713 0.687
## Walks 3.379 0.926
## Years -9.067 2.715
## CAtBat -0.001 0.009
## CHits 0.136 0.034
## CHmRun 0.698 0.254
## CRuns 0.296 0.069
## CRBI 0.257 0.071
## CWalks -0.279 0.066
## LeagueN 53.213 5.396
## DivisionW -122.834 -29.097
## PutOuts 0.264 0.068
## Assists 0.170 0.009
## Errors -3.686 -0.236
## NewLeagueN -18.105 4.458
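Predictions at either value of \(\lambda\) can be obtained in the same way through the s argument of predict(). A quick sketch (here newx is simply the training matrix x, purely for illustration):
head(predict(ridge.cv, newx = x, s = 'lambda.min'))  # predictions under the min-CV lambda
head(predict(ridge.cv, newx = x, s = 'lambda.1se'))  # predictions under the 1-se lambda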
Finally, let's plot the results based on the objects ridge and ridge.cv which we have created.
par(mfrow=c(1,2))
plot(ridge.cv)
plot(ridge, xvar = 'lambda')
abline(v = log(ridge.cv$lambda.min), lty = 3) # careful to use the log here and below
abline(v = log(ridge.cv$lambda.1se), lty = 3)
Important note: Any method/technique that relies on validation/cross-validation is subject to variability. If we re-run this code under a different random seed results will change.
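For instance, repeating the cross-validation step under a different (arbitrary) seed will typically give somewhat different selected values of \(\lambda\):
set.seed(2)  # any other seed would do; the results will change slightly
cv.glmnet(x, y, alpha = 0)$lambda.min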
One thing mentioned in the lectures is that one can potentially consider some kind of post-hoc analysis in order to set some of the ridge coefficients equal to zero. A very simplistic approach is to examine the ranking of the absolute coefficients and decide on a cut-off threshold. We can use a combination of the commands sort() and abs() for this.
beta.1se = coef(ridge.cv, s = 'lambda.1se')[-1]
rank.coef = sort(abs(beta.1se), decreasing = TRUE)
rank.coef
## [1] 29.096663728 5.396487429 4.457548059 2.714623467 1.289060569
## [6] 0.925962427 0.702915317 0.686866068 0.446840518 0.253594870
## [11] 0.235989097 0.102483884 0.071334608 0.068874010 0.067805862
## [16] 0.066114944 0.034359576 0.009201998 0.008746278
The last two coefficients are quite small, so we could potentially set them equal to zero. Let's do that and examine whether this "sparsification" offers any benefits in terms of predictive performance. This means that we will have to split the data into training and testing samples in order to check.
repetitions = 50
cor.1 = c()
cor.2 = c()
set.seed(1)
for(i in 1:repetitions){
# Step (i) data splitting
training.obs = sample(1:263, 175)
y.train = Hitters$Salary[training.obs]
x.train = model.matrix(Salary~., Hitters[training.obs, ])[,-1]
y.test = Hitters$Salary[-training.obs]
x.test = model.matrix(Salary~., Hitters[-training.obs, ])[,-1]
# Step (ii) training phase
ridge.train1 = cv.glmnet(x.train, y.train, alpha = 0)
# Step (iii) here we find which coefficients to set equal to zero
ind = which(abs(coef(ridge.train1)) < 0.02)
beta.sparse = coef(ridge.train1)
beta.sparse[ind] = 0
# Step (iv) generating predictions
predict.1 = predict(ridge.train1, x.test, s = 'lambda.1se')
predict.2 = cbind(1, x.test) %*% as.numeric(beta.sparse) # here we cannot use the
# predict function so we
# use standard matrix
# multiplication
# Step (v) evaluating predictive performance
cor.1[i] = cor(y.test, predict.1)
cor.2[i] = cor(y.test, predict.2)
}
boxplot(cor.1, cor.2, names = c('Standard Ridge','Sparse Ridge'), ylab = 'Test correlation', col = 3)
Predictive performance in this case is almost identical. However, in cases where many coefficients are very small, sparsification of ridge can actually help in terms of prediction. We also note that ridge has better predictive performance than the model-search methods based on \(C_p\), BIC and adjusted-\(R^2\) if we compare the medians (black lines in the boxplots above).
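The medians mentioned above can be computed directly from the stored correlations:
median(cor.1)  # standard ridge
median(cor.2)  # sparsified ridge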
Let's continue the analysis of the body measurements data.
library(faraway)
fat = fat[,-c(2, 3, 8)]
dim(fat)
## [1] 252 15
Ridge regression. Use glmnet() to perform an analysis similar to the one above, producing the regularisation-path plots based on the logarithm of \(\lambda\) and on the \(\ell_1\)-norm, using a grid of length 200.
Ridge regression with cross-validation. Using cv.glmnet() with 5-fold CV and random seed equal to 2, find the values of the min-CV \(\lambda\) and the 1-se \(\lambda\). Plot the CV curve next to the regularisation-path plot, indicating the two \(\lambda\) values with vertical lines. Which predictor has the highest absolute coefficient?
Prediction. Using 50 repetitions with random seed set to 2, compare the predictive performance of the model with the min-CV \(\lambda\) vs. the model with the 1-se \(\lambda\). In this case use 10-fold CV within cv.glmnet() and a 50% split for training and testing. Produce boxplots as above and also calculate the mean correlations from the two models. Under which approach does ridge seem to perform better?