Important: In the RStudio menu go to Tools -> Global Options. Under Workspace, untick the box Restore .RData into workspace at startup (if it is ticked). The reason will be explained during the session.

Tutorial: Baseball Data

We will continue analysing the Hitters dataset included in package ISLR, this time applying the method of ridge regression.

Removing missing entries

We perform the usual first steps that are required, as we have seen in the previous two workshops.

library(ISLR)
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263  20
sum(is.na(Hitters))
## [1] 0

Ridge regression

The first thing to do is to install the package glmnet by running the command install.packages('glmnet') (not needed if you have already done this).

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1

Initially, we will make use of the function glmnet(), which implements ridge regression without cross-validation but does give a range of solutions over a grid of \(\lambda\) values. The grid is produced automatically; we do not need to specify anything. Of course, if we want, we can also define the grid manually (a small sketch of this is shown further below). Let’s have a look first at the help document of glmnet().

?glmnet

A first important thing to observe is that glmnet() requires a different syntax than regsubsets() for declaring the response variable and the predictors. Now the response should be a vector, and the predictor variables must be stacked column-wise as a matrix. This is easy to do for the response. For the predictors we have to make use of the model.matrix() command. Let us define y as the response and x as the matrix of predictors in R.

y = Hitters$Salary
x = model.matrix(Salary~., Hitters)[,-1] # Here we exclude the first column 
                                         # because it is the intercept column of
                                         # 1s that model.matrix() adds
head(x)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
## -Al Newman          185   37     1   23   8    21     2    214    42      1
##                   CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby         321  414    375       1         1     632      43     10
## -Alvin Davis        224  266    263       0         1     880      82     14
## -Andre Dawson       828  838    354       1         0     200      11      3
## -Andres Galarraga    48   46     33       1         0     805      40      4
## -Alfredo Griffin    501  336    194       0         1     282     421     25
## -Al Newman           30    9     24       1         0      76     127      7
##                   NewLeagueN
## -Alan Ashby                1
## -Alvin Davis               0
## -Andre Dawson              1
## -Andres Galarraga          1
## -Alfredo Griffin           0
## -Al Newman                 0
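
As a quick check (just a small sketch), we can confirm that x is now a numeric matrix of predictors and y a plain vector:

dim(x)      # one row per player, one column per (dummy-coded) predictor
length(y)   # the response is a numeric vector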

As we can see, the command model.matrix() automatically transformed the categorical variables in Hitters with names League, Division and NewLeague to dummy variables with names LeagueN, DivisionW and NewLeagueN. Other important things that we see in the help document are that glmnet() uses a grid of 100 values (nlambda = 100) for \(\lambda\), and that the parameter alpha is used to select between the ridge implementation and the lasso implementation. The default option is alpha = 1, which runs the lasso; for ridge we have to set alpha = 0. Finally, we see the default option standardize = TRUE, which means we do not need to standardise the predictor variables manually.

Now we are ready to run our first ridge regression! The command line is extremely simple.

ridge = glmnet(x, y, alpha = 0)
names(ridge)
##  [1] "a0"        "beta"      "df"        "dim"       "lambda"    "dev.ratio"
##  [7] "nulldev"   "npasses"   "jerr"      "offset"    "call"      "nobs"
ridge$lambda
##   [1] 255282.09651 232603.53866 211939.68139 193111.54424 175956.04690
##   [6] 160324.59666 146081.80138 133104.29678 121279.67791 110505.52560
##  [11] 100688.51928  91743.62874  83593.37763  76167.17236  69400.69070
##  [16]  63235.32462  57617.67267  52499.07743  47835.20409  43585.65640
##  [21]  39713.62682  36185.57767  32970.95069  30041.90230  27373.06250
##  [26]  24941.31507  22725.59739  20706.71795  18867.19020  17191.08102
##  [31]  15663.87277  14272.33748  13004.42236  11849.14532  10796.49991
##  [36]   9837.36861   8963.44390   8167.15625   7441.60860   6780.51660
##  [41]   6178.15419   5629.30400   5129.21215   4673.54708   4258.36204
##  [46]   3880.06089   3535.36698   3221.29472   2935.12377   2674.37547
##  [51]   2436.79132   2220.31350   2023.06697   1843.34327   1679.58574
##  [56]   1530.37597   1394.42159   1270.54502   1157.67330   1054.82879
##  [61]    961.12071    875.73740    797.93930    727.05257    662.46322
##  [66]    603.61182    549.98861    501.12914    456.61020    416.04621
##  [71]    379.08581    345.40887    314.72370    286.76452    261.28915
##  [76]    238.07694    216.92684    197.65566    180.09647    164.09720
##  [81]    149.51926    136.23638    124.13351    113.10583    103.05782
##  [86]     93.90245     85.56042     77.95946     71.03376     64.72332
##  [91]     58.97348     53.73443     48.96082     44.61127     40.64813
##  [96]     37.03706     33.74679     30.74882     28.01718     25.52821
dim(ridge$beta)
## [1]  19 100
ridge$beta[,1:3]
## 19 x 3 sparse Matrix of class "dgCMatrix"
##                       s0            s1            s2
## AtBat       1.221172e-36  0.0023084432  0.0025301407
## Hits        4.429736e-36  0.0083864077  0.0091931957
## HmRun       1.784944e-35  0.0336900933  0.0369201974
## Runs        7.491019e-36  0.0141728323  0.0155353349
## RBI         7.912870e-36  0.0149583386  0.0163950163
## Walks       9.312961e-36  0.0176281785  0.0193237654
## Years       3.808598e-35  0.0718368521  0.0787191799
## CAtBat      1.048494e-37  0.0001980310  0.0002170325
## CHits       3.858759e-37  0.0007292122  0.0007992257
## CHmRun      2.910036e-36  0.0054982153  0.0060260057
## CRuns       7.741531e-37  0.0014629668  0.0016034328
## CRBI        7.989430e-37  0.0015098418  0.0016548119
## CWalks      8.452752e-37  0.0015957724  0.0017488195
## LeagueN    -1.301217e-35 -0.0230232091 -0.0250669217
## DivisionW  -1.751460e-34 -0.3339842326 -0.3663713385
## PutOuts     4.891197e-37  0.0009299203  0.0010198028
## Assists     7.989093e-38  0.0001517222  0.0001663687
## Errors     -3.725027e-37 -0.0007293365 -0.0008021006
## NewLeagueN -2.585026e-36 -0.0036169801 -0.0038293583

Using the command names() we can find what is returned by the function. The most important components are "beta" and "lambda", each of which can easily be extracted using the $ symbol. As we see above, glmnet() defines the grid starting with large values of \(\lambda\); this helps speed up the optimisation procedure. Also, glmnet() performs some internal checks on the predictor variables in order to define the grid values. Therefore, it is generally better to let glmnet() define the grid automatically, rather than defining one ourselves. We further see that, as expected, ridge$beta has 19 rows (the number of predictors) and 100 columns (the length of the grid); above, the first 3 columns are displayed. Notice that ridge$beta does not include the estimate of the intercept. For this we can use the command coef(), which returns all coefficients including the intercept.

coef(ridge)[,1:3]
## 20 x 3 sparse Matrix of class "dgCMatrix"
##                        s0            s1            s2
## (Intercept)  5.359259e+02  5.279266e+02  5.271582e+02
## AtBat        1.221172e-36  2.308443e-03  2.530141e-03
## Hits         4.429736e-36  8.386408e-03  9.193196e-03
## HmRun        1.784944e-35  3.369009e-02  3.692020e-02
## Runs         7.491019e-36  1.417283e-02  1.553533e-02
## RBI          7.912870e-36  1.495834e-02  1.639502e-02
## Walks        9.312961e-36  1.762818e-02  1.932377e-02
## Years        3.808598e-35  7.183685e-02  7.871918e-02
## CAtBat       1.048494e-37  1.980310e-04  2.170325e-04
## CHits        3.858759e-37  7.292122e-04  7.992257e-04
## CHmRun       2.910036e-36  5.498215e-03  6.026006e-03
## CRuns        7.741531e-37  1.462967e-03  1.603433e-03
## CRBI         7.989430e-37  1.509842e-03  1.654812e-03
## CWalks       8.452752e-37  1.595772e-03  1.748819e-03
## LeagueN     -1.301217e-35 -2.302321e-02 -2.506692e-02
## DivisionW   -1.751460e-34 -3.339842e-01 -3.663713e-01
## PutOuts      4.891197e-37  9.299203e-04  1.019803e-03
## Assists      7.989093e-38  1.517222e-04  1.663687e-04
## Errors      -3.725027e-37 -7.293365e-04 -8.021006e-04
## NewLeagueN  -2.585026e-36 -3.616980e-03 -3.829358e-03
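
For completeness, here is a minimal sketch of how a grid could be supplied manually via the lambda argument (the grid values and the object name ridge.manual below are purely illustrative; as noted above, letting glmnet() choose the grid is usually preferable):

grid = 10^seq(5, -2, length = 100)                    # illustrative decreasing grid
ridge.manual = glmnet(x, y, alpha = 0, lambda = grid)
length(ridge.manual$lambda)                           # equals the length of our grid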

We can produce a 1 by 2 plot similar to the ones we have seen in Lecture 2 via the following simple code. On the left we have the regularisation paths plotted against \(\log\lambda\); on the right we have the “flipped” version plotted against the \(\ell_1\)-norm.

par(mfrow=c(1, 2))
plot(ridge, xvar = 'lambda')
plot(ridge, xvar = 'norm')  # or simply plot(ridge) because xvar = 'norm' is the default

Ridge regression with cross-validation for \(\lambda\)

To implement cross-validation (CV) in order to locate the min-CV \(\lambda\) or the 1-standard-error (1-se) \(\lambda\) we have to use the function cv.glmnet(). Let’s have a look first at the help document by typing ?cv.glmnet. We see the argument nfolds = 10, so the default option is 10-fold CV (this is something we can change if we want to; see the sketch below).
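
For instance, here is a small sketch of the same call with 5-fold CV instead of the default 10-fold (the object name ridge.cv5 is just illustrative):

ridge.cv5 = cv.glmnet(x, y, alpha = 0, nfolds = 5)   # only the nfolds argument changes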

For our analysis we keep the default 10-fold CV and simply run the following lines of code.

set.seed(1)
ridge.cv = cv.glmnet(x, y, alpha=0)
ridge.cv$lambda.min
## [1] 25.52821
ridge.cv$lambda.1se 
## [1] 1843.343

We see that in this case the 1-se \(\lambda\) is much larger than the min-CV \(\lambda\). This means that we expect the coefficients under the 1-se \(\lambda\) to be shrunk much more towards zero. Let’s have a look at the corresponding coefficients from the two models, rounding them to 3 decimal places.

round(cbind(
  coef(ridge.cv, s = 'lambda.min'),
  coef(ridge.cv, s = 'lambda.1se')),   # here we can also use coef(ridge.cv) 
                                       # because 'lambda.1se' is the default
3)
## 20 x 2 sparse Matrix of class "dgCMatrix"
##                    1       1
## (Intercept)   81.127 159.797
## AtBat         -0.682   0.102
## Hits           2.772   0.447
## HmRun         -1.366   1.289
## Runs           1.015   0.703
## RBI            0.713   0.687
## Walks          3.379   0.926
## Years         -9.067   2.715
## CAtBat        -0.001   0.009
## CHits          0.136   0.034
## CHmRun         0.698   0.254
## CRuns          0.296   0.069
## CRBI           0.257   0.071
## CWalks        -0.279   0.066
## LeagueN       53.213   5.396
## DivisionW   -122.834 -29.097
## PutOuts        0.264   0.068
## Assists        0.170   0.009
## Errors        -3.686  -0.236
## NewLeagueN   -18.105   4.458
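
As a quick sketch, we can also quantify the extra shrinkage by comparing the \(\ell_2\)-norms of the two coefficient vectors (excluding the intercept):

b.min = coef(ridge.cv, s = 'lambda.min')[-1]
b.1se = coef(ridge.cv, s = 'lambda.1se')[-1]
c(sqrt(sum(b.min^2)), sqrt(sum(b.1se^2)))   # the 1-se solution has a much smaller norm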

Finally, let’s plot the results based on the objects ridge and ridge.cv which we have created.

par(mfrow=c(1,2))
plot(ridge.cv)
plot(ridge, xvar = 'lambda')
abline(v = log(ridge.cv$lambda.min), lty = 3) # careful to use the log here and below
abline(v = log(ridge.cv$lambda.1se), lty = 3)

Important note: Any method/technique that relies on validation/cross-validation is subject to variability. If we re-run this code under a different random seed, the results will change.
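
A small sketch of this (the seeds below are arbitrary and purely for illustration): re-running cv.glmnet() under two different seeds will typically select different values of \(\lambda\).

set.seed(10)
cv.a = cv.glmnet(x, y, alpha = 0)
set.seed(20)
cv.b = cv.glmnet(x, y, alpha = 0)
c(cv.a$lambda.min, cv.b$lambda.min)   # usually not identical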

Sparsifying the ridge coefficients?

One thing mentioned in the lectures is that one can potentially consider some kind of post-hoc analysis in order to set some of the ridge coefficients equal to zero. A very simplistic approach is to examine the ranking of the absolute coefficients and decide on a “cutting threshold”. We can use a combination of the commands sort() and abs() for this.

beta.1se = coef(ridge.cv, s = 'lambda.1se')[-1]   # excluding the intercept
rank.coef = sort(abs(beta.1se), decreasing = TRUE)
rank.coef
##  [1] 29.096663728  5.396487429  4.457548059  2.714623467  1.289060569
##  [6]  0.925962427  0.702915317  0.686866068  0.446840518  0.253594870
## [11]  0.235989097  0.102483884  0.071334608  0.068874010  0.067805862
## [16]  0.066114944  0.034359576  0.009201998  0.008746278

The last two coefficients are quite small, so we could potentially set these two equal to zero. Let’s do that and examine whether this “sparsification” offers any benefits in terms of predictive performance. This means that we will have to split the data into training and testing samples in order to check.

repetitions = 50
cor.1 = c()
cor.2 = c()

set.seed(1)                
for(i in 1:repetitions){
  # Step (i) data splitting
  training.obs = sample(1:263,  175)
  y.train = Hitters$Salary[training.obs]
  x.train = model.matrix(Salary~., Hitters[training.obs, ])[,-1]
  y.test = Hitters$Salary[-training.obs]
  x.test = model.matrix(Salary~., Hitters[-training.obs, ])[,-1]
  
  # Step (ii) training phase
  ridge.train1 = cv.glmnet(x.train, y.train, alpha = 0)
  
  # Step (iii) here we find which coefficients to set equal to zero
  ind = which(abs(coef(ridge.train1)) < 0.02)
  beta.sparse = coef(ridge.train1)
  beta.sparse[ind] = 0
  
  # Step (iv) generating predictions
  predict.1 = predict(ridge.train1, x.test, s = 'lambda.1se')
  predict.2 = cbind(1, x.test) %*% as.numeric(beta.sparse)   # here we cannot use the
                                                             # predict function so we  
                                                             # use standard matrix 
                                                             # multiplication  
  # Step (v) evaluating predictive performance
  cor.1[i] = cor(y.test, predict.1)
  cor.2[i] = cor(y.test, predict.2)
}

boxplot(cor.1, cor.2, names = c('Standard Ridge','Sparse Ridge'), ylab = 'Test correlation', col = 3)

Predictive performance in this case is almost identical. However, in some cases where we have many coefficients that are very small, sparsification of ridge can actually help in terms of prediction. We also note that ridge has better predictive performance than the model-search methods based on \(C_p\), BIC and adjusted \(R^2\) if we compare the medians (the black lines in the boxplots above).
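
To read off the medians numerically (a quick check, assuming the objects cor.1 and cor.2 from the loop above):

c(median(cor.1), median(cor.2))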

Practice: Body Fat and Body Measurements Data

Let’s continue the analysis of the body measurements data.

library(faraway) 
fat = fat[,-c(2, 3, 8)]
dim(fat)
## [1] 252  15

Task 1

Ridge regression. Use glmnet() to perform an analysis similar to the one above, producing the regularisation-path plots based on the logarithm of \(\lambda\) and on the \(\ell_1\)-norm, using a grid length of 200.

Task 2

Ridge regression with cross-validation. Using cv.glmnet() with 5-fold CV and random seed equal to 2, find the values of the min-CV \(\lambda\) and the 1-se \(\lambda\). Plot the CV graph next to the regularisation-path plot, indicating the two \(\lambda\) values with vertical lines. Which predictor has the highest absolute coefficient?

Task 3

Prediction. Using 50 repetitions with the random seed set to 2, compare the predictive performance of the model with the min-CV \(\lambda\) vs. the model with the 1-se \(\lambda\). In this case use 10-fold CV within cv.glmnet() and a 50% split for training and testing. Produce boxplots as above and also calculate the mean correlations from the two models. Under which approach does ridge seem to perform better?