Tutorial: Baseball Data

In this workshop we will start with an analysis of the Hitters dataset included in package ISLR. This dataset consists of 322 records of baseball players. The response variable is the players’ salary and the number of predictors is 19, including variables such as number of hits, years in the league and so forth.

Data handling

As in Workshop 1, we first load the dataset and “clean” it by removing entries with missing values.

library(ISLR)
help(Hitters)
names(Hitters)              
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"
head(Hitters)               
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson      293   66     1   30  29    14     1    293    66      1
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson       30   29     14      A        E     446      33     20
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
##                   Salary NewLeague
## -Andy Allanson        NA         A
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A
dim(Hitters)
## [1] 322  20
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263  20
sum(is.na(Hitters))
## [1] 0

Best Subset Selection: validation

Now we will apply best subset selection based not on model selection criteria but on the validation approach. For validation (and cross-validation) we need to be able to predict from a fitted model. As discussed in Workshop 1, regsubsets() does not have its own built-in prediction function, so we will use the following function.

predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # extract the formula used in the regsubsets() call
  mat = model.matrix(form, newdata)    # design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the model with 'id' predictors
  xvars = names(coefi)                 # names of the selected variables
  mat[, xvars] %*% coefi               # predictions: X %*% beta for the selected variables
}
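
Because the helper follows R’s S3 naming convention (predict.regsubsets for objects of class regsubsets), it can also be called through the generic predict(), which is what we will do inside the cross-validation loop later on. A minimal sketch of its use (the choice of the 3-predictor model here is arbitrary and purely illustrative):

library(leaps)
fit = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
head(predict(fit, Hitters, id = 3))              # S3 dispatch finds predict.regsubsets()
head(predict.regsubsets(fit, Hitters, id = 3))   # equivalent explicit call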

The validation approach is based on splitting the data once into a training sample and a validation (testing) sample. We also need to decide how to split the data; a common choice is to use 2/3 of the sample for training and 1/3 for testing, although other splits are possible. At the end of Workshop 1 we saw how to do this using the command sample(). The code is given below.

set.seed(10)          # for the results to be reproducible   
dim(Hitters)
## [1] 263  20
training.obs = sample(1:263, 175)
Hitters.train = Hitters[training.obs,  ]
Hitters.test = Hitters[-training.obs,  ]
dim(Hitters.train)
## [1] 175  20
dim(Hitters.test)
## [1] 88 20
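
As a quick sanity check (purely illustrative, not part of the original code), we can confirm that the two subsets are disjoint and together cover all 263 observations:

length(intersect(rownames(Hitters.train), rownames(Hitters.test)))   # should be 0
nrow(Hitters.train) + nrow(Hitters.test)                              # should be 263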

We will also need to load package leaps again as we will use the command regsubsets().

library(leaps)

The procedure is executed as follows. We first apply regsubsets() to the training set. Then we create an empty vector val.error in which we will store the 19 validation mean squared errors (MSEs), one for each candidate model. Inside a for loop we (i) generate predictions using predict.regsubsets() and (ii) calculate the validation MSEs. At the end we locate the model that produces the lowest MSE. The code is the following.

best = regsubsets(Salary~., data = Hitters.train, nvmax = 19)

val.error<-c()
for(i in 1:19){
  pred = predict.regsubsets(best, Hitters.test, i)
  val.error[i] = mean((Hitters.test$Salary - pred)^2)
}

val.error
##  [1] 199324.8 184215.9 196949.2 192709.1 180806.0 179388.2 178929.9 176754.4
##  [9] 177029.2 163013.2 165134.5 164986.7 166072.1 164983.9 164890.7 165144.4
## [17] 164983.7 164962.6 164959.9
which.min(val.error)
## [1] 10

In this instance, with the seed fixed via set.seed(10) (more on the role of the seed below), we see that the model with 10 predictors is selected as the best. This was also the model selected by \(C_p\) in Workshop 1.

It is important to note that there is a final step in this procedure. We should not use the current coefficients for any purpose, because they were “learned” from the reduced training sample. They will not be the same as the coefficients obtained by fitting the 10-predictor model to the entire dataset; in fact, refitting on the full data may even select different variables!

coef(best, 10)
##  (Intercept)        AtBat         Hits         Runs        Walks       CAtBat 
##  -21.6323918   -0.7961450    4.5874937   -3.1211583    7.5830718   -0.1058408 
##        CRuns         CRBI       CWalks    DivisionW      PutOuts 
##    1.9895304    0.3010303   -1.1736748 -121.3385526    0.1989660
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(best, 10)
##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
##         CRBI       CWalks    DivisionW      PutOuts      Assists 
##    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680
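
To see this concretely, the following sketch (the object name best.train is introduced here purely for illustration) compares which variables enter the two 10-predictor models:

best.train = regsubsets(Salary ~ ., data = Hitters.train, nvmax = 19)
setdiff(names(coef(best.train, 10)), names(coef(best, 10)))   # selected from the training sample only
setdiff(names(coef(best, 10)), names(coef(best.train, 10)))   # selected from the full data only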

Best Subset Selection: cross-validation

Now let’s turn our attention to the K-fold cross-validation (CV) approach. This procedure is generally preferable to simple validation because it utilizes the entire sample for training as well as for testing; as we will see below, this also decreases the variance of the selection. A first question is how many folds to use. Common choices in practice are 5 and 10, leading to 5-fold and 10-fold CV. In this example we will use 10 folds. Of course, often the sample size n will not be exactly divisible by K (as in our case, where n = 263 and K = 10). This does not matter; it is not a problem if some folds have slightly fewer observations.
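
For instance, a quick bit of arithmetic (illustrative only) shows what to expect here: 263 divided by 10 leaves a remainder of 3, so three folds will contain 27 observations and the remaining seven will contain 26.

263 %/% 10   # 26 observations per fold before distributing the remainder
263 %%  10   # 3 folds receive one extra observation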

Let’s see how we can create folds in R. First we need to form an indicator vector of length 263 containing the numbers 1, 2, ..., 10, each repeated roughly the same number of times. This can be done by combining the commands cut() and seq() as below.

k = 10
folds = cut(seq(1, 263), breaks=10, labels=FALSE)
folds
##   [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [26]  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
##  [51]  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
##  [76]  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [101]  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
## [126]  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6
## [151]  6  6  6  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
## [176]  7  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
## [201]  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9
## [226]  9  9  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10 10 10 10 10 10 10
## [251] 10 10 10 10 10 10 10 10 10 10 10 10 10
table(folds)
## folds
##  1  2  3  4  5  6  7  8  9 10 
## 27 26 26 26 27 26 26 26 26 27

It is then very important to randomly re-shuffle this vector, because we do not want the fold-creation procedure to be deterministic. We can do this very easily in R using, once again, the command sample().

set.seed(2)
folds = sample(folds)
folds
##   [1]  8 10  8  7  3  5  3  6  9  2  3  4  5  9  7  2 10  5  7  6  2 10  8  6  4
##  [26]  1  5  8  7  5 10  4  5 10  8  8  1  5  9  2  2 10  8 10  1  1  7  9  2  6
##  [51]  5  8  3  4  8  2  1  6  8 10  5  7  2  3  7  7  1  1  7  3  9  8  4  3  6
##  [76]  7  5  5 10  2  5  6  2  1  4  5  9  1  3  3  8  3 10  8  6  1  2  6  1  4
## [101]  9  6  4 10  9  1  9  7  9  8  4  8  6  1 10  2 10 10  7  9  3  9  4  4  5
## [126]  6 10  3  9  2  8  8  6  2  2  6  1  9  1 10  7  4  2  4  8  5  8  1  6  3
## [151]  1 10 10  1  3  3  3  7  3  4  2  9  4  6  8  9  7  2  1  7  4  3  5  7  8
## [176]  4  9  7  9  5  2  1  1  6  4  4  5 10  6  5  1 10  9  7  3  1  5  2  7  7
## [201]  4  6  7 10  3  6  5 10  4  9  9  5  7  7  2  6  9  5  5  3  2  3  5  5  9
## [226]  3  6  1  1  9  8  6  8  9 10  4  4 10 10  5  2  3  3  6  3  4 10  9  6  2
## [251]  1  8  6  7  8  4  2  8  1 10  2  7  4

Finally, we need to create a matrix, this time with K = 10 rows and 19 columns. The validation MSEs of the folds for model 1 will be stored in the first column of this matrix, the validation MSEs of the folds for model 2 in the second column, and so forth.

cv.errors = matrix(NA, nrow = k, ncol = 19, dimnames = list(NULL, paste(1:19)))
cv.errors 
##        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
##  [1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [7,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [8,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [9,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Above, NA means “not available”; these entries will be filled with the MSEs as we go along. The argument dimnames = list(NULL, paste(1:19)) assigns no row names and names the columns 1, 2, ..., 19.

Now we can implement best subset selection based on CV. Here we need a double for loop: the outer loop runs over the folds and the inner loop over the models. The important part is to understand what data = Hitters[folds!=j, ] is doing in the outer loop. In R, != means “not equal”. So, for example, when j = 1 we discard the rows of Hitters for which the corresponding entry of the fold vector equals 1. In the inner loop, where we predict, things are reversed and we use Hitters[folds==j, ] and Hitters$Salary[folds==j]. The validation MSEs are stored iteratively in the matrix cv.errors.

for(j in 1:k){
  best.fit = regsubsets(Salary~., data = Hitters[folds!=j, ], nvmax = 19)
  for(i in 1:19){
    pred = predict(best.fit, Hitters[folds==j, ], id = i)
    cv.errors[j,i] = mean( (Hitters$Salary[folds==j]-pred)^2)
  }
}
cv.errors
##               1         2         3         4        5         6        7
##  [1,] 180095.35 140597.34 139979.20 155610.99 178269.2 169549.63 163415.9
##  [2,] 102991.38  95451.06 113165.61 121295.24 135624.2 123478.30 133437.4
##  [3,] 155638.99 122730.49 126006.08 115333.77 108912.7  98247.02 104038.5
##  [4,] 152325.76 114621.36 125147.80 110042.53 113753.6 120745.66 117996.8
##  [5,] 255514.09 277193.55 270456.47 258450.67 255676.2 231835.72 244154.3
##  [6,] 177009.41 331571.03 226916.27 203351.68 177070.7 151180.00 148773.5
##  [7,] 181272.43 127906.28 126475.30 120322.72 119688.6  95275.05 105056.3
##  [8,]  68261.04  87882.66  97202.19 102823.06 105857.1 106254.47 108113.3
##  [9,]  68283.58 100979.35  98096.98 111093.82 104191.5  97551.93 102718.8
## [10,] 132803.93  83946.03  88047.42  72391.53  70098.4  63602.55  66458.2
##               8         9        10        11        12        13        14
##  [1,] 149809.38 139194.15 145352.00 149295.49 155846.57 166191.81 160500.72
##  [2,] 118057.28 122960.98 111992.83 114778.64 114437.92 111970.89 112782.56
##  [3,]  92079.40 102035.64 102208.75  87132.03  94810.27  90853.96  89914.72
##  [4,] 109997.63 108747.37 113161.24 119481.82 119081.00 119718.04 125344.00
##  [5,] 234995.16 228870.88 229668.35 230603.16 226742.89 228211.32 227006.44
##  [6,] 167150.69 132632.73 147014.32 125897.63 123587.05 131048.28 129096.09
##  [7,]  96645.30  97194.98 115230.50 103373.29 107759.17 107231.14 108093.31
##  [8,]  96720.35  99794.87 100105.45  98519.80 104422.89 102786.61 101791.75
##  [9,]  95467.90 102061.37  97701.55 104063.92 105347.58 104597.96 106703.51
## [10,]  56856.51  55356.77  48005.26  48964.83  47179.56  49761.98  48941.38
##              15        16        17        18        19
##  [1,] 158269.29 157488.10 157949.72 157841.95 159337.52
##  [2,] 112515.11 111740.27 112085.76 111576.45 111492.04
##  [3,]  93356.48  93141.77  93031.62  92770.19  92834.25
##  [4,] 126959.49 125441.40 125312.29 125078.64 125069.05
##  [5,] 226562.34 226576.32 226875.52 228599.44 228438.30
##  [6,] 130855.21 131669.09 130801.91 131222.61 131767.46
##  [7,] 107299.53 107115.77 107135.90 107100.60 107117.21
##  [8,] 102182.97 102297.60 102684.98 102856.16 102882.81
##  [9,] 105865.52 105706.35 105443.95 105648.15 105519.79
## [10,]  52259.35  52401.54  52216.29  52054.44  52034.40

Once we obtain the matrix cv.errors, we can use apply(cv.errors, 2, mean) as below to get the column-wise means (one mean CV error per model). Had we used apply(cv.errors, 1, mean), we would have wrongly calculated the 10 row-wise means instead. Then all we do is locate the minimum. As we see below, the best model based on CV is the one with 11 predictors; as before, we re-train this model using the entire dataset.

mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 147419.6 148287.9 141149.3 137071.6 136914.2 125772.0 129416.3 121778.0 
##        9       10       11       12       13       14       15       16 
## 118885.0 121044.0 118211.1 119921.5 121237.2 121017.4 121612.5 121357.8 
##       17       18       19 
## 121353.8 121474.9 121649.3
which.min(mean.cv.errors)
## 11 
## 11
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(best, 11)
##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310 
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277
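
For a visual check of the CV results, one could also plot the mean CV errors against model size and highlight the minimum. The plotting choices in this sketch are illustrative and not part of the original workshop code.

plot(mean.cv.errors, type = 'b', xlab = 'Number of Predictors', ylab = 'Mean CV MSE')
points(which.min(mean.cv.errors), min(mean.cv.errors), col = 2, pch = 19)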

Variance of selection based on validation and cross-validation

When using validation or cross-validation for best subset selection (or any stepwise method), it is important to understand that the results depend on the random split between training and testing samples. This is a drawback which is generally not discussed much, especially for the simple validation approach. The code below implements the validation approach presented previously, but now starting from 50 different random seeds.

min.valid = c()

for(n in 1:50){
  set.seed(n)          
  training.obs = sample(1:263, 175)
  Hitters.train = Hitters[training.obs,  ]
  Hitters.test = Hitters[-training.obs,  ]
  dim(Hitters.train)
  dim(Hitters.test)
  
  best = regsubsets(Salary~., data = Hitters.train, nvmax = 19)
  
  val.error<-c()
  for(i in 1:19){
    pred = predict.regsubsets(best, Hitters.test, i)
    val.error[i] = mean((Hitters.test$Salary - pred)^2)
  }
  val.error
  min.valid[n] = which.min(val.error)
}

The histogram below shows that the final selection of the best model can vary a lot! Obviously this is not desirable. On average we are selecting close to 9 or 10 predictors, but this implies that we would have to repeat the procedure many times in order to be on the safe side, which increases the computational cost of BSS based on validation.

hist(min.valid, col = 3, nclass = 50, xlab = 'Number of Predictors', main = 'BSS with validation')
abline(v = mean(min.valid), col = 2, lwd = 4)
legend('topright', legend=c('Average selection'),bty = 'n', lty = 1, lwd = 4, col = 2)
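
To quantify this spread (a minimal sketch, not part of the original code), we can tabulate the selected model sizes over the 50 seeds and compute their standard deviation:

table(min.valid)   # how often each model size is selected across the 50 seeds
sd(min.valid)      # spread of the selected model sizes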

Let’s now examine how cross-validation is affected. In the code below we implement BSS based on CV, starting again from 50 different random seeds. Note that here the folds are created slightly differently from before: we sample the fold labels 1 to 10 with replacement via sample(1:k, 263, replace = TRUE), so the fold sizes vary randomly from run to run.

min.valid1 = c()

for(n in 1:50){
  set.seed(n)
  k = 10
  folds = sample(1:k, 263, replace = TRUE)
  folds
  cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
  
  for(j in 1:k){
    best.fit = regsubsets(Salary~., data = Hitters[folds!=j, ], nvmax = 19)
    for(i in 1:19){
      pred = predict(best.fit, Hitters[folds==j, ], id = i)
      cv.errors[j,i] = mean( (Hitters$Salary[folds==j]-pred)^2)
    }
  }
  
  mean.cv.errors = apply(cv.errors, 2, mean)
  mean.cv.errors
  min.valid1[n] = which.min(mean.cv.errors)
}

As we see from the histogram, there is still variability, but the selection based on cross-validation is generally much more stable than the selection based on simple validation. This is because eventually all of the data are used for both training and testing. Even so, one should be a bit reluctant to draw conclusions from just a single run. Note also that the computational cost is now even higher, due to the repeated data splitting that is required.

hist(min.valid1, col = 3, nclass = 50, xlab = 'Number of Predictors', main = 'BSS with cross-validation')
abline(v = mean(min.valid1), col = 2, lwd = 4)
legend('topright', legend=c('Average selection'),bty = 'n', lty = 1, lwd = 4, col = 2)
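
An informal way to compare the two procedures directly (a sketch, assuming min.valid and min.valid1 from the loops above are still available):

sd(min.valid)    # spread of selected model sizes under simple validation
sd(min.valid1)   # spread under 10-fold cross-validation (expected to be smaller)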

What about inference?

As we have seen, the output from regsubsets() relates to the model selection procedure; i.e. we get the values of \(C_p\), BIC and adjusted \(R^2\). We can also extract model-specific regression coefficients via the command coef(). For quantities such as standard errors, confidence and prediction intervals, etc., we need to re-fit the selected best model using standard least squares.

Say, for example, that we have confidence in the model with 10 predictors, but we also want additional information about it. In this case we use the command lm() in R, but first we need to check whether the selected model involves any factors. This check is needed because regsubsets() automatically converts factor variables into dummy variables and reports the dummy names (such as DivisionW); if we copy such a dummy name directly into an lm() formula we will get an error, since no column of that name exists in the data, so we must refer to the original factor variable instead. Let’s check the coefficients of the model with 10 predictors and also examine what the first 4 rows of the data matrix look like.

best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(best, 10)
##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
##         CRBI       CWalks    DivisionW      PutOuts      Assists 
##    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680
head(Hitters, 4L)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
##                   Salary NewLeague
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
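
Alternatively, a quick programmatic check (a minimal sketch, not from the original workshop code) of which columns are stored as factors:

sapply(Hitters, is.factor)   # TRUE for the factor columns League, Division and NewLeague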

From the output of head() we see that the variable Division is a factor. From the output of coef() we can deduce that regsubsets() automatically created a dummy variable DivisionW, which equals 1 when Division = W and 0 otherwise. Therefore, when we use the command lm() we must use the original variable Division in the formula rather than the dummy name DivisionW; below we wrap it in I(), which is why the coefficient appears as I(Division)W in the output.

ls10 = lm(Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + CRBI + CWalks +
          I(Division) + PutOuts + Assists, data = Hitters)

Then we can get standard output using commands like summary() or confint() for confidence intervals.

summary(ls10)
## 
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + 
##     CRBI + CWalks + I(Division) + PutOuts + Assists, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -939.11 -176.87  -34.08  130.90 1910.55 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   162.53544   66.90784   2.429 0.015830 *  
## AtBat          -2.16865    0.53630  -4.044 7.00e-05 ***
## Hits            6.91802    1.64665   4.201 3.69e-05 ***
## Walks           5.77322    1.58483   3.643 0.000327 ***
## CAtBat         -0.13008    0.05550  -2.344 0.019858 *  
## CRuns           1.40825    0.39040   3.607 0.000373 ***
## CRBI            0.77431    0.20961   3.694 0.000271 ***
## CWalks         -0.83083    0.26359  -3.152 0.001818 ** 
## I(Division)W -112.38006   39.21438  -2.866 0.004511 ** 
## PutOuts         0.29737    0.07444   3.995 8.50e-05 ***
## Assists         0.28317    0.15766   1.796 0.073673 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 311.8 on 252 degrees of freedom
## Multiple R-squared:  0.5405, Adjusted R-squared:  0.5223 
## F-statistic: 29.64 on 10 and 252 DF,  p-value: < 2.2e-16
confint(ls10, 1:11)
##                      2.5 %       97.5 %
## (Intercept)    30.76564039 294.30524369
## AtBat          -3.22485010  -1.11245000
## Hits            3.67507200  10.16096299
## Walks           2.65203223   8.89441706
## CAtBat         -0.23937493  -0.02078463
## CRuns           0.63939259   2.17710543
## CRBI            0.36150857   1.18711573
## CWalks         -1.34995509  -0.31169761
## I(Division)W -189.60973671 -35.15037828
## PutOuts         0.15076820   0.44397699
## Assists        -0.02732144   0.59365750
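
Prediction intervals can be obtained in the same way via predict() with interval = "prediction"; the sketch below (the choice of the first three players is arbitrary and purely illustrative) shows how:

predict(ls10, newdata = Hitters[1:3, ], interval = "prediction", level = 0.95)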

Practice: Body Fat and Body Measurements Data

With these data we want to predict body fat (variable brozek) using the other available variables, except for siri (another way of computing body fat), density (it is used in the brozek and siri formulas) and free (it is computed using the brozek formula). So, these variables need to be removed.

library(faraway) 
help(fat)    # read the help file for more information about the variables
head(fat)
##   brozek siri density age weight height adipos  free neck chest abdom   hip
## 1   12.6 12.3  1.0708  23 154.25  67.75   23.7 134.9 36.2  93.1  85.2  94.5
## 2    6.9  6.1  1.0853  22 173.25  72.25   23.4 161.3 38.5  93.6  83.0  98.7
## 3   24.6 25.3  1.0414  22 154.00  66.25   24.7 116.0 34.0  95.8  87.9  99.2
## 4   10.9 10.4  1.0751  26 184.75  72.25   24.9 164.7 37.4 101.8  86.4 101.2
## 5   27.8 28.7  1.0340  24 184.25  71.25   25.6 133.1 34.4  97.3 100.0 101.9
## 6   20.6 20.9  1.0502  24 210.25  74.75   26.5 167.0 39.0 104.5  94.4 107.8
##   thigh knee ankle biceps forearm wrist
## 1  59.0 37.3  21.9   32.0    27.4  17.1
## 2  58.7 37.3  23.4   30.5    28.9  18.2
## 3  59.6 38.9  24.0   28.8    25.2  16.6
## 4  60.1 37.3  22.8   32.4    29.4  18.2
## 5  63.2 42.2  24.0   32.2    27.7  17.7
## 6  66.0 42.0  25.6   35.7    30.6  18.8
dim(fat)
## [1] 252  18
names(fat)
##  [1] "brozek"  "siri"    "density" "age"     "weight"  "height"  "adipos" 
##  [8] "free"    "neck"    "chest"   "abdom"   "hip"     "thigh"   "knee"   
## [15] "ankle"   "biceps"  "forearm" "wrist"
fat = fat[,-c(2, 3, 8)]
dim(fat)
## [1] 252  15
sum(is.na(fat$brozek))
## [1] 0
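
As a quick check (illustrative only), we can confirm that the three excluded variables are no longer in the data frame:

c("siri", "density", "free") %in% names(fat)   # should all be FALSE after the removal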

Task 1

Forward stepwise regression: validation. Use regsubsets() to perform forward stepwise selection under the validation approach, using 127 observations as the training sample and 125 observations as the testing sample. Set the seed equal to 1. Store the coefficients of the final model in an object called coef.fwd1. Repeat this with the seed set to 13 and again store the coefficients of the final model in an object called coef.fwd2. What is the coefficient of the predictor age in the model from the first run? Is the variable age included in the model from the second run? If not, which predictor appears first in coef.fwd2?

Task 2

Backward stepwise regression: cross-validation. Implement backward stepwise selection using 5-fold cross-validation with the random seed set to 20. Plot the mean cross-validation MSE as a function of the number of predictors and use the command points() to clearly indicate on the plot which model has the lowest cross-validation MSE.

Task 3

Standard Inference. Fit a standard LS model for the best final model from Task 2. Produce 95% confidence intervals for the variables abdom and forearm.