In this workshop we start with an analysis of the Hitters dataset included in the package ISLR. The dataset consists of 322 records of baseball players. The response variable is the players' salary and there are 19 predictors, including variables such as the number of hits, years in the league and so forth.
As in Workshop 1, we first load the dataset and "clean" it by removing entries with missing values.
library(ISLR)
help(Hitters)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
dim(Hitters)
## [1] 322 20
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
Now we will apply best subset selection based not on model selection criteria but on the validation approach. As we know, for validation (and cross-validation) we need to be able to predict. As discussed in Workshop 1, regsubsets() does not have its own built-in prediction method, so we will use the following function.
# Prediction method for regsubsets objects (not provided by the leaps package)
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # extract the formula used in the regsubsets() call
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the model with 'id' predictors
  xvars = names(coefi)                 # names of the selected variables
  mat[, xvars] %*% coefi               # predictions = X * beta
}
The validation approach is based on splitting the data once into a training sample and a validation (or testing) sample. We also need to decide how to split the data. A common approach is to use 2/3 of the sample for training and 1/3 for testing (although it does not necessarily have to be so). At the end of Workshop 1 we saw how to do this using the command sample(). The code is given below.
set.seed(10) # for the results to be reproducible
dim(Hitters)
## [1] 263 20
training.obs = sample(1:263, 175)
Hitters.train = Hitters[training.obs, ]
Hitters.test = Hitters[-training.obs, ]
dim(Hitters.train)
## [1] 175 20
dim(Hitters.test)
## [1] 88 20
We will also need to load the package leaps again, as we will use the command regsubsets().
library(leaps)
The procedure is executed as follows. We first apply regsubsets() to the training set. Then we create the empty vector val.error, in which we will store the 19 validation mean squared errors (MSEs), one for each of the 19 candidate models. Then we use a for loop inside which we (i) generate predictions using predict.regsubsets() and (ii) calculate the validation MSEs. At the end we simply locate which model produces the lowest MSE. The code is the following.
best = regsubsets(Salary~., data = Hitters.train, nvmax = 19)
val.error <- c()
for(i in 1:19){
  pred = predict.regsubsets(best, Hitters.test, i)
  val.error[i] = mean((Hitters.test$Salary - pred)^2)
}
val.error
## [1] 199324.8 184215.9 196949.2 192709.1 180806.0 179388.2 178929.9 176754.4
## [9] 177029.2 163013.2 165134.5 164986.7 166072.1 164983.9 164890.7 165144.4
## [17] 164983.7 164962.6 164959.9
which.min(val.error)
## [1] 10
In this instance, under the option set.seed(10) (more on that below), we see that the model with 10 predictors is selected as the best. This was also the model selected by \(C_p\) in Workshop 1.
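As a quick cross-check of that claim (a sketch only; summary() of a regsubsets fit on the full data contains the \(C_p\) values in its cp component):
full = regsubsets(Salary~., data = Hitters, nvmax = 19) # full-data fit, used only for the Cp check
which.min(summary(full)$cp)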
It is important to note that there is a final step in this procedure. We should not report or use the coefficients above as they stand, because they were "learned" from the reduced training sample. They will not be the same as the coefficients obtained by fitting the 10-predictor model to the entire dataset; in fact, best subset selection on the full data may even select different variables!
coef(best, 10)
## (Intercept) AtBat Hits Runs Walks CAtBat
## -21.6323918 -0.7961450 4.5874937 -3.1211583 7.5830718 -0.1058408
## CRuns CRBI CWalks DivisionW PutOuts
## 1.9895304 0.3010303 -1.1736748 -121.3385526 0.1989660
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(best, 10)
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
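To make this comparison explicit, here is a small sketch (the object names are illustrative, and it assumes the training split created earlier is still in the workspace) that lists the variable names selected by one fit but not the other:
best.train = regsubsets(Salary~., data = Hitters.train, nvmax = 19) # fit on the training sample
best.full = regsubsets(Salary~., data = Hitters, nvmax = 19)        # fit on the full data
setdiff(names(coef(best.train, 10)), names(coef(best.full, 10)))    # selected only on the training data
setdiff(names(coef(best.full, 10)), names(coef(best.train, 10)))    # selected only on the full data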
Now let's turn our attention to the K-fold cross-validation (CV) approach. This procedure is generally preferable to simple validation because it utilizes the entire sample for both training and testing. As we will see below, this also decreases the variance of the selection. A first question is how many folds to use. Common choices in practice are 5 and 10, leading to 5-fold and 10-fold CV. In this example we will use 10 folds. Of course, the sample size n will often not be perfectly divisible by K (as in our case, where n = 263 and K = 10). This is not a problem: some folds simply contain one observation more than others (here 263 = 10 × 26 + 3, so three folds will have 27 observations and the remaining seven will have 26).
Let's see how we can create the folds in R. First we need to form an indicator vector of length 263 containing the numbers 1, 2, …, 10, each appearing roughly the same number of times. This can be done by combining the commands cut() and seq() in R, as below.
k = 10
folds = cut(seq(1, 263), breaks=10, labels=FALSE)
folds
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [26] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [51] 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [76] 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [101] 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [126] 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [151] 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
## [176] 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [201] 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [226] 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10
## [251] 10 10 10 10 10 10 10 10 10 10 10 10 10
table(folds)
## folds
## 1 2 3 4 5 6 7 8 9 10
## 27 26 26 26 27 26 26 26 26 27
It is then very important to randomly re-shuffle this vector, because we do not want the fold-creation procedure to be deterministic. We can do this very easily in R using, once again, the command sample().
set.seed(2)
folds = sample(folds)
folds
## [1] 8 10 8 7 3 5 3 6 9 2 3 4 5 9 7 2 10 5 7 6 2 10 8 6 4
## [26] 1 5 8 7 5 10 4 5 10 8 8 1 5 9 2 2 10 8 10 1 1 7 9 2 6
## [51] 5 8 3 4 8 2 1 6 8 10 5 7 2 3 7 7 1 1 7 3 9 8 4 3 6
## [76] 7 5 5 10 2 5 6 2 1 4 5 9 1 3 3 8 3 10 8 6 1 2 6 1 4
## [101] 9 6 4 10 9 1 9 7 9 8 4 8 6 1 10 2 10 10 7 9 3 9 4 4 5
## [126] 6 10 3 9 2 8 8 6 2 2 6 1 9 1 10 7 4 2 4 8 5 8 1 6 3
## [151] 1 10 10 1 3 3 3 7 3 4 2 9 4 6 8 9 7 2 1 7 4 3 5 7 8
## [176] 4 9 7 9 5 2 1 1 6 4 4 5 10 6 5 1 10 9 7 3 1 5 2 7 7
## [201] 4 6 7 10 3 6 5 10 4 9 9 5 7 7 2 6 9 5 5 3 2 3 5 5 9
## [226] 3 6 1 1 9 8 6 8 9 10 4 4 10 10 5 2 3 3 6 3 4 10 9 6 2
## [251] 1 8 6 7 8 4 2 8 1 10 2 7 4
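An equivalent alternative (a sketch only, not used in the rest of the workshop) is to create balanced and already shuffled fold labels in a single step:
folds.alt = sample(rep(1:k, length.out = nrow(Hitters))) # balanced labels 1-10, randomly ordered
table(folds.alt) # fold sizes again differ by at most one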
Finally, we need to create a matrix, this time with K = 10 rows and 19 columns. The validation MSEs of the folds for model 1 will be stored in the first column of this matrix, those for model 2 in the second column, and so forth.
cv.errors = matrix(NA, nrow = k, ncol = 19, dimnames = list(NULL, paste(1:19)))
cv.errors
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## [1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [7,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [8,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [9,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Above, NA means "not available"; these entries will be filled in with the MSEs. The part dimnames = list(NULL, paste(1:19)) simply assigns no names to the rows and names the columns 1, 2, …, 19.
Now we can implement best subset selection based on CV. Here we need a double for loop: one loop over the folds and one over the models. The important part is to understand what data = Hitters[folds!=j, ] is doing in the outer loop. In R, != means "not equal", so, for example, when j=1 we discard the rows of Hitters for which the corresponding entry of the folds vector equals 1. In the inner loop, where we predict, things are reversed and we use Hitters[folds==j, ] and Hitters$Salary[folds==j]. As we see, the validation MSEs are iteratively stored in the matrix cv.errors.
for(j in 1:k){
  # fit best subset selection on all observations except those in fold j
  best.fit = regsubsets(Salary~., data = Hitters[folds!=j, ], nvmax = 19)
  for(i in 1:19){
    # predict fold j with the model of size i and store its validation MSE
    pred = predict(best.fit, Hitters[folds==j, ], id = i)
    cv.errors[j,i] = mean( (Hitters$Salary[folds==j]-pred)^2)
  }
}
cv.errors
## 1 2 3 4 5 6 7
## [1,] 180095.35 140597.34 139979.20 155610.99 178269.2 169549.63 163415.9
## [2,] 102991.38 95451.06 113165.61 121295.24 135624.2 123478.30 133437.4
## [3,] 155638.99 122730.49 126006.08 115333.77 108912.7 98247.02 104038.5
## [4,] 152325.76 114621.36 125147.80 110042.53 113753.6 120745.66 117996.8
## [5,] 255514.09 277193.55 270456.47 258450.67 255676.2 231835.72 244154.3
## [6,] 177009.41 331571.03 226916.27 203351.68 177070.7 151180.00 148773.5
## [7,] 181272.43 127906.28 126475.30 120322.72 119688.6 95275.05 105056.3
## [8,] 68261.04 87882.66 97202.19 102823.06 105857.1 106254.47 108113.3
## [9,] 68283.58 100979.35 98096.98 111093.82 104191.5 97551.93 102718.8
## [10,] 132803.93 83946.03 88047.42 72391.53 70098.4 63602.55 66458.2
## 8 9 10 11 12 13 14
## [1,] 149809.38 139194.15 145352.00 149295.49 155846.57 166191.81 160500.72
## [2,] 118057.28 122960.98 111992.83 114778.64 114437.92 111970.89 112782.56
## [3,] 92079.40 102035.64 102208.75 87132.03 94810.27 90853.96 89914.72
## [4,] 109997.63 108747.37 113161.24 119481.82 119081.00 119718.04 125344.00
## [5,] 234995.16 228870.88 229668.35 230603.16 226742.89 228211.32 227006.44
## [6,] 167150.69 132632.73 147014.32 125897.63 123587.05 131048.28 129096.09
## [7,] 96645.30 97194.98 115230.50 103373.29 107759.17 107231.14 108093.31
## [8,] 96720.35 99794.87 100105.45 98519.80 104422.89 102786.61 101791.75
## [9,] 95467.90 102061.37 97701.55 104063.92 105347.58 104597.96 106703.51
## [10,] 56856.51 55356.77 48005.26 48964.83 47179.56 49761.98 48941.38
## 15 16 17 18 19
## [1,] 158269.29 157488.10 157949.72 157841.95 159337.52
## [2,] 112515.11 111740.27 112085.76 111576.45 111492.04
## [3,] 93356.48 93141.77 93031.62 92770.19 92834.25
## [4,] 126959.49 125441.40 125312.29 125078.64 125069.05
## [5,] 226562.34 226576.32 226875.52 228599.44 228438.30
## [6,] 130855.21 131669.09 130801.91 131222.61 131767.46
## [7,] 107299.53 107115.77 107135.90 107100.60 107117.21
## [8,] 102182.97 102297.60 102684.98 102856.16 102882.81
## [9,] 105865.52 105706.35 105443.95 105648.15 105519.79
## [10,] 52259.35 52401.54 52216.29 52054.44 52034.40
Once we have obtained the matrix cv.errors we can use apply(cv.errors, 2, mean), as below, to get the column-wise means. (If we had used apply(cv.errors, 1, mean) we would have wrongly calculated the 10 row-wise means.) Then all we need to do is locate the minimum. As we see below, the best model based on CV is the one with 11 predictors; as before, we re-train this model using the entire dataset.
mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 147419.6 148287.9 141149.3 137071.6 136914.2 125772.0 129416.3 121778.0
## 9 10 11 12 13 14 15 16
## 118885.0 121044.0 118211.1 119921.5 121237.2 121017.4 121612.5 121357.8
## 17 18 19
## 121353.8 121474.9 121649.3
which.min(mean.cv.errors)
## 11
## 11
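It is often helpful to visualize these results before re-training; a minimal plotting sketch, using the vector mean.cv.errors computed above:
plot(mean.cv.errors, type = 'b', xlab = 'Number of Predictors', ylab = 'Mean CV MSE') # CV curve
points(which.min(mean.cv.errors), min(mean.cv.errors), col = 2, pch = 19, cex = 1.5) # mark the minimum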
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(best, 11)
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.7852528 -0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277
When using validation or cross-validation for best subset selection (or any stepwise method), it is important to understand that the results depend on the random split between training and testing samples. This is a genuine drawback that is not often discussed, especially for the simple validation approach. The code below implements the validation approach presented previously, but now starting from 50 different random seeds.
min.valid = c()
for(n in 1:50){
  set.seed(n) # a different random split for each n
  training.obs = sample(1:263, 175)
  Hitters.train = Hitters[training.obs, ]
  Hitters.test = Hitters[-training.obs, ]
  best = regsubsets(Salary~., data = Hitters.train, nvmax = 19)
  val.error <- c()
  for(i in 1:19){
    pred = predict.regsubsets(best, Hitters.test, i)
    val.error[i] = mean((Hitters.test$Salary - pred)^2)
  }
  min.valid[n] = which.min(val.error) # size of the model selected under this split
}
The histogram below shows that the final selection of the best model can vary a lot! Obviously this is not desirable. On average we are selecting close to 9 or 10 predictors, but this implies that we would have to repeat the procedure many times in order to be on the safe side, which further increases the computational cost of BSS based on validation.
hist(min.valid, col = 3, nclass = 50, xlab = 'Number of Predictors', main = 'BSS with validation')
abline(v = mean(min.valid), col = 2, lwd = 4)
legend('topright', legend=c('Average selection'),bty = 'n', lty = 1, lwd = 4, col = 2)
Let's now examine how cross-validation is affected. In the code below we implement BSS based on CV, starting again from 50 different random seeds.
min.valid1 = c()
k = 10
for(n in 1:50){
  set.seed(n)
  folds = sample(1:k, 263, replace = TRUE) # fold labels sampled with replacement, so fold sizes vary
  cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
  for(j in 1:k){
    best.fit = regsubsets(Salary~., data = Hitters[folds!=j, ], nvmax = 19)
    for(i in 1:19){
      pred = predict(best.fit, Hitters[folds==j, ], id = i)
      cv.errors[j,i] = mean( (Hitters$Salary[folds==j]-pred)^2)
    }
  }
  mean.cv.errors = apply(cv.errors, 2, mean)
  min.valid1[n] = which.min(mean.cv.errors) # size of the model selected by CV under this seed
}
As we see from the histogram, there is still some variability, but in general the selection based on cross-validation is much more stable than the selection based on simple validation. This is because, eventually, all of the data are used for both training and testing. Still, one should be reluctant to draw conclusions from just one run. Note also that the computational cost is now even higher, because each random split requires fitting the procedure K times (once per fold) rather than once.
hist(min.valid1, col = 3, nclass = 50, xlab = 'Number of Predictors', main = 'BSS with cross-validation')
abline(v = mean(min.valid1), col = 2, lwd = 4)
legend('topright', legend=c('Average selection'),bty = 'n', lty = 1, lwd = 4, col = 2)
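To quantify the difference in stability, a quick sketch comparing the two sets of selections over the 50 seeds (using the vectors min.valid and min.valid1 created above):
table(min.valid)  # selected model sizes over 50 seeds, simple validation
table(min.valid1) # selected model sizes over 50 seeds, 10-fold CV
c(validation = var(min.valid), cv = var(min.valid1)) # sample variances of the two selections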
As we have seen, the output from regsubsets() relates to the model selection procedure; i.e. we get the values of \(C_p\), BIC and adjusted \(R^2\). We can also extract model-specific regression coefficients via the command coef(). For quantities such as standard errors, confidence intervals and prediction intervals we need to re-fit the selected best model using standard least squares.
Say, for example, that we are confident about the model with 10 predictors, but we also want additional information about it. In this case we use the command lm() in R, but first we need to check whether the selected model contains any factors. This check is needed because regsubsets() automatically converts factor variables to dummy variables and reports the dummy names (e.g. DivisionW); these dummy names are not columns of the data frame, so typing them directly into an lm() formula would produce an error. Let's check the coefficients of the model with 10 predictors and also examine what the first 4 rows of the data matrix look like.
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(best, 10)
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
head(Hitters, 4L)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## Salary NewLeague
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
From the output of head() we see that the variable Division is a factor. From the output of coef() we can deduce that regsubsets() automatically created a dummy variable which equals 1 when Division = W and 0 otherwise. Therefore, when we use the command lm() we must refer to the original factor Division (rather than the dummy name DivisionW); below we enclose it within I(), which is why the coefficient is labeled I(Division)W in the output.
ls10 = lm(Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + CRBI + CWalks +
I(Division) + PutOuts + Assists, data = Hitters)
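Note that lm() also handles factors automatically, so an equivalent fit can be obtained by writing the factor directly (a sketch, stored here for illustration as ls10b; the only visible difference is that the coefficient is then labeled DivisionW instead of I(Division)W):
ls10b = lm(Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + CRBI + CWalks +
           Division + PutOuts + Assists, data = Hitters) # same fitted model as ls10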
Then we can get standard output using commands like summary(), or confint() for confidence intervals.
summary(ls10)
##
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns +
## CRBI + CWalks + I(Division) + PutOuts + Assists, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -939.11 -176.87 -34.08 130.90 1910.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.53544 66.90784 2.429 0.015830 *
## AtBat -2.16865 0.53630 -4.044 7.00e-05 ***
## Hits 6.91802 1.64665 4.201 3.69e-05 ***
## Walks 5.77322 1.58483 3.643 0.000327 ***
## CAtBat -0.13008 0.05550 -2.344 0.019858 *
## CRuns 1.40825 0.39040 3.607 0.000373 ***
## CRBI 0.77431 0.20961 3.694 0.000271 ***
## CWalks -0.83083 0.26359 -3.152 0.001818 **
## I(Division)W -112.38006 39.21438 -2.866 0.004511 **
## PutOuts 0.29737 0.07444 3.995 8.50e-05 ***
## Assists 0.28317 0.15766 1.796 0.073673 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.8 on 252 degrees of freedom
## Multiple R-squared: 0.5405, Adjusted R-squared: 0.5223
## F-statistic: 29.64 on 10 and 252 DF, p-value: < 2.2e-16
confint(ls10, 1:11)
## 2.5 % 97.5 %
## (Intercept) 30.76564039 294.30524369
## AtBat -3.22485010 -1.11245000
## Hits 3.67507200 10.16096299
## Walks 2.65203223 8.89441706
## CAtBat -0.23937493 -0.02078463
## CRuns 0.63939259 2.17710543
## CRBI 0.36150857 1.18711573
## CWalks -1.34995509 -0.31169761
## I(Division)W -189.60973671 -35.15037828
## PutOuts 0.15076820 0.44397699
## Assists -0.02732144 0.59365750
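Since ls10 is an ordinary lm object, prediction intervals are also available; a minimal sketch, predicting (for illustration) the first three players in the dataset:
predict(ls10, newdata = Hitters[1:3, ], interval = 'prediction') # 95% prediction intervals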
We now turn to the fat dataset from the package faraway. With these data we want to predict body fat (variable brozek) using the other available variables, except for siri (another way of computing body fat), density (it is used in the brozek and siri formulas) and free (it is computed using the brozek formula). So, these variables need to be removed.
library(faraway)
help(fat) # read the help file for more information about the variables
head(fat)
## brozek siri density age weight height adipos free neck chest abdom hip
## 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5
## 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7
## 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2
## 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2
## 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9
## 6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8
## thigh knee ankle biceps forearm wrist
## 1 59.0 37.3 21.9 32.0 27.4 17.1
## 2 58.7 37.3 23.4 30.5 28.9 18.2
## 3 59.6 38.9 24.0 28.8 25.2 16.6
## 4 60.1 37.3 22.8 32.4 29.4 18.2
## 5 63.2 42.2 24.0 32.2 27.7 17.7
## 6 66.0 42.0 25.6 35.7 30.6 18.8
dim(fat)
## [1] 252 18
names(fat)
## [1] "brozek" "siri" "density" "age" "weight" "height" "adipos"
## [8] "free" "neck" "chest" "abdom" "hip" "thigh" "knee"
## [15] "ankle" "biceps" "forearm" "wrist"
fat = fat[,-c(2, 3, 8)]
dim(fat)
## [1] 252 15
sum(is.na(fat$brozek))
## [1] 0
Task 1. Forward stepwise regression: validation. Use regsubsets() to perform forward stepwise selection under the validation approach, using 127 observations as the training sample and 125 observations as the testing sample. Set the seed equal to 1. Store the coefficients of the final model in an object called coef.fwd1. Repeat this with the seed set to 13 and again store the coefficients of the final model in an object called coef.fwd2. What is the coefficient of the predictor age from the first run? Is the variable age included in the model from the second run? If not, what is the first predictor in coef.fwd2?
Task 2. Backward stepwise regression: cross-validation. Implement backward stepwise selection using 5-fold cross-validation with the random seed set to 20. Plot the mean cross-validation MSE as a function of the number of predictors and use the command points() to clearly indicate on the plot which model has the lowest cross-validation MSE.
Task 3. Standard Inference. Fit a standard LS model for the best final model from Task 2. Produce 95% confidence intervals for the variables abdom and forearm.