Today we will work with the Cars93 dataset from the standard R package MASS. So, first let's load the package.
library('MASS')
Now, let's import the Cars93 data.
data(Cars93)
class(Cars93)
## [1] "data.frame"
dim(Cars93)
## [1] 93 27
head(Cars93)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
## 1 Acura Integra Small 12.9 15.9 18.8 25 31
## 2 Acura Legend Midsize 29.2 33.9 38.7 18 25
## 3 Audi 90 Compact 25.9 29.1 32.3 20 26
## 4 Audi 100 Midsize 30.8 37.7 44.6 19 26
## 5 BMW 535i Midsize 23.7 30.0 36.2 22 30
## 6 Buick Century Midsize 14.2 15.7 17.3 22 31
## AirBags DriveTrain Cylinders EngineSize Horsepower RPM
## 1 None Front 4 1.8 140 6300
## 2 Driver & Passenger Front 6 3.2 200 5500
## 3 Driver only Front 6 2.8 172 5500
## 4 Driver & Passenger Front 6 2.8 172 5500
## 5 Driver only Rear 4 3.5 208 5700
## 6 Driver only Front 4 2.2 110 5200
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1 2890 Yes 13.2 5 177 102
## 2 2335 Yes 18.0 5 195 115
## 3 2280 Yes 16.9 5 180 102
## 4 2535 Yes 21.1 6 193 106
## 5 2545 Yes 21.1 4 186 109
## 6 2565 No 16.4 6 189 105
## Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
## 1 68 37 26.5 11 2705 non-USA Acura Integra
## 2 71 38 30.0 15 3560 non-USA Acura Legend
## 3 67 37 28.0 14 3375 non-USA Audi 90
## 4 70 37 31.0 17 3405 non-USA Audi 100
## 5 69 39 27.0 13 3640 non-USA BMW 535i
## 6 69 41 28.0 16 2880 USA Buick Century
As we see, the initial sample size is 93 and we have 27 variables.
Before proceeding to any analysis, it is important to check whether a dataset contains missing values (NA entries). We can do this as follows.
sum(is.na(Cars93))
## [1] 13
We see that there are 13 missing entries. It is better to remove the rows containing them so that they do not cause problems later on. We can do this using the command Cars93 = na.omit(Cars93).
Cars93 = na.omit(Cars93)
n = nrow(Cars93)
n
## [1] 82
So, the sample size of the cleaned data is now 82. Typing ?Cars93 will open the help window.
?Cars93
There we see basic descriptions of the variables contained in the dataset and read that “Cars were selected at random from among 1993 passenger car models that were listed in both the Consumer Reports issue and the PACE Buying Guide. Pickup trucks and Sport/Utility vehicles were eliminated due to incomplete information in the Consumer Reports source. Duplicate models (e.g., Dodge Shadow and Plymouth Sundance) were listed at most once.”
The command attach comes in very handy, as it allows us to access the columns (by their names) of any object of class data.frame directly.
attach(Cars93)
We will begin by focusing on the variable Price (as a potential response variable) and the variable Horsepower (as a potential covariate). We can initially have a look at the corresponding scatterplot, superimposing the regression line obtained from the R function lm. We can also calculate the sample correlation estimate using cor.
plot(Horsepower, Price, col = 2, bty = 'l', pch = 16)
linear.model = lm(Price ~ Horsepower, data = Cars93)
abline(linear.model, col='blue')
cor(Horsepower, Price)
## [1] 0.7872511
The boot package for non-regression analysis (continued)
We have already seen at the end of Practical 1 that one can use the boot R package for basic bootstrap analysis. Let's revisit this, focusing on the correlation between Price and Horsepower. Initially, we load the library and then define an appropriate function for calculating the correlation, which will be used as input to boot.
library(boot)
boot.cor.f <- function(y, x, data, indices){
  dt <- data[indices, ]       # row-wise resample of the data
  r = cor(dt[, y], dt[, x])   # correlation between the two chosen columns
  r
}
Above, data is a data frame, while y and x are names of columns of this data frame. Note that we do not include an additional argument for the type of correlation, so we use the default in cor, which is Pearson's correlation coefficient. The line dt <- data[indices,] defines a row-wise resampled version of data, and cor(dt[,y], dt[,x]) then computes the correlation between the two columns that we want. However, since single-bracket indexing of a data frame by name also returns the corresponding column, we can equally use the code below, without the commas in cor(dt[,y], dt[,x]):
boot.cor.f <- function(y, x, data, indices){
  dt <- data[indices, ]   # row-wise resample of the data
  r = cor(dt[y], dt[x])   # single-bracket indexing also selects the columns
  r
}
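To see why both versions work, note that matrix-style indexing with a comma (Cars93[, 'Price']) returns a plain vector, while single-bracket indexing by name (Cars93['Price']) returns a one-column data frame; cor accepts both. A quick illustration (a small check, not part of the original practical):
class(Cars93[, 'Price'])   # "numeric": a plain vector
class(Cars93['Price'])     # "data.frame": a one-column data frame
cor(Cars93['Price'], Cars93['Horsepower'])   # the same correlation, returned as a 1x1 matrix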
Recall that we can use names to check the column names of our data frame, just to be sure. Then, we use boot in combination with our function boot.cor.f. Below we use 5000 bootstrap replications.
names(Cars93)
## [1] "Manufacturer" "Model" "Type"
## [4] "Min.Price" "Price" "Max.Price"
## [7] "MPG.city" "MPG.highway" "AirBags"
## [10] "DriveTrain" "Cylinders" "EngineSize"
## [13] "Horsepower" "RPM" "Rev.per.mile"
## [16] "Man.trans.avail" "Fuel.tank.capacity" "Passengers"
## [19] "Length" "Wheelbase" "Width"
## [22] "Turn.circle" "Rear.seat.room" "Luggage.room"
## [25] "Weight" "Origin" "Make"
set.seed(1)
bootstrap.cor = boot(y = 'Price', x ='Horsepower', data = Cars93,
statistic = boot.cor.f, R=5000)
head(bootstrap.cor$t, n = 10)
## [,1]
## [1,] 0.7516637
## [2,] 0.7725906
## [3,] 0.8158606
## [4,] 0.8570860
## [5,] 0.8637691
## [6,] 0.7187499
## [7,] 0.7223043
## [8,] 0.7909135
## [9,] 0.7535251
## [10,] 0.7400386
bootstrap.cor
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Cars93, statistic = boot.cor.f, R = 5000, y = "Price",
## x = "Horsepower")
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.7872511 0.004906231 0.04708936
plot(bootstrap.cor)
As we have seen, for confidence intervals we can use boot.ci (type ?boot.ci for help) as follows:
boot.ci(bootstrap.cor, conf = 0.95)
## Warning in boot.ci(bootstrap.cor, conf = 0.95): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap.cor, conf = 0.95)
##
## Intervals :
## Level Normal Basic
## 95% ( 0.6901, 0.8746 ) ( 0.6960, 0.8791 )
##
## Level Percentile BCa
## 95% ( 0.6954, 0.8785 ) ( 0.6752, 0.8663 )
## Calculations and Intervals on Original Scale
We get the four bootstrap confidence intervals explained in Practical 1, but again we get a warning message for the "studentized intervals" (these correspond to the bootstrap-\(t\) intervals in the lecture notes, Section 4.4). To obtain those, boot.ci needs the bootstrap statistic to return not only the estimate but also an estimate of its variance, so we have to work a bit harder and define the following function, which implements a nested bootstrap:
boot.cor.nested.f <- function(y, x, data, indices, iter){
  dt <- data[indices, ]     # outer (row-wise) bootstrap resample
  r = cor(dt[y], dt[x])     # correlation on the outer resample
  nested = boot(
    data = dt,              # the inner bootstrap resamples from the outer resample
    y = y,
    x = x,
    R = iter,
    statistic = boot.cor.f
  )
  v = var(nested$t, na.rm = TRUE)  # variance estimate required for studentized CIs
  c(r, v)                          # boot.ci expects c(estimate, variance)
}
Note that the argument iter (the number of inner bootstrap samples of correlations!) is used by the nested boot loop, which calls upon our previously defined function boot.cor.f. Without going into too much detail about the above function, we can now re-run boot using statistic = boot.cor.nested.f (below with 1000 outer bootstrap replications and 100 inner bootstrap replications) and obtain all five confidence intervals.
bootstrap.cor = boot(y = 'Price', x ='Horsepower', data = Cars93,
statistic = boot.cor.nested.f, R=1000, iter = 100)
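Each row of bootstrap.cor$t now contains the bootstrapped correlation together with its nested-bootstrap variance estimate; a quick dimension check (a small sanity check, not in the original output):
dim(bootstrap.cor$t)   # 1000 outer replicates, 2 columns: (r, v)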
boot.ci(bootstrap.cor, conf = 0.95, type = 'all')
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap.cor, conf = 0.95, type = "all")
##
## Intervals :
## Level Normal Basic Studentized
## 95% ( 0.6886, 0.8751 ) ( 0.6904, 0.8795 ) ( 0.6933, 0.8802 )
##
## Level Percentile BCa
## 95% ( 0.6950, 0.8841 ) ( 0.6803, 0.8646 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
Let us now start to consider bootstrap linear regression, with an initial focus on linear regression models with one predictor of the form \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i,\] for \(i\in\{1,\ldots,n\}\). As mentioned, we will consider Price to be the response and Horsepower to be the covariate. We have already fitted this linear model above using linear.model = lm(Price ~ Horsepower, data = Cars93). Some useful functions (compatible with lm) for obtaining interesting quantities are as follows:
summary(linear.model)
##
## Call:
## lm(formula = Price ~ Horsepower, data = Cars93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9493 -2.8548 -0.6301 1.8289 30.8969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.32148 2.00202 -1.16 0.25
## Horsepower 0.15357 0.01345 11.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.18 on 80 degrees of freedom
## Multiple R-squared: 0.6198, Adjusted R-squared: 0.615
## F-statistic: 130.4 on 1 and 80 DF, p-value: < 2.2e-16
coef(linear.model) # Estimates of the betas
## (Intercept) Horsepower
## -2.3214825 0.1535693
confint(linear.model) # Standard CIs for the betas
## 2.5 % 97.5 %
## (Intercept) -6.305634 1.6626692
## Horsepower 0.126806 0.1803326
sigma(linear.model) # Estimate of residual standard deviation
## [1] 6.179503
summary(linear.model)$r.squared # The R^2 coefficient
## [1] 0.6197643
Recall that \(R^2\) is the proportion of the variation in the response variable that is explained by the predictor variable(s). When we have only one predictor this is just the correlation squared; indeed we have
cor(Horsepower, Price)^2
## [1] 0.6197643
Now, we have seen in lectures how to perform fully-manual linear regression using the OLS estimator \[
\hat{\boldsymbol{\beta}} =
\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}.
\] This is fine, of course, but we might be interested in other quantities as well (such as the ones presented above), and this is where the lm function comes in handy, as we can incorporate its results in a function of our own.
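As a quick check that the formula and lm agree, we can compute the OLS estimates manually (a minimal sketch using the attached columns):
X = cbind(1, Horsepower)                         # design matrix: column of 1s plus the predictor
beta.hat = solve(t(X) %*% X) %*% t(X) %*% Price  # the OLS formula above
beta.hat                                         # should match coef(linear.model)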
The function below performs Nonparametric Paired Bootstrap (NPB) regression using lm and stores the regression coefficients \((\hat{\beta}_0,\hat{\beta}_1)\), the residual standard deviation \(\hat{\sigma}\) and \(R^2\) in result = matrix(NA, B, 4), which is a matrix of dimension \(B\times 4\).
NPB.reg <- function(y, x, data, B){
  n = nrow(data)
  result = matrix(NA, B, 4)
  colnames(result) = c('b0','b1','sigma','Rsquare')
  for(j in 1:B){
    indices = sample(1:n, n, replace = TRUE)   # resample (y, x) pairs with replacement
    reg = lm(y[indices] ~ x[indices])          # refit OLS on the bootstrap sample
    betas = coef(reg)
    sigma = sigma(reg)
    R2 = summary(reg)$r.squared
    result[j,] = c(betas, sigma, R2)
  }
  result
}
Let's use NPB regression with \(B = 1000\) and see what we get.
set.seed(1)
NPB.reg.results = NPB.reg(y = Price, x = Horsepower, data = Cars93, B = 1000)
head(NPB.reg.results)
## b0 b1 sigma Rsquare
## [1,] -0.7108329 0.1353524 5.254561 0.6604955
## [2,] 1.3792831 0.1185900 5.569575 0.5562221
## [3,] -3.2611997 0.1583319 5.875815 0.6355520
## [4,] -4.6731001 0.1782806 6.473332 0.6227720
## [5,] -4.5324996 0.1751890 6.696474 0.6156759
## [6,] -4.2214474 0.1717109 6.437548 0.6876424
Let's plot histograms of the bootstrap samples and calculate bootstrap standard errors and percentile intervals.
par(mfrow = c(2,2))
hist(NPB.reg.results[,1], main = expression(hat(beta)[0]), xlab = '', col = 2)
hist(NPB.reg.results[,2], main = expression(hat(beta)[1]), xlab = '', col = 3)
hist(NPB.reg.results[,3], main = expression(hat(sigma)), xlab = '', col = 4)
hist(NPB.reg.results[,4], main = expression(R^2), xlab = '', col = 5)
dev.off()
## null device
## 1
apply(NPB.reg.results, 2, sd)
## b0 b1 sigma Rsquare
## 2.44841501 0.02067921 1.02448221 0.07490683
apply(NPB.reg.results, 2, quantile, probs = c(0.025, 0.975))
## b0 b1 sigma Rsquare
## 2.5% -7.332912 0.1189282 4.125760 0.4906822
## 97.5% 1.847488 0.1961853 8.056665 0.7768442
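For comparison, basic bootstrap intervals can also be computed by hand by reflecting the percentile quantiles around the original estimate. A minimal sketch for \(\hat{\beta}_1\), using the usual reflection formula \((2\hat{\beta}_1 - q_{0.975},\ 2\hat{\beta}_1 - q_{0.025})\):
b1.hat = unname(coef(linear.model)[2])                        # original OLS slope
q = quantile(NPB.reg.results[, 2], probs = c(0.025, 0.975))   # bootstrap quantiles of b1
c(2*b1.hat - q[2], 2*b1.hat - q[1])                           # basic 95% interval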
We can visualise further, presenting a scatterplot of the data points, the fitted regression line from the original data (in black below) and the \(B = 1000\) bootstrapped regression lines (see ?abline for help).
plot(1, bty="l", xlab="Horsepower", ylab="Price", xlim = range(Horsepower),
ylim = range(Price))
for (j in 1:1000) {
abline(a = NPB.reg.results[j,1], b = NPB.reg.results[j,2], col='pink', lwd = 2)
}
abline(linear.model, col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)
In a similar manner (with slightly more complicated coding) we can use lm for Semiparametric Residual Bootstrap (SRB) regression. As we saw in lectures, we now resample from the fitted residuals of the original data and then construct the response variable from the OLS coefficients, the predictor matrix and the bootstrap residuals. The function below does that for us.
SRB.reg <- function(y, x, data, B){
  n = nrow(data)
  result = matrix(NA, B, 4)
  colnames(result) = c('b0','b1','sigma','Rsquare')
  fit = lm(y ~ x)        # fit on the original data
  res = fit$residuals    # residuals from the original data
  beta.ols = coef(fit)   # OLS estimates from the original data
  X = cbind(1, x)        # matrix containing a vector of 1s and the predictor
  for(j in 1:B){
    indices = sample(1:n, n, replace = TRUE)
    res.boot = res[indices]               # resampled residuals
    y.boot = X %*% beta.ols + res.boot    # bootstrap response
    reg = lm(y.boot ~ x)                  # refit OLS with the bootstrap response
    betas = coef(reg)
    sigma = sigma(reg)
    R2 = summary(reg)$r.squared
    result[j,] = c(betas, sigma, R2)
  }
  result
}
Now, with code similar to that presented above, we can run SRB regression and obtain quantities of interest; for instance:
set.seed(1)
SRB.reg.results = SRB.reg(y = Price, x = Horsepower, data = Cars93, B = 1000)
apply(SRB.reg.results, 2, sd)
## b0 b1 sigma Rsquare
## 2.02362663 0.01370479 1.02986016 0.08821007
apply(SRB.reg.results, 2, quantile, probs = c(0.025, 0.975))
## b0 b1 sigma Rsquare
## 2.5% -6.300754 0.1272001 4.148172 0.4532511
## 97.5% 1.692494 0.1817538 8.057334 0.7961323
We can also plot the two sets of bootstrap regression lines side by side:
par(mfrow = c(1,2))
# NPB
plot(1, bty="l", xlab="Horsepower", ylab="Price", main = 'Nonparametric paired',
xlim = range(Horsepower), ylim = range(Price))
for (j in 1:1000) {
abline(a = NPB.reg.results[j,1], b = NPB.reg.results[j,2], col='pink', lwd = 2)
}
abline(linear.model, col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)
#SRB
plot(1, bty="l", xlab="Horsepower", ylab="Price", main = 'Semiparametric residual',
xlim = range(Horsepower), ylim = range(Price))
for (j in 1:1000) {
abline(a = SRB.reg.results[j,1], b = SRB.reg.results[j,2], col='lightgreen', lwd = 2)
}
abline(linear.model, col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)
The R package car
We could write our own functions (similar to the ones considered above) using lm to extend to multiple linear regression models of the form \[ y_i = \beta_0 +\beta_1 x_{i1}+ \ldots + \beta_p x_{ip} + \epsilon_i.\] Fortunately, we do not have to do that, as we can use the R package car (standing for "Companion to Applied Regression"), which can automatically call boot functionality for regression problems. To install the package type install.packages('car'). Then load it and call the help file for the function Boot, which is the function of interest.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
Boot()
This function provides a simple front-end to the boot function that is tailored to bootstrapping regression models. The Boot function has only a few basic arguments.
?Boot
## starting httpd help server ... done
Boot
## function (object, f = coef, labels = names(f(object)), R = 999,
## method = c("case", "residual"), ncores = 1, ...)
## {
## UseMethod("Boot")
## }
## <bytecode: 0x0000028ecba358f0>
## <environment: namespace:car>
- object: a regression-model object, such as the object linear.model we obtained previously from lm().
- f: the function to be computed for the model on each bootstrap replication. The default is the coef() function. This can be extended to include more than one function, as we will see.
- labels: provides name(s) for the quantities that are kept on each bootstrap iteration.
- R: the number of bootstrap replications (default is 999).
- method: can be set either to "case" (the default), for case resampling (NPB regression), or to "residual", for residual resampling (SRB regression).
- ncores: a numeric argument that specifies the number of cores for parallel processing (advanced; not needed here).
- ...: permits passing additional arguments to the boot() function.
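For instance, a minimal call that bootstraps just the coefficients of our earlier single-predictor model could look as follows (a small sketch; with f and method left at their defaults this performs NPB case resampling of coef()):
set.seed(1)
Boot(linear.model, R = 200)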
Boot() with dataset Cars93
Let's have a look again at the first rows of Cars93.
head(Cars93, n = 2)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
## 1 Acura Integra Small 12.9 15.9 18.8 25 31
## 2 Acura Legend Midsize 29.2 33.9 38.7 18 25
## AirBags DriveTrain Cylinders EngineSize Horsepower RPM
## 1 None Front 4 1.8 140 6300
## 2 Driver & Passenger Front 6 3.2 200 5500
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1 2890 Yes 13.2 5 177 102
## 2 2335 Yes 18.0 5 195 115
## Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
## 1 68 37 26.5 11 2705 non-USA Acura Integra
## 2 71 38 30.0 15 3560 non-USA Acura Legend
Suppose we are interested in a linear model where Price is again the response but now we consider Horsepower, Fuel.tank.capacity and Passengers as predictor variables. We can fit this model as follows.
linear.model1 = lm(Price ~ Horsepower + Fuel.tank.capacity + Passengers)
summary(linear.model1)
##
## Call:
## lm(formula = Price ~ Horsepower + Fuel.tank.capacity + Passengers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.745 -2.912 -0.501 1.872 31.789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.71665 5.15147 -2.080 0.0408 *
## Horsepower 0.12153 0.02245 5.413 6.63e-07 ***
## Fuel.tank.capacity 0.64762 0.41396 1.564 0.1218
## Passengers 0.49466 1.09750 0.451 0.6534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.102 on 78 degrees of freedom
## Multiple R-squared: 0.6386, Adjusted R-squared: 0.6247
## F-statistic: 45.93 on 3 and 78 DF, p-value: < 2.2e-16
In this example the Fuel.tank.capacity and Passengers variables do not seem very valuable: their coefficients are not statistically significant and, accordingly, \(R^2\) has not increased much compared to the single-predictor model. Nevertheless, for the sake of demonstration let's continue.
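Indeed, we can confirm the limited \(R^2\) gain directly by comparing the two fitted models (a quick check; both values appear in the summaries shown earlier):
summary(linear.model)$r.squared    # single predictor: 0.6198
summary(linear.model1)$r.squared   # three predictors: 0.6386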
To start, we can use Boot as follows and then apply the summary() command.
set.seed(1)
NPB.results1 = Boot(linear.model1, R=1000)
summary(NPB.results1)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed
## (Intercept) -10.71665 0.1354137 4.521220 -10.44808
## Horsepower 0.12153 0.0042572 0.034251 0.12280
## Fuel.tank.capacity 0.64762 -0.0304775 0.424594 0.62988
## Passengers 0.49466 -0.0362997 0.899936 0.50852
As we see, this returns the OLS estimates along with NPB (the default option) estimates of bias, standard deviations and medians. We can further use the additional argument high.moments = TRUE in summary() to get estimates of skewness and kurtosis (as a reminder, see the Wikipedia pages on skewness and kurtosis, respectively).
summary(NPB.results1, high.moments = TRUE)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed bootSkew
## (Intercept) -10.71665 0.1354137 4.521220 -10.44808 -0.147825
## Horsepower 0.12153 0.0042572 0.034251 0.12280 0.479270
## Fuel.tank.capacity 0.64762 -0.0304775 0.424594 0.62988 -0.078434
## Passengers 0.49466 -0.0362997 0.899936 0.50852 -0.267173
## bootKurtosis
## (Intercept) 0.135168
## Horsepower 0.155500
## Fuel.tank.capacity 0.014498
## Passengers 1.038316
Now let's say that we also want to include the bootstrap estimates of the error standard deviation and of the \(R^2\) statistic. We also want to be explicit about the method that we use, i.e., NPB regression. We can adjust the above R code as follows.
set.seed(1)
NPB.results2 = Boot(linear.model1, method = 'case',
f = function(model){c(coef(model), sigma(model), summary(model)$r.squared)},
labels=c(names(coef(linear.model1)), 'sigmaHat', 'Rsquared'), R=1000)
summary(NPB.results2, high.moments = TRUE)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed bootSkew
## (Intercept) -10.71665 0.1354137 4.521220 -10.44808 -0.147825
## Horsepower 0.12153 0.0042572 0.034251 0.12280 0.479270
## Fuel.tank.capacity 0.64762 -0.0304775 0.424594 0.62988 -0.078434
## Passengers 0.49466 -0.0362997 0.899936 0.50852 -0.267173
## sigmaHat 6.10158 -0.2520093 1.081002 5.86923 0.176454
## Rsquared 0.63856 0.0194506 0.072615 0.65640 0.053391
## bootKurtosis
## (Intercept) 0.135168
## Horsepower 0.155500
## Fuel.tank.capacity 0.014498
## Passengers 1.038316
## sigmaHat -0.390158
## Rsquared -0.389620
What about CIs? We can again use the confint() command, although this time we need to specify which type of CI we want; note also that there is no option for studentized (bootstrap-\(t\)) intervals.
confint(NPB.results2, level= 0.95, type="norm")
## Bootstrap normal confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) -19.71349641 -1.9906380
## Horsepower 0.05014662 0.1844088
## Fuel.tank.capacity -0.15409654 1.5102827
## Passengers -1.23288322 2.2948016
## sigmaHat 4.23486207 8.4723102
## Rsquared 0.47678758 0.7614340
confint(NPB.results2, level= 0.95, type="basic")
## Bootstrap basic confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) -19.38130883 -2.0873697
## Horsepower 0.03931393 0.1756429
## Fuel.tank.capacity -0.15094712 1.5451919
## Passengers -1.12828207 2.2617890
## sigmaHat 4.18375995 8.2514854
## Rsquared 0.48208848 0.7571511
confint(NPB.results2, level= 0.95, type="perc")
## Bootstrap percent confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) -19.34593721 -2.0519981
## Horsepower 0.06742677 0.2037558
## Fuel.tank.capacity -0.24996066 1.4461784
## Passengers -1.27247010 2.1176009
## sigmaHat 3.95166837 8.0193939
## Rsquared 0.51997171 0.7950343
confint(NPB.results2, level= 0.95, type="bca")
## Warning in norm.inter(t, adj.alpha): extreme order statistics used as endpoints
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) -20.0056873 -2.4923130
## Horsepower 0.0659549 0.1992132
## Fuel.tank.capacity -0.2270605 1.4677597
## Passengers -1.2715970 2.1284371
## sigmaHat 4.5431185 9.5218231
## Rsquared 0.4820752 0.7678495
We see that for the BCa bootstrap intervals we get a warning message about extreme order statistics being used as endpoints; this can be ignored at this stage.
A nice feature of the function Boot() is its graphical functionality connected to the generic command hist. For instance, the following line of code produces the plot below.
hist(NPB.results2, legend="top")
## Warning in norm.inter(t, adj.alpha): extreme order statistics used as endpoints
We see a separate histogram for each bootstrapped quantity. In addition to the histograms we also get kernel density estimates (you will learn more about these in the Epiphany term) and the normal density based on the bootstrap mean and standard deviation. The vertical dashed line marks the original point estimate, and the thick horizontal line gives a confidence interval based on the bootstrap (the default is the BCa interval). Type ?hist.boot for further information.
In addition, the bootstrapped values of the defined quantities of interest (in our example the regression coefficients, the error standard deviation and \(R^2\)) are all retained in the object NPB.results2$t. For instance, the first 6 rows are the following:
head(NPB.results2$t)
## (Intercept) Horsepower Fuel.tank.capacity Passengers sigmaHat Rsquared
## [1,] -9.897919 0.1592634 0.55667452 -0.3922574 8.116354 0.6053351
## [2,] -2.642872 0.1225939 0.51726929 -0.6977310 5.874785 0.6431785
## [3,] -16.966994 0.1105785 0.74579302 1.7444184 7.600543 0.5780805
## [4,] -9.948523 0.1178819 0.20079509 1.7583248 5.734196 0.6619305
## [5,] -11.962795 0.1104768 0.66871684 1.0050806 6.111675 0.6098824
## [6,] -8.937864 0.1303570 0.03334818 1.8611176 5.950973 0.6114832
We see that the function Boot() from the R package car generally makes our work easier and provides flexibility. For instance, if we want to apply SRB regression we can simply use the following code.
set.seed(1)
SRB.results2 = Boot(linear.model1, method = 'residual',
f = function(model){c(coef(model), sigma(model), summary(model)$r.squared)},
labels=c(names(coef(linear.model1)), 'sigmaHat', 'Rsquared'),
R=1000)
We can then easily obtain any of the results presented above; for instance, the summaries
summary(SRB.results2, high.moments = TRUE)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed bootSkew
## (Intercept) -10.71665 0.32473470 5.021708 -10.62665 -0.018828
## Horsepower 0.12153 0.00019210 0.022720 0.12152 0.182086
## Fuel.tank.capacity 0.64762 -0.00076454 0.416748 0.63200 0.068808
## Passengers 0.49466 -0.06415861 1.111370 0.40527 -0.058098
## sigmaHat 6.10158 -0.01741883 1.100284 6.04756 0.250542
## Rsquared 0.63856 0.00486383 0.089583 0.64868 -0.283203
## bootKurtosis
## (Intercept) -0.16897
## Horsepower 0.52787
## Fuel.tank.capacity 0.20420
## Passengers 0.20946
## sigmaHat -0.21547
## Rsquared -0.25919
or the corresponding plot
hist(SRB.results2, legend ="top")