1 Cars93 dataset

[Image: Little cars]

Today we will work with the Cars93 dataset from the standard R package MASS. So, first let's load the package.

library('MASS')

Now, let's import the Cars93 data.

data(Cars93)
class(Cars93)
## [1] "data.frame"
dim(Cars93)
## [1] 93 27
head(Cars93)
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 1        Acura Integra   Small      12.9  15.9      18.8       25          31
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18          25
## 3         Audi      90 Compact      25.9  29.1      32.3       20          26
## 4         Audi     100 Midsize      30.8  37.7      44.6       19          26
## 5          BMW    535i Midsize      23.7  30.0      36.2       22          30
## 6        Buick Century Midsize      14.2  15.7      17.3       22          31
##              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 1               None      Front         4        1.8        140 6300
## 2 Driver & Passenger      Front         6        3.2        200 5500
## 3        Driver only      Front         6        2.8        172 5500
## 4 Driver & Passenger      Front         6        2.8        172 5500
## 5        Driver only       Rear         4        3.5        208 5700
## 6        Driver only      Front         4        2.2        110 5200
##   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1         2890             Yes               13.2          5    177       102
## 2         2335             Yes               18.0          5    195       115
## 3         2280             Yes               16.9          5    180       102
## 4         2535             Yes               21.1          6    193       106
## 5         2545             Yes               21.1          4    186       109
## 6         2565              No               16.4          6    189       105
##   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1    68          37           26.5           11   2705 non-USA Acura Integra
## 2    71          38           30.0           15   3560 non-USA  Acura Legend
## 3    67          37           28.0           14   3375 non-USA       Audi 90
## 4    70          37           31.0           17   3405 non-USA      Audi 100
## 5    69          39           27.0           13   3640 non-USA      BMW 535i
## 6    69          41           28.0           16   2880     USA Buick Century

As we see, the initial sample size is 93 and we have 27 variables. Prior to proceeding to any analysis, it is important to check whether a dataset contains missing values (NA entries); we can do this as follows.

sum(is.na(Cars93))
## [1] 13
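
Optionally, we can also check which columns these NA entries occur in; for example:

names(which(colSums(is.na(Cars93)) > 0))   # columns containing at least one NA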

We see that we have 13 missing (NA) entries in total. It is better to remove the affected rows so as not to run into problems later on. We can do this using the command Cars93 = na.omit(Cars93).

Cars93 = na.omit(Cars93)
n = nrow(Cars93)
n
## [1] 82

So, the sample size of the cleaned data is now 82. Typing ?Cars93 will open the help window.

?Cars93

There we see basic descriptions of the variables contained in the dataset and read that “Cars were selected at random from among 1993 passenger car models that were listed in both the Consumer Reports issue and the PACE Buying Guide. Pickup trucks and Sport/Utility vehicles were eliminated due to incomplete information in the Consumer Reports source. Duplicate models (e.g., Dodge Shadow and Plymouth Sundance) were listed at most once.”

The command attach comes in very handy: it attaches the columns of a data.frame (with their names) to R's search path, so that we can refer to them directly.

attach(Cars93)
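
One caveat: attach can mask other objects that share the column names. If you prefer not to attach, the same quantities are available by explicit indexing; for example:

cor(Cars93$Horsepower, Cars93$Price)   # explicit column access
with(Cars93, cor(Horsepower, Price))   # evaluate an expression inside the data frame
# detach(Cars93) removes the columns from the search path again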

We will begin by focusing on the variable Price (as a potential response) and the variable Horsepower (as a potential covariate). We can initially have a look at the corresponding scatterplot and superimpose on top the regression line obtained from the R function lm. We can also calculate the sample correlation estimate using cor.

plot(Horsepower, Price, col = 2, bty = 'l', pch = 16)
linear.model = lm(Price ~ Horsepower, data = Cars93)
abline(linear.model, col='blue')

cor(Horsepower, Price)
## [1] 0.7872511

2 Using the boot package for non-regression analysis (continued)

We have already seen at the end of Practical 1 that one can use the boot R package for basic bootstrap analysis. Let's revisit this, focusing on the correlation between Price and Horsepower. Initially, we load the library and then define an appropriate function for calculating the correlation, which will be used as input to boot.

library(boot)
boot.cor.f <- function(y, x, data, indices){
  dt<-data[indices,]
  r = cor(dt[,y], dt[,x])
  r
}

Above, data is a data frame, while y and x are names of columns of this data frame. Note that we do not include an additional argument for the type of correlation, so we use the default type in cor, which is Pearson's correlation coefficient. The line dt <- data[indices,] defines a row-wise resampled copy of the data, and cor(dt[,y], dt[,x]) then extracts the two columns that we want. However, since data frames can also be indexed without a comma, we can equivalently drop the commas and write cor(dt[y], dt[x]):

boot.cor.f <- function(y, x, data, indices){
  dt<-data[indices,]
  r = cor(dt[y], dt[x])
  r
}
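
As a quick optional sanity check, calling the statistic with indices = 1:n (i.e., no resampling) should recover the original correlation of about 0.787:

boot.cor.f(y = 'Price', x = 'Horsepower', data = Cars93, indices = 1:nrow(Cars93))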

Recall that we can use names to check the column names of our data frame, just to be sure. Then we use boot in combination with our function boot.cor.f. Below we use 5000 bootstrap replications.

names(Cars93)
##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"
set.seed(1)
bootstrap.cor = boot(y = 'Price', x ='Horsepower', data = Cars93,
                     statistic = boot.cor.f, R=5000)
head(bootstrap.cor$t, n = 10)
##            [,1]
##  [1,] 0.7516637
##  [2,] 0.7725906
##  [3,] 0.8158606
##  [4,] 0.8570860
##  [5,] 0.8637691
##  [6,] 0.7187499
##  [7,] 0.7223043
##  [8,] 0.7909135
##  [9,] 0.7535251
## [10,] 0.7400386
bootstrap.cor
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Cars93, statistic = boot.cor.f, R = 5000, y = "Price", 
##     x = "Horsepower")
## 
## 
## Bootstrap Statistics :
##      original      bias    std. error
## t1* 0.7872511 0.004906231  0.04708936
plot(bootstrap.cor)
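
The bias and standard error in the table above can be reproduced directly from the stored replicates, since bootstrap.cor$t0 holds the original estimate and bootstrap.cor$t the 5000 bootstrap values:

mean(bootstrap.cor$t) - bootstrap.cor$t0  # bootstrap estimate of bias
sd(bootstrap.cor$t)                       # bootstrap standard error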

As we have seen, for confidence intervals we can use boot.ci (type ?boot.ci for help) as follows

boot.ci(bootstrap.cor, conf = 0.95)
## Warning in boot.ci(bootstrap.cor, conf = 0.95): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap.cor, conf = 0.95)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.6901,  0.8746 )   ( 0.6960,  0.8791 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.6954,  0.8785 )   ( 0.6752,  0.8663 )  
## Calculations and Intervals on Original Scale

We get the four bootstrap confidence intervals explained in Practical 1, but again we get a warning message for the “studentized intervals” (these correspond to the bootstrap-\(t\) intervals in the lecture notes, Section 4.4). To get those we need to work a bit harder and define the following function, which implements a nested bootstrap:

boot.cor.nested.f <- function(y, x, data, indices, iter){
  dt <- data[indices,]                 # outer bootstrap sample
  r = cor(dt[y], dt[x])                # correlation on the outer sample
  # Inner (nested) bootstrap: resample the outer sample dt to estimate
  # the variance of r, as needed for studentized intervals.
  nested = boot(data = dt, y = y, x = x,
                R = iter, statistic = boot.cor.f)
  v = var(nested$t, na.rm = TRUE)
  c(r, v)                              # return the statistic and its variance estimate
}

Note that the argument iter (the number of inner bootstrap samples of correlations!) controls the nested boot loop, which resamples the current bootstrap sample and calls upon our previously defined function boot.cor.f. Without going into too much detail about the above function, we can now re-run boot using statistic = boot.cor.nested.f (below with 1000 outer and 100 inner bootstrap replications) and obtain all five confidence intervals.

bootstrap.cor = boot(y = 'Price', x ='Horsepower', data = Cars93, 
                     statistic = boot.cor.nested.f, R=1000, iter = 100)

boot.ci(bootstrap.cor, conf = 0.95, type = 'all')
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap.cor, conf = 0.95, type = "all")
## 
## Intervals : 
## Level      Normal              Basic             Studentized     
## 95%   ( 0.6886,  0.8751 )   ( 0.6904,  0.8795 )   ( 0.6933,  0.8802 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.6950,  0.8841 )   ( 0.6803,  0.8646 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
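
For intuition, here is a minimal sketch of how a studentized interval is assembled from the two columns of bootstrap.cor$t (the correlations and their inner-bootstrap variances); boot.ci does essentially this internally, so treat the snippet as illustrative rather than the package's exact code:

t0 <- bootstrap.cor$t0[1]                                       # original correlation
v0 <- bootstrap.cor$t0[2]                                       # its variance estimate
z  <- (bootstrap.cor$t[, 1] - t0) / sqrt(bootstrap.cor$t[, 2])  # studentized replicates
t0 - sqrt(v0) * quantile(z, c(0.975, 0.025), na.rm = TRUE)      # bootstrap-t 95% CI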

3 Semi-manual bootstrap linear regression with one predictor

Let us now start to consider bootstrap linear regression, with an initial focus on models with one predictor of the form \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i,\] for \(i\in\{1,\ldots,n\}\). As mentioned, we will consider Price to be the response and Horsepower to be the covariate. We have already fitted this linear model above using linear.model = lm(Price ~ Horsepower, data = Cars93). Some useful functions (compatible with lm) for obtaining interesting quantities are as follows:

summary(linear.model)
## 
## Call:
## lm(formula = Price ~ Horsepower, data = Cars93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9493  -2.8548  -0.6301   1.8289  30.8969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.32148    2.00202   -1.16     0.25    
## Horsepower   0.15357    0.01345   11.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.18 on 80 degrees of freedom
## Multiple R-squared:  0.6198, Adjusted R-squared:  0.615 
## F-statistic: 130.4 on 1 and 80 DF,  p-value: < 2.2e-16
coef(linear.model)              # Estimates of the betas
## (Intercept)  Horsepower 
##  -2.3214825   0.1535693
confint(linear.model)           # Standard CIs for the betas
##                 2.5 %    97.5 %
## (Intercept) -6.305634 1.6626692
## Horsepower   0.126806 0.1803326
sigma(linear.model)             # Estimate of residual standard deviation
## [1] 6.179503
summary(linear.model)$r.squared # The R^2 coefficient
## [1] 0.6197643

Recall that \(R^2\) is the proportion of the variation in the response variable that is explained by the predictor variable(s). When we have only one predictor this is just the squared correlation; indeed we have

cor(Horsepower, Price)^2
## [1] 0.6197643

Now, we have seen in lectures how to perform fully-manual linear regression using the OLS estimator \[ \hat{\boldsymbol{\beta}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}. \] This is fine, of course, but we might be interested in other quantities as well (such as the ones presented above), and it is here that the lm function comes in handy, as we can incorporate its results in a function of our own.
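
As a quick sanity check (a minimal sketch; recall that Horsepower and Price are directly available since we used attach), the matrix formula reproduces the coefficients returned by lm:

X = cbind(1, Horsepower)                        # design matrix with a column of 1s
beta.hat = solve(t(X) %*% X) %*% t(X) %*% Price # OLS estimates via the matrix formula
beta.hat                                        # should match coef(linear.model)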

3.1 Nonparametric paired bootstrap

The below function performs Nonparametric Paired Bootstrap (NPB) regression using lm and stores the regression coefficients \((\hat{\beta}_0,\hat{\beta}_1)\), the residual standard deviation \(\hat{\sigma}\) and \(R^2\) in result = matrix(NA, B, 4), which is a matrix of dimension \((B\times 4)\).

NPB.reg <- function(y, x, data, B){
  n = nrow(data)
  result = matrix(NA, B, 4)
  colnames(result) = c('b0','b1','sigma','Rsquare')
  for(j in 1:B){
    indices = sample(1:n, n, replace = TRUE)  # resample (x, y) pairs row-wise
    reg = lm(y[indices] ~ x[indices])         # refit the model on the bootstrap sample
    betas = coef(reg)
    sigma = sigma(reg)
    R2 = summary(reg)$r.squared
    result[j,] = c(betas, sigma, R2)
  }
  result
}

Let's use NPB regression with \(B = 1000\) and see what we can get.

set.seed(1)
NPB.reg.results = NPB.reg(y = Price, x = Horsepower, data = Cars93, B = 1000)
head(NPB.reg.results)
##              b0        b1    sigma   Rsquare
## [1,] -0.7108329 0.1353524 5.254561 0.6604955
## [2,]  1.3792831 0.1185900 5.569575 0.5562221
## [3,] -3.2611997 0.1583319 5.875815 0.6355520
## [4,] -4.6731001 0.1782806 6.473332 0.6227720
## [5,] -4.5324996 0.1751890 6.696474 0.6156759
## [6,] -4.2214474 0.1717109 6.437548 0.6876424

Let's plot histograms of the bootstrap samples and calculate bootstrap standard errors and percentile intervals.

par(mfrow = c(2,2))
hist(NPB.reg.results[,1], main = expression(hat(beta)[0]), xlab = '', col = 2)
hist(NPB.reg.results[,2], main = expression(hat(beta)[1]), xlab = '', col = 3)
hist(NPB.reg.results[,3], main = expression(hat(sigma)), xlab = '', col = 4)
hist(NPB.reg.results[,4], main = expression(R^2), xlab = '', col = 5)

dev.off()
## null device 
##           1
apply(NPB.reg.results, 2, sd)
##         b0         b1      sigma    Rsquare 
## 2.44841501 0.02067921 1.02448221 0.07490683
apply(NPB.reg.results, 2, quantile, probs = c(0.025, 0.975))
##              b0        b1    sigma   Rsquare
## 2.5%  -7.332912 0.1189282 4.125760 0.4906822
## 97.5%  1.847488 0.1961853 8.056665 0.7768442
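
Note that the basic bootstrap interval can be obtained by reflecting these percentile quantiles around the original estimate; for \(\hat{\beta}_1\), a minimal sketch:

b1.hat = coef(linear.model)[2]                                       # original slope estimate
2 * b1.hat - quantile(NPB.reg.results[, 2], probs = c(0.975, 0.025)) # basic 95% CI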

We can visualise further, presenting a scatterplot of the data points, the fitted regression line from the original data (in black below) and the \(B = 1000\) bootstrapped regression lines (see ?abline for help).

plot(1, bty="l", xlab="Horsepower", ylab="Price", xlim = range(Horsepower),
     ylim = range(Price))
for (j in 1:1000) {
  abline(a = NPB.reg.results[j,1], b = NPB.reg.results[j,2], col = 'pink', lwd = 2)
}
abline(linear.model, col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)

3.2 Semiparametric residual bootstrap

In a similar manner (with slightly more complicated coding) we can use lm for Semiparametric Residual Bootstrap (SRB) regression. As we saw in lectures, we now resample from the fitted residuals of the original data and then construct the response variable from the OLS coefficients, the predictor matrix and the bootstrap residuals. The below function does that for us.

SRB.reg <- function(y, x, data, B){
  n = nrow(data)
  result = matrix(NA, B, 4)
  colnames(result) = c('b0','b1','sigma','Rsquare')
  res = lm(y~x)$residuals         # residuals from original data
  beta.ols = coef(lm(y~x))        # OLS estimates from original data
  X = cbind(1,x)                  # Matrix containing a vector of 1s and the predictor
  for(j in 1:B){
    indices = sample(1:n, n, replace = TRUE)
    res.boot = res[indices]
    y.boot = X %*% beta.ols + res.boot
    reg = lm(y.boot ~ x)
    betas = coef(reg)
    sigma = sigma(reg)
    R2 = summary(reg)$r.squared
    result[j,] = c(betas, sigma, R2)
  }
  result
}

Now, with code similar to that presented above, we can execute SRB regression and obtain quantities of interest; for instance:

set.seed(1)
SRB.reg.results = SRB.reg(y = Price, x = Horsepower, data = Cars93, B = 1000)

apply(SRB.reg.results, 2, sd)
##         b0         b1      sigma    Rsquare 
## 2.02362663 0.01370479 1.02986016 0.08821007
apply(SRB.reg.results, 2, quantile, probs = c(0.025, 0.975))
##              b0        b1    sigma   Rsquare
## 2.5%  -6.300754 0.1272001 4.148172 0.4532511
## 97.5%  1.692494 0.1817538 8.057334 0.7961323

We can now also plot figures of regression lines side by side:

par(mfrow = c(1,2))
# NPB
plot(1, bty="l", xlab="Horsepower", ylab="Price", main = 'Nonparametric paired', 
     xlim = range(Horsepower), ylim = range(Price))
for (j in 1:1000) {
  abline(a = NPB.reg.results[j,1], b = NPB.reg.results[j,2], col='pink', lwd = 2)
}
abline(linear.model, col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)
#SRB
plot(1, bty="l", xlab="Horsepower", ylab="Price", main = 'Semiparametric residual', 
     xlim = range(Horsepower), ylim = range(Price))
for (j in 1:1000) {
  abline(a = SRB.reg.results[j,1], b = SRB.reg.results[j,2], col='lightgreen', lwd = 2)
}
abline(linear.model, col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)

4 Multiple linear regression with R package car

We could write our own functions (similar to the ones considered above) using lm to extend to multiple linear regression models of the form \[ y_i = \beta_0 +\beta_1 x_{i1}+ \ldots + \beta_p x_{ip} + \epsilon_i.\] Fortunately, we do not have to do that, as we can use the R package car (standing for “Companion to Applied Regression”), which can automatically call boot functionality for regression problems. To install the package type install.packages('car'). Then load it and call the help file for the function Boot, which is the function of interest.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit

4.1 Basic functionality of Boot()

This function provides a simple front-end to the boot function, tailored to bootstrapping regression models. The Boot function has only a few basic arguments.

?Boot
## starting httpd help server ... done
Boot
## function (object, f = coef, labels = names(f(object)), R = 999, 
##     method = c("case", "residual"), ncores = 1, ...) 
## {
##     UseMethod("Boot")
## }
## <bytecode: 0x0000028ecba358f0>
## <environment: namespace:car>
  • object: must be a regression-model object, such as the object linear.model we obtained previously from lm().
  • f: function to be computed for the model on each bootstrap replication. The default is the coef() function. This can be extended to return more than one quantity, as we will see.
  • labels: argument providing name(s) for the quantities that are kept on each bootstrap iteration.
  • R: number of bootstrap replications (default is 999).
  • method: can be set either to “case” (the default), for case resampling (NPB regression), or to “residual”, for residual resampling (SRB regression).
  • ncores: a numeric argument that specifies the number of cores for parallel processing (advanced; not needed here).
  • ...: permits passing additional arguments to the boot() function.

4.2 An example on the use of Boot() with dataset Cars93

Lets have a look again at the first rows of Cars93.

head(Cars93, n = 2)
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 1        Acura Integra   Small      12.9  15.9      18.8       25          31
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18          25
##              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 1               None      Front         4        1.8        140 6300
## 2 Driver & Passenger      Front         6        3.2        200 5500
##   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1         2890             Yes               13.2          5    177       102
## 2         2335             Yes               18.0          5    195       115
##   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1    68          37           26.5           11   2705 non-USA Acura Integra
## 2    71          38           30.0           15   3560 non-USA  Acura Legend

Suppose we are interested in a linear model where Price is again the response but now we consider Horsepower, Fuel.tank.capacity and Passengers as predictor variables. We can fit this model as follows.

linear.model1 = lm(Price ~ Horsepower + Fuel.tank.capacity + Passengers)
summary(linear.model1)
## 
## Call:
## lm(formula = Price ~ Horsepower + Fuel.tank.capacity + Passengers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.745  -2.912  -0.501   1.872  31.789 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -10.71665    5.15147  -2.080   0.0408 *  
## Horsepower           0.12153    0.02245   5.413 6.63e-07 ***
## Fuel.tank.capacity   0.64762    0.41396   1.564   0.1218    
## Passengers           0.49466    1.09750   0.451   0.6534    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.102 on 78 degrees of freedom
## Multiple R-squared:  0.6386, Adjusted R-squared:  0.6247 
## F-statistic: 45.93 on 3 and 78 DF,  p-value: < 2.2e-16
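
Incidentally, because we relied on attach, the call above does not pass a data argument. An equivalent (and arguably more robust) call specifies the data frame explicitly; a minor variation, shown here as a sketch:

linear.model1 = lm(Price ~ Horsepower + Fuel.tank.capacity + Passengers, data = Cars93)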

In this example the Fuel.tank.capacity and Passengers variables do not seem very valuable: their coefficients are not statistically significant and, accordingly, \(R^2\) has not increased much. Nevertheless, for the sake of demonstration let's continue.

To start, we can use Boot as follows and then apply the summary() command.

set.seed(1)
NPB.results1 = Boot(linear.model1, R=1000)
summary(NPB.results1)
## 
## Number of bootstrap replications R = 1000 
##                     original   bootBias   bootSE   bootMed
## (Intercept)        -10.71665  0.1354137 4.521220 -10.44808
## Horsepower           0.12153  0.0042572 0.034251   0.12280
## Fuel.tank.capacity   0.64762 -0.0304775 0.424594   0.62988
## Passengers           0.49466 -0.0362997 0.899936   0.50852

As we see, this returns the OLS estimates along with NPB (the default option) estimates of bias, standard error and median. We can further use the additional argument high.moments = TRUE in summary() to get estimates of skewness and kurtosis (as a reminder, see the Wikipedia pages here and here, respectively).

summary(NPB.results1, high.moments = TRUE)
## 
## Number of bootstrap replications R = 1000 
##                     original   bootBias   bootSE   bootMed  bootSkew
## (Intercept)        -10.71665  0.1354137 4.521220 -10.44808 -0.147825
## Horsepower           0.12153  0.0042572 0.034251   0.12280  0.479270
## Fuel.tank.capacity   0.64762 -0.0304775 0.424594   0.62988 -0.078434
## Passengers           0.49466 -0.0362997 0.899936   0.50852 -0.267173
##                    bootKurtosis
## (Intercept)            0.135168
## Horsepower             0.155500
## Fuel.tank.capacity     0.014498
## Passengers             1.038316

Now let's say that we also want to include the bootstrap estimates of the error standard deviation and of the \(R^2\) statistic, and that we want to be explicit about the method we use; i.e., NPB regression. We can adjust the above R code as follows.

set.seed(1)
NPB.results2 = Boot(linear.model1, method = 'case', 
                    f = function(model){c(coef(model), sigma(model), summary(model)$r.squared)},
                    labels = c(names(coef(linear.model1)), 'sigmaHat', 'Rsquared'), R = 1000)

summary(NPB.results2, high.moments = TRUE)
## 
## Number of bootstrap replications R = 1000 
##                     original   bootBias   bootSE   bootMed  bootSkew
## (Intercept)        -10.71665  0.1354137 4.521220 -10.44808 -0.147825
## Horsepower           0.12153  0.0042572 0.034251   0.12280  0.479270
## Fuel.tank.capacity   0.64762 -0.0304775 0.424594   0.62988 -0.078434
## Passengers           0.49466 -0.0362997 0.899936   0.50852 -0.267173
## sigmaHat             6.10158 -0.2520093 1.081002   5.86923  0.176454
## Rsquared             0.63856  0.0194506 0.072615   0.65640  0.053391
##                    bootKurtosis
## (Intercept)            0.135168
## Horsepower             0.155500
## Fuel.tank.capacity     0.014498
## Passengers             1.038316
## sigmaHat              -0.390158
## Rsquared              -0.389620

What about CIs? We can again use the confint() command, although this time we need to be specific about which CI we want; note also that there is no option for studentized (bootstrap-\(t\)) intervals.

confint(NPB.results2, level= 0.95, type="norm")
## Bootstrap normal confidence intervals
## 
##                           2.5 %     97.5 %
## (Intercept)        -19.71349641 -1.9906380
## Horsepower           0.05014662  0.1844088
## Fuel.tank.capacity  -0.15409654  1.5102827
## Passengers          -1.23288322  2.2948016
## sigmaHat             4.23486207  8.4723102
## Rsquared             0.47678758  0.7614340
confint(NPB.results2, level= 0.95, type="basic")
## Bootstrap basic confidence intervals
## 
##                           2.5 %     97.5 %
## (Intercept)        -19.38130883 -2.0873697
## Horsepower           0.03931393  0.1756429
## Fuel.tank.capacity  -0.15094712  1.5451919
## Passengers          -1.12828207  2.2617890
## sigmaHat             4.18375995  8.2514854
## Rsquared             0.48208848  0.7571511
confint(NPB.results2, level= 0.95, type="perc")
## Bootstrap percent confidence intervals
## 
##                           2.5 %     97.5 %
## (Intercept)        -19.34593721 -2.0519981
## Horsepower           0.06742677  0.2037558
## Fuel.tank.capacity  -0.24996066  1.4461784
## Passengers          -1.27247010  2.1176009
## sigmaHat             3.95166837  8.0193939
## Rsquared             0.51997171  0.7950343
confint(NPB.results2, level= 0.95, type="bca")
## Warning in norm.inter(t, adj.alpha): extreme order statistics used as endpoints
## Bootstrap bca confidence intervals
## 
##                          2.5 %     97.5 %
## (Intercept)        -20.0056873 -2.4923130
## Horsepower           0.0659549  0.1992132
## Fuel.tank.capacity  -0.2270605  1.4677597
## Passengers          -1.2715970  2.1284371
## sigmaHat             4.5431185  9.5218231
## Rsquared             0.4820752  0.7678495

We see that for the BCa bootstrap intervals we get a warning message about extreme order statistics being used as endpoints; this can be ignored at this stage.

A nice feature of the function Boot() is its graphical functionality connected to the generic command hist. For instance, the following line of code produces the below plot.

hist(NPB.results2, legend="top")
## Warning in norm.inter(t, adj.alpha): extreme order statistics used as endpoints

We see a separate histogram for each bootstrapped quantity. In addition to the histograms we also get kernel density estimates (you will learn more about these in the Epiphany term) and the normal density based on the bootstrap mean and standard deviation. The vertical dashed line marks the original point estimate, and the thick horizontal line gives a confidence interval based on the bootstrap (the default is the BCa interval). Type ?hist.boot for further information.

In addition, the bootstrapped values of the defined quantities of interest (in our example the regression coefficients, the error standard deviation and \(R^2\)) are all retained in the object NPB.results2$t. For instance, the first 6 rows are the following

head(NPB.results2$t)
##      (Intercept) Horsepower Fuel.tank.capacity Passengers sigmaHat  Rsquared
## [1,]   -9.897919  0.1592634         0.55667452 -0.3922574 8.116354 0.6053351
## [2,]   -2.642872  0.1225939         0.51726929 -0.6977310 5.874785 0.6431785
## [3,]  -16.966994  0.1105785         0.74579302  1.7444184 7.600543 0.5780805
## [4,]   -9.948523  0.1178819         0.20079509  1.7583248 5.734196 0.6619305
## [5,]  -11.962795  0.1104768         0.66871684  1.0050806 6.111675 0.6098824
## [6,]   -8.937864  0.1303570         0.03334818  1.8611176 5.950973 0.6114832
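
Since NPB.results2$t is an ordinary matrix, the manual calculations of Section 3 apply to it directly; for example, percentile intervals by hand:

apply(NPB.results2$t, 2, quantile, probs = c(0.025, 0.975))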

We see that the function Boot() from the R package car generally makes our work easier and provides flexibility. For instance, if we want to apply SRB regression, we can simply use the following code.

set.seed(1)
SRB.results2 = Boot(linear.model1, method = 'residual', 
                    f = function(model){c(coef(model), sigma(model), summary(model)$r.squared)},
                    labels=c(names(coef(linear.model1)), 'sigmaHat', 'Rsquared'),
                    R=1000)

We can then easily provide any of the results presented above; for instance, summaries

summary(SRB.results2, high.moments = TRUE)
## 
## Number of bootstrap replications R = 1000 
##                     original    bootBias   bootSE   bootMed  bootSkew
## (Intercept)        -10.71665  0.32473470 5.021708 -10.62665 -0.018828
## Horsepower           0.12153  0.00019210 0.022720   0.12152  0.182086
## Fuel.tank.capacity   0.64762 -0.00076454 0.416748   0.63200  0.068808
## Passengers           0.49466 -0.06415861 1.111370   0.40527 -0.058098
## sigmaHat             6.10158 -0.01741883 1.100284   6.04756  0.250542
## Rsquared             0.63856  0.00486383 0.089583   0.64868 -0.283203
##                    bootKurtosis
## (Intercept)            -0.16897
## Horsepower              0.52787
## Fuel.tank.capacity      0.20420
## Passengers              0.20946
## sigmaHat               -0.21547
## Rsquared               -0.25919

or the corresponding plot

hist(SRB.results2, legend ="top")
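
or, for instance, confidence intervals:

confint(SRB.results2, level = 0.95, type = "perc")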