First, it is useful to recall that the datasets available in the packages installed in RStudio can be viewed via the command data():
data() # datasets in all installed packages
data(package = 'MASS') # datasets in package MASS
Today we will work with the iris dataset, which is a built-in dataset. First, let's load the dataset and view some of its characteristics.
data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150 5
head(iris, n=5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
help(iris)
The last command opens the help window, where we can find information about these data. As we see, the dataset contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris (150 flowers in total). The species are Iris setosa, versicolor, and virginica.
Initially, let's store the numerical data in an object y, set the sample size n and produce some visualisations of the data.
y = iris[, 1:4]       # numerical measurements only
n = nrow(y)           # sample size
par(mfrow = c(2, 2))  # 2 x 2 grid of plots
for (i in 1:4) {
  hist(y[, i], col = i + 1, main = '', xlab = names(y)[i])
}
Suppose we are initially interested in the variable sepal length. Let's start with some non-parametric bootstrap analysis of this variable.
Suppose that we are interested in the median of sepal length, and in particular in its sampling distribution and variance. We have our data \(x_1,\ldots,x_n\) and we know that \(n=150\) is even (see below), so we are interested in the following estimator:
sepal.length = y[,1]
n
## [1] 150
\[ \hat{m}_n = \frac{1}{2} (x_{(n/2)}+x_{(n/2+1)}) \]
In R we get
median.hat = median(sepal.length)
median.hat
## [1] 5.8
As we have seen in the lectures, we can program the bootstrap manually and obtain the bootstrap variance of the above estimator.
bootstrap_variance_median = function(X, B)
{
  # X: Data
  # B: Number of Monte Carlo samples
  n = length(X)
  mhat = numeric(B)
  for (j in 1:B){
    X.boot = sample(X, n, replace = TRUE)  # Sample bootstrap data from Fn
    mhat[j] = median(X.boot)               # Median bootstrap samples
  }
  var = var(mhat)                          # Evaluate the bootstrap variance estimate
  return(list(var = var, mhat = mhat))
}
B = 10000
set.seed(1)
results = bootstrap_variance_median(sepal.length,B)
var.boot = results$var
var.boot # the bootstrap variance
## [1] 0.01029736
mhat.boot = results$mhat # bootstrap median sample of size B
hist(mhat.boot, xlab = 'Bootstrapped samples of the median',
     main = '', col = 'lightblue2', breaks = 20)
points(median.hat,0,pch = 22,col='red',bg='red')
In the lectures we have seen four types of bootstrap CIs: the normal, the percentile, the bootstrap-\(t\) and the bias-corrected (BC) intervals. These are considered in turn below.
For the normal interval we use the bootstrap standard error
\[ \mathrm{se}_{B}(\hat{\theta}_n)=\hat{\sigma}_{n,B} = \sqrt{\frac{1}{B} \sum_{j=1}^B(\hat{\theta}_n^{*(j)}-\bar{\theta}^*_n)^2} \]
In order to obtain the following CI:
\[ \left[\hat{\theta}_n-z_{1-\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n), \hat{\theta}_n-z_{\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n) \right] \]
Task:
1. Compute the above interval for the median of the variable sepal length using the R function qnorm(). Use a significance level \(\alpha = 0.01\). Store your interval in R as normal.CI (one possible approach is sketched after the result). You should obtain the following result
normal.CI
## [1] 5.538615 6.061385
That is, we get that \[ m\in [5.538615, 6.061385]\] or, rounded to one decimal place, \[m \in [5.5, 6.1].\]
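A possible sketch for Task 1, using median.hat and var.boot computed above (the object alpha is introduced here just for illustration):
alpha = 0.01
se.boot = sqrt(var.boot)  # bootstrap standard error of the median
# CI: [theta_hat - z_{1-alpha/2}*se, theta_hat - z_{alpha/2}*se]
normal.CI = median.hat - qnorm(c(1 - alpha/2, alpha/2)) * se.boot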
We now turn to the percentile CI. As we have discussed, this approach is extremely simple and completely non-parametric; we simply consider the following CI \[[\hat{\kappa}_{\alpha/2}, \hat{\kappa}_{1-\alpha/2}], \] where \(\hat{\kappa}_{x}\) denotes the \(x\)-th quantile (or percentile) of the empirical distribution function of \(\hat{\theta}^*_n\).
Task:
2. Compute the above interval for the median of the variable sepal length using the R function quantile(), again for a significance level \(\alpha = 0.01\). Name your interval in R percentile.CI (a possible sketch follows the result). You should obtain the following result
percentile.CI
## 0.5% 99.5%
## 5.55 6.10
So in this case we have \[ m\in [5.55, 6.1]\] and, as we see, for this variable the two CIs are almost identical.
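A possible sketch for Task 2, using the bootstrap medians mhat.boot from above:
percentile.CI = quantile(mhat.boot, probs = c(0.005, 0.995))  # alpha/2 and 1 - alpha/2 for alpha = 0.01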
The bootstrap-\(t\) CIs are of the following form \[\begin{equation*} \left[\hat{\theta}_n-\zeta_{1-\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n), \hat{\theta}_n-\zeta_{\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n) \right], \end{equation*}\] where \(\zeta_x\) is the \(x\)-th quantile of the values \(\xi^{(j)}_n\), which are given by \[\begin{equation*} \xi^{(j)}_n = \frac{\hat{\theta}_n^{*(j)}-\hat{\theta}_n}{\mathrm{se}(\hat{\theta}_n^{*})}, \end{equation*}\] for \(j = 1,\ldots, B\). Under this approach we have to estimate \(\mathrm{se}(\hat{\theta}_n^{*})\), which means that we have to resample from the bootstrap replicates \(\hat{\theta}^*_n\) in order to get an estimate \(\widehat{\mathrm{se}}_B(\hat{\theta}_n^{*})\).
Tasks:
3. Using set.seed(1), create a sample named mhat.double.boot by sampling with replacement from mhat.boot \(B\) times. Calculate the standard deviation of mhat.double.boot and store it as mhat.se (one possible approach is sketched after the output). The result of this should be
mhat.se
## [1] 0.1005184
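One possible way to do this (a sketch, assuming mhat.boot and B from above):
set.seed(1)
mhat.double.boot = sample(mhat.boot, B, replace = TRUE)  # resample the bootstrap medians
mhat.se = sd(mhat.double.boot)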
Note that this approach is quite approximate for obtaining \(\widehat{\mathrm{se}}_B(\hat{\theta}_n^{*})\). A more correct approach would be to produce many bootstrap samples, calculate the standard deviation of each bootstrap sample, and then take the average over all the standard deviations. The code below does this for 1000 samples.
set.seed(1)
B1 = 1000
mhat.se.boot = numeric(B1)  # standard deviation of each double-bootstrap sample
for (j in 1:B1) {
  mhat.double.boot = sample(mhat.boot, B, replace = TRUE)
  mhat.se.boot[j] = sd(mhat.double.boot)
}
mhat.se = mean(mhat.se.boot)
mhat.se
## [1] 0.101479
In practice, for this example, we see that the difference between the two approaches is negligible.
4. Calculate the \(\xi\) values as
xi = (mhat.boot - median.hat)/mhat.se
5. Using the command quantile(), store the \(\zeta_{\alpha/2}\) and \(\zeta_{1-\alpha/2}\) quantiles of the \(\xi\) values in R as zeta.percentiles. The result should be the following:
zeta.percentiles
## 0.5% 99.5%
## -2.463564 2.956277
6. Finally, return the bootstrap-\(t\) CI as an object named bootstrap_t.CI (a possible sketch for Tasks 5 and 6 is given below). The result should be
bootstrap_t.CI
## 99.5% 0.5%
## 5.500009 6.049992
Note that the interval itself is correct, but the names of the lower and upper limits are the wrong way around (they were transferred from zeta.percentiles). To correct this:
names(bootstrap_t.CI) = c('0.5%','99.5%')
bootstrap_t.CI
## 0.5% 99.5%
## 5.500009 6.049992
So in this case we have \[ m\in [5.50, 6.05]\].
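A possible sketch of how Tasks 5 and 6 can be carried out, using xi, mhat.se and median.hat from above (the names carried over by rev() are the ones corrected above):
zeta.percentiles = quantile(xi, probs = c(0.005, 0.995))  # zeta_{alpha/2} and zeta_{1-alpha/2}
# bootstrap-t CI: [theta_hat - zeta_{1-alpha/2}*se, theta_hat - zeta_{alpha/2}*se]
bootstrap_t.CI = median.hat - rev(zeta.percentiles) * mhat.se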
The bias-corrected (BC) intervals are defined as follows \[\begin{equation} [\hat{\kappa}_{p_1}, \hat{\kappa}_{p_2}], \end{equation}\] where \(p_1=\Phi(z_{\alpha/2}+2b_0)\) and \(p_2=\Phi(z_{1-\alpha/2}+2b_0)\), and \(\Phi(\cdot)\) is the standard normal (Gaussian) distribution function. In addition, \(\hat{\kappa}_x\) is the \(x\)-th quantile of the distribution of the bootstrap samples (similar to the notation for the percentile CI), while the quantity \(b_0\) is given by \[\begin{equation*} b_0= \Phi^{-1}\left( \frac{1}{B} \sum_{j=1}^B I\left(\hat{\theta}_n^{*(j)} \le \hat{\theta}_n\right) \right). \end{equation*}\]
Tasks:
7. Combining the R commands qnorm() and pnorm() for \(\Phi^{-1}(\cdot)\) and \(\Phi(\cdot)\), respectively, calculate the quantities b0, p1 and p2. The results are the following:
b0; p1; p2
## [1] 0.7395057
## [1] 0.1363605
## [1] 0.9999749
8. Now find the BC interval, named bc.CI, by simply using the command quantile() (a possible sketch for Tasks 7 and 8 is given after the result):
bc.CI
## 13.63605% 99.99749%
## 5.7 6.2
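One possible sketch for Tasks 7 and 8, using mhat.boot and median.hat from above and \(\alpha = 0.01\):
b0 = qnorm(mean(mhat.boot <= median.hat))  # Phi^{-1} of the proportion of bootstrap medians <= the estimate
p1 = pnorm(qnorm(0.005) + 2 * b0)
p2 = pnorm(qnorm(0.995) + 2 * b0)
bc.CI = quantile(mhat.boot, probs = c(p1, p2))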
So far we have practiced calculating CIs manually. Fortunately, we do not need to perform such calculations by hand, as there are existing R packages available to us. In this practical we will use the package boot. Type
library(boot)
to load the package. If that does not work immediately, you must first install the package using install.packages('boot').
Suppose that we want to use the package in order to find CIs for the medians of sepal length and petal length, and for Spearman’s rank correlation coefficient (see the Wikipedia page) between these two. First we have to create a function that calculates the statistic(s) of interest from bootstrap-resampled data. It should have at least two arguments: a dataset and a vector containing the indices of the elements of the dataset that were picked to create a bootstrap sample. If we wish to calculate CIs for more than one statistic at once, our function has to return them as a single vector.
For our example, it may look like this:
foo <- function(data, indices){
  dt <- data[indices,]
  c(
    cor(dt[,1], dt[,3], method='s'),
    median(dt[,1]),
    median(dt[,3])
  )
}
The function foo selects the desired elements (whose numbers are stored in indices) from data and computes the correlation coefficient of the first and third columns (method='s' is used to choose Spearman’s rank coefficient; method='p' would give Pearson’s coefficient) together with their medians. Above, dt corresponds to the resampled data, and we use dt[,1] and dt[,3] for sepal length and petal length, respectively.
We can also add extra arguments; e.g., the type of correlation coefficient:
foo <- function(data, indices, cor.type){
  dt <- data[indices,]
  c(
    cor(dt[,1], dt[,3], method=cor.type),
    median(dt[,1]),
    median(dt[,3])
  )
}
Now, we can use the boot function. We have to supply it with the
dataset, the function that we have just created, the number of bootstrap
repetitions (R
) and any additional arguments of our
function (like cor.type
). Below, we use
set.seed
for reproducibility of this example.
set.seed(1)
myBootstrap <- boot(iris, foo, R=10000, cor.type='s')
The command returns an object of class boot. It has two interesting elements. The first, $t, contains the R values of our statistic(s) generated by the bootstrap procedure:
head(myBootstrap$t, n = 10)
## [,1] [,2] [,3]
## [1,] 0.8674138 5.70 4.20
## [2,] 0.8558414 5.70 4.20
## [3,] 0.8686652 5.90 4.50
## [4,] 0.9108656 5.70 4.20
## [5,] 0.8814744 5.70 4.10
## [6,] 0.8683065 5.90 4.50
## [7,] 0.8732391 5.85 4.40
## [8,] 0.8990328 5.90 4.30
## [9,] 0.8857161 5.75 4.25
## [10,] 0.8547038 5.70 4.25
The second, $t0, contains the estimates of the statistic(s) computed from the original dataset:
myBootstrap$t0
## [1] 0.8818981 5.8000000 4.3500000
Printing the boot object to the console gives some further information:
myBootstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = iris, statistic = foo, R = 10000, cor.type = "s")
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.8818981 -0.003292796 0.01928435
## t2* 5.8000000 -0.011935000 0.10113767
## t3* 4.3500000 -0.033425000 0.16331622
As we see, this returns the sample estimates of the statistics of interest (the correlation and the two medians) as well as the bootstrap bias and standard error estimates! Recall from the lectures that the bias estimate is \[\begin{equation*} \hat{b}_{n,B} = \left( \frac{1}{B} \sum_{j=1}^B\hat{\theta}_n^{*(j)}\right)-\hat{\theta}_n, \end{equation*}\] so this is essentially obtained in the following way:
colMeans(myBootstrap$t)-myBootstrap$t0
## [1] -0.003292796 -0.011935000 -0.033425000
while the bootstrap standard error can be calculated as
apply(myBootstrap$t,2,sd)
## [1] 0.01928435 0.10113767 0.16331622
Before we move on to CIs, it is always worth taking a look at the distribution of the bootstrap realizations. We can use the plot function, with index indicating which of the statistics computed in foo we wish to look at. Here index=1 is the Spearman’s correlation coefficient between sepal length and petal length.
plot(myBootstrap, index = 1)
The distribution of the bootstrap correlation coefficients seems quite normal-like.
Task:
9. Produce the corresponding plots for the median of sepal length (a possible command is sketched below). The resulting plot for the median will look like this
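Since the median of sepal length is the second statistic returned by foo, one possible command is (a sketch):
plot(myBootstrap, index = 2)  # median of sepal length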
Now let’s find CIs for the correlation coefficient. We can use boot.ci(). It defaults to 95% CIs, but this can be changed with the conf parameter. Type ?boot.ci to view the help file. Below we obtain 99% CIs.
boot.ci(myBootstrap, index=1, conf = 0.99)
## Warning in boot.ci(myBootstrap, index = 1, conf = 0.99): bootstrap variances
## needed for studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myBootstrap, conf = 0.99, index = 1)
##
## Intervals :
## Level Normal Basic
## 99% ( 0.8355, 0.9349 ) ( 0.8420, 0.9401 )
##
## Level Percentile BCa
## 99% ( 0.8237, 0.9218 ) ( 0.8263, 0.9229 )
## Calculations and Intervals on Original Scale
We see four confidence intervals: normal, basic, percentile and BCa (bias-corrected and accelerated).
Notice: there is a warning message that the studentized intervals are not returned because bootstrap variances are required. These are the bootstrap-\(t\) CIs described in subsection 3.3. We will see how we can obtain those in future practicals.
Task:
10. Use the command boot.ci() to get the following four intervals for the medians of sepal length and petal length (the commands used are sketched after the output).
## Warning in boot.ci(myBootstrap, index = 2): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myBootstrap, index = 2)
##
## Intervals :
## Level Normal Basic
## 95% ( 5.614, 6.010 ) ( 5.600, 6.000 )
##
## Level Percentile BCa
## 95% ( 5.6, 6.0 ) ( 5.6, 6.0 )
## Calculations and Intervals on Original Scale
## Warning in boot.ci(myBootstrap, index = 3): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myBootstrap, index = 3)
##
## Intervals :
## Level Normal Basic
## 95% ( 4.063, 4.704 ) ( 4.150, 4.700 )
##
## Level Percentile BCa
## 95% ( 4.00, 4.55 ) ( 4.00, 4.50 )
## Calculations and Intervals on Original Scale
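As the CALL lines in the output above show, these intervals are obtained with the default conf = 0.95 via:
boot.ci(myBootstrap, index = 2)  # median of sepal length
boot.ci(myBootstrap, index = 3)  # median of petal length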