First, it is useful to recall that the datasets available in the packages installed in RStudio can be viewed via the command data():
data() # datasets in all installed packages
data(package = 'MASS') # datasets in package MASS
Today we will work with the iris dataset, which is a built-in dataset. First, let's load the dataset and view some of its characteristics.
data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150 5
head(iris, n=5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
help(iris)
The last command opens the help window, where we can find information about these data. As we see, the dataset contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris (150 flowers in total). The species are Iris setosa, versicolor, and virginica.
Initially, let's store the numerical data in an object y, set the sample size n and produce some visualisations of the data.
y = iris[, 1:4]       # numerical measurements only
n = nrow(y)           # sample size
par(mfrow = c(2, 2))  # 2 x 2 grid of plots
for (i in 1:4) {
  hist(y[, i], col = i + 1, main = '', xlab = names(y)[i])
}
Suppose we are initially interested in the variable sepal length. Let's start with some non-parametric bootstrap analysis of this variable.
Suppose that we are interested in the median of sepal length, and in particular in its sampling distribution and variance. We have our data \(x_1,\ldots,x_n\) and we know that \(n=150\) is even (see below), so we are interested in the following estimator:
sepal.length = y[,1]
n
## [1] 150
\[ \hat{m}_n = \frac{1}{2} (x_{(n/2)}+x_{(n/2+1)}) \]
In R we get
median.hat = median(sepal.length)
median.hat
## [1] 5.8
As we have seen in the lectures, we can program the bootstrap manually and obtain the bootstrap variance of the above estimator.
bootstrap_variance_median = function(X, B)
{
  # X: Data
  # B: Number of Monte Carlo samples
  n = length(X)
  mhat = numeric(B)
  for (j in 1:B){
    X.boot = sample(X, n, replace = TRUE)  # Sample bootstrap data from Fn
    mhat[j] = median(X.boot)               # Median bootstrap samples
  }
  var = var(mhat)                          # Evaluate the bootstrap variance estimate
  return(list(var = var, mhat = mhat))
}
B = 10000
set.seed(1)
results = bootstrap_variance_median(sepal.length,B)
var.boot = results$var
var.boot # the bootstrap variance
## [1] 0.01029736
mhat.boot = results$mhat # bootstrap median sample of size B
hist(mhat.boot, xlab = 'Bootstrapped samples of the median',
     main = '', col = 'lightblue2', breaks = 20)
points(median.hat,0,pch = 22,col='red',bg='red')
In the lectures we have seen four types of bootstrap CIs: the normal, the percentile, the bootstrap-\(t\) and the bias-corrected (BC) intervals. These are considered in turn below.
For the normal interval we use the bootstrap standard error
\[ \mathrm{se}_{B}(\hat{\theta}_n)=\hat{\sigma}_{n,B} = \sqrt{\frac{1}{B} \sum_{j=1}^B(\hat{\theta}_n^{*(j)}-\bar{\theta}^*_n)^2} \]
In order to obtain the following CI:
\[ \left[\hat{\theta}_n-z_{1-\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n), \hat{\theta}_n-z_{\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n) \right] \]
Task:
1. Compute the above interval for the median of the variable sepal length using the R function qnorm(). Use a significance level \(\alpha = 0.01\). Store your interval in R as normal.CI (one possible approach is sketched after the result). You should obtain the following result
normal.CI
## [1] 5.538615 6.061385
That is, we get that \[ m\in [5.538615, 6.061385]\] or, rounded to one decimal place, \[m \in [5.5, 6.1].\]
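A possible sketch for Task 1, using median.hat and var.boot computed above (the object alpha is introduced here just for illustration):
alpha = 0.01
se.boot = sqrt(var.boot)  # bootstrap standard error of the median
# CI: [theta_hat - z_{1-alpha/2}*se, theta_hat - z_{alpha/2}*se]
normal.CI = median.hat - qnorm(c(1 - alpha/2, alpha/2)) * se.boot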
We now turn to the percentile CI. As we have discussed, this approach is extremely simple and completely non-parametric; we simply consider the following CI \[[\hat{\kappa}_{\alpha/2}, \hat{\kappa}_{1-\alpha/2}], \] where \(\hat{\kappa}_{x}\) denotes the \(x\)-th quantile (or percentile) of the empirical distribution function of \(\hat{\theta}^*_n\).
Task:
2. Compute the above interval for the median of the variable sepal length using the R function quantile(), again for a significance level \(\alpha = 0.01\). Name your interval in R percentile.CI (a possible sketch follows the result). You should obtain the following result
percentile.CI
## 0.5% 99.5%
## 5.55 6.10
So in this case we have \[ m\in [5.55, 6.1]\] and, as we see, for this variable the two CIs are almost identical.
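A possible sketch for Task 2, using the bootstrap medians mhat.boot from above:
percentile.CI = quantile(mhat.boot, probs = c(0.005, 0.995))  # alpha/2 and 1 - alpha/2 for alpha = 0.01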
The bootstrap-\(t\) CIs are of the following form \[\begin{equation*} \left[\hat{\theta}_n-\zeta_{1-\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n), \hat{\theta}_n-\zeta_{\alpha/2}\mathrm{se}_{B}(\hat{\theta}_n) \right], \end{equation*}\] where \(\zeta_x\) is the \(x\)-th quantile of the values \(\xi^{(j)}_n\), which are given by \[\begin{equation*} \xi^{(j)}_n = \frac{\hat{\theta}_n^{*(j)}-\hat{\theta}_n}{\mathrm{se}(\hat{\theta}_n^{*})}, \end{equation*}\] for \(j = 1,\ldots, B\). Under this approach we have to estimate \(\mathrm{se}(\hat{\theta}_n^{*})\), which means that we have to resample from the bootstrap replicates \(\hat{\theta}^*_n\) in order to get an estimate \(\widehat{\mathrm{se}}_B(\hat{\theta}_n^{*})\).
Tasks:
3. Using set.seed(1), create a sample named mhat.double.boot by sampling with replacement from mhat.boot \(B\) times. Calculate the standard deviation of mhat.double.boot and store it as mhat.se (one possible approach is sketched after the output). The result of this should be
mhat.se
## [1] 0.1005184
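One possible way to do this (a sketch, assuming mhat.boot and B from above):
set.seed(1)
mhat.double.boot = sample(mhat.boot, B, replace = TRUE)  # resample the bootstrap medians
mhat.se = sd(mhat.double.boot)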
Note that this approach is quite approximate for obtaining \(\widehat{\mathrm{se}}_B(\hat{\theta}_n^{*})\). A more correct approach would be to produce many bootstrap samples, calculate the standard deviation of each bootstrap sample, and then take the average over all the standard deviations. The code below does this for 1000 samples.
set.seed(1)
B1 = 1000
mhat.se.boot = numeric(B1)  # standard deviation of each double-bootstrap sample
for (j in 1:B1) {
  mhat.double.boot = sample(mhat.boot, B, replace = TRUE)
  mhat.se.boot[j] = sd(mhat.double.boot)
}
mhat.se = mean(mhat.se.boot)
mhat.se
## [1] 0.101479
In practice, for this example, we see that the difference between the two approaches is negligible.
4. Calculate the \(\xi\) values as
xi = (mhat.boot - median.hat)/mhat.se
5. Using the command quantile(), store the \(\zeta_{\alpha/2}\) and \(\zeta_{1-\alpha/2}\) quantiles of the \(\xi\) values in R as zeta.percentiles. The result should be the following:
zeta.percentiles
## 0.5% 99.5%
## -2.463564 2.956277
6. Finally, return the bootstrap-\(t\) CI as an object named bootstrap_t.CI (a possible sketch for Tasks 5 and 6 is given below). The result should be
bootstrap_t.CI
## 99.5% 0.5%
## 5.500009 6.049992
Note that the interval itself is correct, but the names of the lower and upper limits are the wrong way around (they were transferred from zeta.percentiles). To correct this:
names(bootstrap_t.CI) = c('0.5%','99.5%')
bootstrap_t.CI
## 0.5% 99.5%
## 5.500009 6.049992
So in this case we have \[ m\in [5.50, 6.05]\].
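A possible sketch of how Tasks 5 and 6 can be carried out, using xi, mhat.se and median.hat from above (the names carried over by rev() are the ones corrected above):
zeta.percentiles = quantile(xi, probs = c(0.005, 0.995))  # zeta_{alpha/2} and zeta_{1-alpha/2}
# bootstrap-t CI: [theta_hat - zeta_{1-alpha/2}*se, theta_hat - zeta_{alpha/2}*se]
bootstrap_t.CI = median.hat - rev(zeta.percentiles) * mhat.se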
The bias-corrected (BC) intervals are defined as follows \[\begin{equation} [\hat{\kappa}_{p_1}, \hat{\kappa}_{p_2}], \end{equation}\] where \(p_1=\Phi(z_{\alpha/2}+2b_0)\) and \(p_2=\Phi(z_{1-\alpha/2}+2b_0)\), and \(\Phi(\cdot)\) is the standard normal (Gaussian) distribution function. In addition, \(\hat{\kappa}_x\) is the \(x\)-th quantile of the distribution of the bootstrap samples (similar to the notation for the percentile CI), while the quantity \(b_0\) is given by \[\begin{equation*} b_0= \Phi^{-1}\left( \frac{1}{B} \sum_{j=1}^B I\left(\hat{\theta}_n^{*(j)} \le \hat{\theta}_n\right) \right). \end{equation*}\]
Tasks:
7. Combining the R commands qnorm() and pnorm() for \(\Phi^{-1}(\cdot)\) and \(\Phi(\cdot)\), respectively, calculate the quantities b0, p1 and p2. The results are the following:
b0; p1; p2
## [1] 0.7395057
## [1] 0.1363605
## [1] 0.9999749
8. Now find the BC interval, named bc.CI, by simply using the command quantile() (a possible sketch for Tasks 7 and 8 is given after the result):
bc.CI
## 13.63605% 99.99749%
## 5.7 6.2
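One possible sketch for Tasks 7 and 8, using mhat.boot and median.hat from above and \(\alpha = 0.01\):
b0 = qnorm(mean(mhat.boot <= median.hat))  # Phi^{-1} of the proportion of bootstrap medians <= the estimate
p1 = pnorm(qnorm(0.005) + 2 * b0)
p2 = pnorm(qnorm(0.995) + 2 * b0)
bc.CI = quantile(mhat.boot, probs = c(p1, p2))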
So far we have practiced calculating CIs manually. Fortunately, we do not need to perform such calculations by hand, as there are existing R packages available to us. In this practical we will use the package boot. Type
library(boot)
to load the package. If that does not work immediately, you must first install the package using install.packages('boot').
Suppose that we want to use the package in order to find CIs for the medians of sepal length and petal length, and for Spearman’s rank correlation coefficient (see the Wikipedia page) between these two. First we have to create a function that calculates the statistic(s) of interest from bootstrap-resampled data. It should have at least two arguments: a dataset and a vector containing the indices of the elements of the dataset that were picked to create a bootstrap sample. If we wish to calculate CIs for more than one statistic at once, our function has to return them as a single vector.
For our example, it may look like this:
foo <- function(data, indices){
  dt <- data[indices,]
  c(
    cor(dt[,1], dt[,3], method='s'),
    median(dt[,1]),
    median(dt[,3])
  )
}
The function foo selects the desired elements (whose numbers are stored in indices) from data and computes the correlation coefficient of the first and third columns (method='s' is used to choose Spearman’s rank coefficient; method='p' would give Pearson’s coefficient) together with their medians. Above, dt corresponds to the resampled data, and we use dt[,1] and dt[,3] for sepal length and petal length, respectively.
We can also add extra arguments; e.g., the type of correlation coefficient:
foo <- function(data, indices, cor.type){
  dt <- data[indices,]
  c(
    cor(dt[,1], dt[,3], method=cor.type),
    median(dt[,1]),
    median(dt[,3])
  )
}
Now, we can use the boot function. We have to supply it with the
dataset, the function that we have just created, the number of bootstrap
repetitions (R
) and any additional arguments of our
function (like cor.type
). Below, we use
set.seed
for reproducibility of this example.
set.seed(1)
myBootstrap <- boot(iris, foo, R=10000, cor.type='s')
The command returns an object of class boot. It has two interesting elements. The first, $t, contains the R values of our statistic(s) generated by the bootstrap procedure:
head(myBootstrap$t, n = 10)
## [,1] [,2] [,3]
## [1,] 0.8674138 5.70 4.20
## [2,] 0.8558414 5.70 4.20
## [3,] 0.8686652 5.90 4.50
## [4,] 0.9108656 5.70 4.20
## [5,] 0.8814744 5.70 4.10
## [6,] 0.8683065 5.90 4.50
## [7,] 0.8732391 5.85 4.40
## [8,] 0.8990328 5.90 4.30
## [9,] 0.8857161 5.75 4.25
## [10,] 0.8547038 5.70 4.25
The second, $t0, contains the estimates of the statistic(s) computed from the original dataset:
myBootstrap$t0
## [1] 0.8818981 5.8000000 4.3500000
Printing the boot object to the console gives some further information:
myBootstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = iris, statistic = foo, R = 10000, cor.type = "s")
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.8818981 -0.003292796 0.01928435
## t2* 5.8000000 -0.011935000 0.10113767
## t3* 4.3500000 -0.033425000 0.16331622
As we see, this returns the sample estimates of the statistics of interest (the correlation and the two medians) as well as the bootstrap bias and standard error estimates! Recall from the lectures that the bias estimate is \[\begin{equation*} \hat{b}_{n,B} = \left( \frac{1}{B} \sum_{j=1}^B\hat{\theta}_n^{*(j)}\right)-\hat{\theta}_n, \end{equation*}\] so this is essentially obtained in the following way:
colMeans(myBootstrap$t)-myBootstrap$t0
## [1] -0.003292796 -0.011935000 -0.033425000
while the bootstrap standard error can be calculated as
apply(myBootstrap$t,2,sd)
## [1] 0.01928435 0.10113767 0.16331622
Before we move on to CIs, it is always worth taking a look at the distribution of the bootstrap realizations. We can use the plot function, with index indicating which of the statistics computed in foo we wish to look at. Here index=1 is the Spearman’s correlation coefficient between sepal length and petal length.
plot(myBootstrap, index = 1)
The distribution of the bootstrap correlation coefficients seems quite normal-like.
Task:
9. Produce the corresponding plots for the median of sepal length (a possible command is sketched below). The resulting plot for the median will look like this
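Since the median of sepal length is the second statistic returned by foo, one possible command is (a sketch):
plot(myBootstrap, index = 2)  # median of sepal length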
Now let’s find CIs for the correlation coefficient. We can use boot.ci(). It defaults to 95% CIs, but this can be changed with the conf parameter. Type ?boot.ci to view the help file. Below we obtain 99% CIs.
boot.ci(myBootstrap, index=1, conf = 0.99)
## Warning in boot.ci(myBootstrap, index = 1, conf = 0.99): bootstrap variances
## needed for studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myBootstrap, conf = 0.99, index = 1)
##
## Intervals :
## Level Normal Basic
## 99% ( 0.8355, 0.9349 ) ( 0.8420, 0.9401 )
##
## Level Percentile BCa
## 99% ( 0.8237, 0.9218 ) ( 0.8263, 0.9229 )
## Calculations and Intervals on Original Scale
We see four confidence intervals: normal, basic, percentile and BCa (bias-corrected and accelerated).
Notice: there is a warning message that the studentized intervals are not returned because bootstrap variances are required. These are the bootstrap-\(t\) CIs described in subsection 3.3. We will see how we can obtain those in future practicals.
Task:
10. Use the command boot.ci() to get the following four intervals for the medians of sepal length and petal length (the commands used are sketched after the output).
## Warning in boot.ci(myBootstrap, index = 2): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myBootstrap, index = 2)
##
## Intervals :
## Level Normal Basic
## 95% ( 5.614, 6.010 ) ( 5.600, 6.000 )
##
## Level Percentile BCa
## 95% ( 5.6, 6.0 ) ( 5.6, 6.0 )
## Calculations and Intervals on Original Scale
## Warning in boot.ci(myBootstrap, index = 3): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myBootstrap, index = 3)
##
## Intervals :
## Level Normal Basic
## 95% ( 4.063, 4.704 ) ( 4.150, 4.700 )
##
## Level Percentile BCa
## 95% ( 4.00, 4.55 ) ( 4.00, 4.50 )
## Calculations and Intervals on Original Scale
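As the CALL lines in the output above show, these intervals are obtained with the default conf = 0.95 via:
boot.ci(myBootstrap, index = 2)  # median of sepal length
boot.ci(myBootstrap, index = 3)  # median of petal length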