In this practical, we illustrate both parametric and non-parametric techniques for comparing two samples. Our objective is to assess whether the two samples can be regarded as coming from the same population. We perform the parametric version first, which requires the data to be approximately normally distributed, and then move on to the non-parametric version, which makes no normality assumptions.
In this practical we will make use of R functions met previously:

- c, seq, and length for creating and handling vectors;
- mean, sd, var, min, and max for numerical summaries;
- plot, a histogram with hist, and a boxplot with boxplot for graphics;
- qqnorm for normal quantile plots;
- qnorm and pnorm for normal probabilities;
- rank for finding data ranks.
Eriksen, Björnstad and Götestam (1986) studied a social skills training program for alcoholics. Twenty-three alcohol-dependent male inpatients at an alcohol treatment centre were randomly assigned to two groups. The 12 control group patients were given a traditional treatment programme. The 11 treatment group patients were given the traditional treatment, plus a class in social skills training (“SST”). The patients were monitored for one year at 2-week intervals, and their total alcohol intake over the year was recorded (in cl of pure alcohol).
A: Control | 1042 | 1617 | 1180 | 973 | 1552 | 1251 | 1151 | 1511 | 728 | 1079 | 951 | 1391 |
B: SST | 874 | 389 | 612 | 798 | 1152 | 893 | 541 | 741 | 1064 | 862 | 213 |
- Use the c function to enter these data into R as two vectors called A for the control group, and B for the treatment group.

summary(A)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 728 1025 1166 1202 1421 1617
summary(B)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 213.0 576.5 798.0 739.9 883.5 1152.0
boxplot(A,B)
par(mfrow=c(1,2))
hist(A)
hist(B)
A better graphical way to assess whether data are normally distributed is to draw a normal quantile plot (also called a quantile-quantile, or QQ, plot). We can do this in R using the qqnorm function: to draw the Normal quantile plot for a vector of data x, we use the command
qqnorm(x)
With this technique, the quantiles of the data (i.e. the ordered data values) are plotted against the quantiles that would be expected of a matching normal distribution. If the data are normally distributed, then the points of the QQ plot should lie on an approximately straight line. Deviations from a straight line suggest departures from normality.
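For instance, the following sketch uses simulated data (so the exact picture will vary); qqline adds a reference line through the first and third quartiles, which makes departures from linearity easier to see:

set.seed(1)     # for reproducibility
x <- rnorm(50)  # a sample that really is Normal
qqnorm(x)       # points should lie close to a straight line
qqline(x)       # reference line through the quartiles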
- Draw a normal quantile plot for each of the two samples using qqnorm. Do the samples look approximately Normally distributed? Do your conclusions agree with those made on the basis of the histograms?
We will use the independent sample \(t\)-test to compare these two samples. First, let’s recap the underlying theory (skip this if you remember the setup).
We assume that we have two random quantities \(X\sim{\mathcal{N}\left({\mu_X, \sigma^2}\right)}\) and \(Y \sim {\mathcal{N}\left({\mu_Y, \sigma^2}\right)}\). We obtain an i.i.d. sample of \(n\) observations of \(X\): \(X_1, \dots, X_n\); and a sample of size \(m\) from \(Y\): \(Y_1,\dots, Y_m\). We assume that each \(X_i\) is independent of each \(Y_j\). Note that we have assumed different means for \(X\) and \(Y\), but the same variances.
The question we are asking is: are the means of the two populations equal, i.e. is \(\mu_X=\mu_Y\), given the evidence we see in our samples?
Under the above assumptions, we have the following definitions and results:
The independent sample \(t\)-statistic is: \[ t = \frac{(\bar{x}-\bar{y}) - (\mu_X - \mu_Y)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}, \]
\(s_p^2\) is the pooled sample variance which is used to estimate the common variance \(\sigma^2\) by using data from both samples: \[ s_p^2 = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{n+m-2} \]
The \(t\)-test rejects \(H_0\) at level \(\alpha\) in favour of the alternative \(H_1: \mu_X\neq\mu_Y\) if \[ |t| > t^*_{(m+n-2), \alpha/2} \]
The corresponding \(100(1-\alpha)\%\) confidence interval for \(\mu_X - \mu_Y\) is then \[ (\bar{x}-\bar{y}) \pm t^*_{(m+n-2), \alpha/2} \, s_p \sqrt{\frac{1}{n} + \frac{1}{m}} \]
To compute the \(t\)-test statistic, we’re going to need to begin by finding some simple summaries of our two samples.
- Use the mean, var, and length functions to find the sample mean, sample variance, and sample size for the A (control) group. Save these as abar (\(\bar{a}\)), sa2 (\(s_a^2\)), and n.
- Do the same for the B (treatment) group, saving the results as bbar (\(\bar{b}\)), sb2 (\(s_b^2\)), and m.
- Use these quantities to compute the pooled sample variance; save it as sp2 (\(s_p^2\)).
- Compute the \(t\)-test statistic and save it as t. Check you get the same value as shown below.
- Find the degrees of freedom for the test and save this as df.
## [1] 4.00757
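For reference, one possible way of carrying out these steps is sketched below (this assumes the data were entered as described above; variable names follow the exercise):

A <- c(1042, 1617, 1180, 973, 1552, 1251, 1151, 1511, 728, 1079, 951, 1391)
B <- c(874, 389, 612, 798, 1152, 893, 541, 741, 1064, 862, 213)
abar <- mean(A); sa2 <- var(A); n <- length(A)        # summaries for group A
bbar <- mean(B); sb2 <- var(B); m <- length(B)        # summaries for group B
sp2 <- ((n - 1) * sa2 + (m - 1) * sb2) / (n + m - 2)  # pooled sample variance
t <- (abar - bbar) / (sqrt(sp2) * sqrt(1/n + 1/m))    # t statistic under H0: mu_X = mu_Y
df <- n + m - 2                                       # degrees of freedom
t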
R provides a range of functions to support calculations with standard probability distributions. In the previous practical, we encountered the normal density function dnorm, as well as the random number generation functions for the uniform (runif) and normal (rnorm) distributions.
For every distribution there are four functions. The functions for each distribution begin with a particular letter to indicate the functionality (see table below) followed by the name of the distribution:
Letter | e.g. | Function |
---|---|---|
“d” | dnorm | evaluates the probability density (or mass) function, \(f(x)\) |
“p” | pnorm | evaluates the cumulative distribution function, \(F(x)=P[X \leq x]\), and hence finds the probability that the random variable is at most the given argument |
“q” | qnorm | evaluates the inverse cumulative distribution function (quantiles), \(F^{-1}(q)\), i.e. the value \(x\) such that \(P[X \leq x] = q\); used to obtain critical values associated with particular probabilities \(q\) |
“r” | rnorm | generates random numbers |
The appropriate functions for Normal, \(t\) and \(\chi^2\) distributions are given below, along with the optional parameter arguments.
- Normal: dnorm, pnorm, qnorm, rnorm. Parameters: mean (\(\mu\)) and sd (\(\sigma\)).
- \(t\): dt, pt, qt, rt. Parameter: df.
- \(\chi^2\): dchisq, pchisq, qchisq, rchisq. Parameter: df.

For a list of all the supported distributions, run the command help(Distributions).
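For example, the following calls (all standard R) illustrate the pattern:

pnorm(1.96)                          # P[Z <= 1.96] for Z ~ N(0,1); roughly 0.975
qnorm(0.975)                         # 97.5% quantile of N(0,1); roughly 1.96
pnorm(1200, mean = 1000, sd = 100)   # P[X <= 1200] for X ~ N(1000, 100^2)
qt(0.975, df = 10)                   # 97.5% quantile of the t distribution with 10 df
rchisq(3, df = 5)                    # three random draws from a chi-squared with 5 df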
- Use the qt function to find the critical \(t\) value for the test of \(H_0: \mu_X=\mu_Y\) against \(H_1: \mu_X\neq\mu_Y\) at the 5% level of significance when we have df degrees of freedom. Check you get the same value as below.
## [1] 2.079614
- Compare this critical value with your computed test statistic t, and hence perform the significance test.
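A minimal sketch of this comparison, assuming the objects t, df, abar, bbar, sp2, n, and m computed earlier:

tcrit <- qt(1 - 0.05/2, df)    # two-sided 5% critical value
abs(t) > tcrit                 # TRUE means reject H0 at the 5% level
(abar - bbar) + c(-1, 1) * tcrit * sqrt(sp2) * sqrt(1/n + 1/m)  # 95% CI for mu_X - mu_Y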
Nonparametric methods make only the weakest possible assumptions about the distributions that describe our data, and so they are often used when the normality required by parametric methods is in doubt. The nonparametric equivalent of the independent sample \(t\)-test is the (Wilcoxon) rank-sum test: rather than assuming knowledge of the distributions of \(X\) and \(Y\), it compares the ranks of the data in the two groups.
Given two samples \(X_1,\dots,X_{n}\) and \(Y_1,\dots,Y_{m}\), the rank-sum test procedure is as follows:

1. Combine the two samples into a single set of \(n+m\) observations.
2. Rank the combined observations from smallest to largest.
3. Compute the rank-sum statistic \(W\): the sum of the ranks of the \(X\) observations within the combined sample.
4. Reject the null hypothesis of no difference between the groups if \(W\) is smaller than the lower critical value or larger than the upper critical value (see below).
Note: our choice to base the test on the \(X\)s was entirely arbitrary; we could have used the \(Y\)s instead.
To find the ranks of a vector of data points, we can use the rank function: rank(x) will return the sample ranks of the values in the vector x. The rank function automatically handles ties by assigning the appropriate midrank value to all tied observations; see the R help on rank for more details.
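For example:

rank(c(10, 20, 20, 30))   # returns 1.0 2.5 2.5 4.0; the tied values share the midrank 2.5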
- Use the c function to combine the two samples into one larger vector composed of the elements of the A group, followed by the elements of the B group.
- Use the rank function on this combined sample to rank all the observations; call the result rks.
- The first \(n\) elements of rks should now be the ranks \(r_i\) for the A group in the combined sample. Extract these from the larger vector, and save them as Aranks.
- Sum these ranks to obtain the rank-sum statistic \(W\).
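One possible sketch of these steps, assuming A, B, and n are as before:

AB <- c(A, B)       # combined sample: the A group followed by the B group
rks <- rank(AB)     # ranks of all n + m observations
Aranks <- rks[1:n]  # ranks belonging to the A group
W <- sum(Aranks)    # the rank-sum statistic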
The exact critical value for the rank-sum test is a function of the sample sizes, \(n\) and \(m\), and the significance level, \(\alpha\). For a two-sided test at level \(\alpha\), we denote by \(T_L\) and \(T_U\) the lower and upper critical values, and reject \(H_0\) at level \(\alpha\) in favour of \(H_1:\) ‘Group 1 \(\neq\) Group 2’ if \(W \notin [T_L, T_U]\).
For particular values of \((n, m, \alpha)\), standard tables give the lower critical value \(T_L\), and we compute the upper critical value using the formula \(T_U = n(n+m+1) - T_L\).
Alternatively, we can use R to find the values for us, which is particularly useful as the ranges of \(n,m,\alpha\) on standard tables are quite limited. Run the following lines of code to create a function to return the rank-sum critical value for a given value of \((n, m, \alpha)\):
# qranksum: critical value T of the rank-sum statistic W such that P[W <= T] = p.
# R's qwilcox gives quantiles of the Mann-Whitney U statistic, so we add
# n(n+1)/2 to convert from U to the rank-sum W.
qranksum <- function(p, n, m) {
  return(qwilcox(p, n, m) + n * (n + 1) / 2)
}
We can now find the critical value \(T\) such that \(P[W \leq T]=p\) by using this new function as follows:
T <- qranksum(p, n, m)
For a two-sided test at level \(\alpha\) with sample sizes n and m, we would evaluate the above code at \(p=\frac{\alpha}{2}\) to find \(T_L\) and at \(p=1-\frac{\alpha}{2}\) for \(T_U\) (or we could apply the formula given above to find \(T_U\) from \(T_L\)).
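For instance, for a two-sided test at the 5% level with sample sizes n and m as before:

alpha <- 0.05
TL <- qranksum(alpha / 2, n, m)       # lower critical value
TU <- qranksum(1 - alpha / 2, n, m)   # upper critical value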
- Use the qranksum function defined above to find the exact critical values \(T_L\) and \(T_U\) for this problem.

When the samples are both large, the distribution of the rank-sum statistic is approximately Normal. For large \(n\) and \(m\), under the null hypothesis of no group differences we have: \[ E[W] = \frac{n(n+m+1)}{2}, \qquad \operatorname{Var}[W] = \frac{nm(n+m+1)}{12}. \]
So for large samples, we can use these values to standardise \(W\) and use standard Normal tables to construct confidence intervals and test hypotheses.
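A sketch of this large-sample calculation, assuming W, n, and m from earlier:

EW <- n * (n + m + 1) / 2        # mean of W under H0
VW <- n * m * (n + m + 1) / 12   # variance of W under H0
z <- (W - EW) / sqrt(VW)         # approximately N(0,1) for large n and m
2 * pnorm(-abs(z))               # approximate two-sided p-value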
One of the advantages of nonparametric tests is their robustness to outliers: surprisingly large or small values have little (if any) impact on the test or its conclusions. Conversely, parametric methods such as the \(t\)-test use summary statistics whose values can be substantially distorted by extreme values, with potentially serious consequences for the resulting inference.
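A quick illustration with made-up numbers:

x <- c(1, 2, 3, 4, 100)   # 100 is an extreme outlier
mean(x)                   # 22: the mean is pulled far from the bulk of the data
rank(x)                   # 1 2 3 4 5: the same whether the largest value is 5 or 100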
- Create a new vector A2 that reflects this updated information. How does this change affect the rank of this data point in the combined sample? What is the impact on the mean and variance of the group A sample?

If you finish the exercises above, have a look at some of these additional problems.