When a vector represents numerical data, a number of standard functions are useful for statistical calculations:
- `sum` sums all values in the vector.
- `mean` computes the sample mean, i.e. \(\frac{1}{n} \sum_{i=1}^n x_i\).
- `median` computes the sample median.
- `min` and `max` compute the sample minimum and maximum.
- `range` returns the sample minimum and maximum as a vector of length two; the difference between them (the sample range) is given by `diff(range(x))`.
- `var` and `sd` compute the sample variance, \(s^2=\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2\), and the sample standard deviation, \(s\).
- `quantile` computes the min, max, median, and lower and upper quartiles. Other quantiles can be computed using the `probs` argument.
- `summary` calculates the min, max, mean, median, and lower and upper quartiles.

For illustration, consider these 54 measurements of leaf biomass:
leafbiomass <- c(0.430, 0.400, 0.450, 0.820, 0.520, 1.320, 0.900, 1.180, 0.480, 0.210,
0.270, 0.310, 0.650, 0.180, 0.520, 0.300, 0.580, 0.480, 0.580, 0.580,
0.410, 0.480, 1.760, 1.210, 1.180, 0.830, 1.220, 0.770, 1.020, 0.130,
0.680, 0.610, 0.700, 0.820, 0.760, 0.770, 1.690, 1.480, 0.740, 1.240,
1.120, 0.750, 0.390, 0.870, 0.410, 0.560, 0.550, 0.670, 1.260, 0.965,
0.840, 0.970, 1.070, 1.220)
mean(leafbiomass) ## compute the sample mean
## [1] 0.7649074
We can check that the `mean` function is working by using `sum` and `length` to calculate \(\frac{1}{n} \sum_{i=1}^n x_i\) directly:
sum(leafbiomass)/length(leafbiomass)
## [1] 0.7649074
Similarly, for the standard deviation:
sd(leafbiomass)
## [1] 0.3780717
sqrt(sum((leafbiomass-mean(leafbiomass))^2)/(length(leafbiomass)-1))
## [1] 0.3780717
The other functions are fairly straightforward:
min(leafbiomass)
## [1] 0.13
median(leafbiomass)
## [1] 0.72
quantile(leafbiomass)
## 0% 25% 50% 75% 100%
## 0.1300 0.4800 0.7200 1.0075 1.7600
summary(leafbiomass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1300 0.4800 0.7200 0.7649 1.0075 1.7600
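The functions not demonstrated above can be tried in the same way. The following is a minimal sketch that repeats the leaf biomass data so it runs on its own; the choice of the 10th and 90th percentiles for `probs` is only an illustration.

```r
# Leaf biomass data from above, repeated so this block is self-contained
leafbiomass <- c(0.430, 0.400, 0.450, 0.820, 0.520, 1.320, 0.900, 1.180, 0.480, 0.210,
                 0.270, 0.310, 0.650, 0.180, 0.520, 0.300, 0.580, 0.480, 0.580, 0.580,
                 0.410, 0.480, 1.760, 1.210, 1.180, 0.830, 1.220, 0.770, 1.020, 0.130,
                 0.680, 0.610, 0.700, 0.820, 0.760, 0.770, 1.690, 1.480, 0.740, 1.240,
                 1.120, 0.750, 0.390, 0.870, 0.410, 0.560, 0.550, 0.670, 1.260, 0.965,
                 0.840, 0.970, 1.070, 1.220)

max(leafbiomass)           # sample maximum: 1.76
range(leafbiomass)         # minimum and maximum together: 0.13 1.76
diff(range(leafbiomass))   # width of the range, max - min: 1.63

var(leafbiomass)           # sample variance
sd(leafbiomass)^2          # squaring the standard deviation gives the same value

# Quantiles other than the quartiles, via the probs argument
# (10% and 90% are illustrative choices)
quantile(leafbiomass, probs = c(0.1, 0.9))
```

Note that `range` returns two values, so `diff(range(x))` is needed when a single number for the spread is wanted.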
R Help: sum, mean, median, min, max, range, sd, var, quantile, summary
Unsurprisingly, R provides a range of functions to support calculations with standard probability distributions. A large number of probability distributions are available, but we will only need a few. To see which distributions are available, run the command `help(Distributions)`.
For every distribution there are four functions. The functions for each distribution begin with a particular letter to indicate the functionality:
Letter | Function |
---|---|
“d” | evaluates the probability density (or mass) function, \(f(x)\) |
“p” | evaluates the cumulative distribution function, \(F(x)=P[X \le x]\), i.e. the probability that the specified random variable is less than or equal to the given argument. |
“q” | evaluates the inverse cumulative distribution function (quantile function), \(F^{-1}(q)\), i.e. the value \(x\) such that \(P[X \le x] = q\). Used to obtain critical values associated with particular probabilities \(q\). |
“r” | generates random numbers |
The appropriate functions for common distributions are given below, along with their parameter arguments.
- Normal: `dnorm`, `pnorm`, `qnorm`, `rnorm`. Parameters: `mean` (\(\mu\)) and `sd` (\(\sigma\)).
- Student's \(t\): `dt`, `pt`, `qt`, `rt`. Parameter: `df`.
- Chi-squared: `dchisq`, `pchisq`, `qchisq`, `rchisq`. Parameter: `df`.
- Binomial: `dbinom`, `pbinom`, `qbinom`, `rbinom`. Parameters: `size` (\(n\)) and `prob` (\(p\)).
- Poisson: `dpois`, `ppois`, `qpois`, `rpois`. Parameter: `lambda` (\(\lambda\)).
- Uniform: `dunif`, `punif`, `qunif`, `runif`. Parameters: `min` and `max`.
- Beta: `dbeta`, `pbeta`, `qbeta`, `rbeta`. Parameters: `shape1` (\(a\)) and `shape2` (\(b\)).
- Gamma: `dgamma`, `pgamma`, `qgamma`, `rgamma`. Parameters: `shape` (\(\alpha\)) and `rate` (\(\beta\)).

R also has distribution functions for the test statistics of the rank sum test (`qwilcox` etc.) and the signed rank test (`qsignrank`). See Practical 6 for more information on how to use these.
We illustrate the four types of functions below in the context of the Normal distribution, but you can substitute any of the distributions and functions listed above for the Normal (remembering to change the parameters).
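As a rough sketch of how that substitution works, the block below applies the four function types to a binomial distribution. The values `n = 10` and `p = 0.5` are assumptions chosen purely for illustration.

```r
# Illustrative parameters (assumptions for this sketch):
# 10 trials, each with success probability 0.5
n <- 10
p <- 0.5

dbinom(3, size = n, prob = p)    # "d": probability mass, P[X = 3] = 0.1171875
pbinom(3, size = n, prob = p)    # "p": cumulative probability, P[X <= 3] = 0.171875
qbinom(0.5, size = n, prob = p)  # "q": smallest x with P[X <= x] >= 0.5, here 5
rbinom(5, size = n, prob = p)    # "r": five random draws (values differ between runs)
```

Only the function suffix and the parameter names change; the "d", "p", "q", "r" prefixes keep the same meaning.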
R Help: Available distributions
For example, let's look at the functions for the Normal distribution.
The first function we look at is the density function, `dnorm`. Given a set of values, it returns the value of the Normal pdf at each point. If you give only the points, it assumes a mean of zero and a standard deviation of one, i.e. the standard Normal pdf \(\phi(z)\). To use different values for the mean and standard deviation, specify them in the optional `mean` and `sd` arguments:
dnorm(0)
## [1] 0.3989423
dnorm(-3:3)
## [1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 0.053990967
## [7] 0.004431848
dnorm(20, mean=10, sd=5)
## [1] 0.01079819
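In the same spirit as the earlier check of `mean`, the value returned by `dnorm` can be compared against the Normal pdf formula \(f(x)=\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\). This is a small sketch; the variable names `x`, `mu` and `sigma` are introduced here only for the example.

```r
# Values matching the dnorm(20, mean=10, sd=5) call above
x     <- 20
mu    <- 10
sigma <- 5

# Built-in density
dnorm(x, mean = mu, sd = sigma)

# The same quantity computed directly from the Normal pdf formula
exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
```

Both lines should print the same value, 0.01079819.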
The second type of function is `pnorm`, which returns the cumulative distribution function of a Normal distribution. Given a number or a vector of numbers, it computes the probability that a normally distributed random number will be less than or equal to each value. It accepts the same options as `dnorm` and defaults to the standard Normal, i.e. \(\Phi(z)\):
pnorm(0) ## should be 0.5
## [1] 0.5
pnorm(1.96) ## should be ~0.975
## [1] 0.9750021
pnorm(20, mean=10, sd=5)
## [1] 0.9772499
`pnorm` (and all the “p” functions) is particularly useful when computing \(p\)-values in significance tests. To find the probability that a number is larger than the given number, i.e. \(1-F(x)\) rather than \(F(x)\), set the `lower.tail` option to `FALSE`:
pnorm(0,lower.tail=FALSE)
## [1] 0.5
pnorm(1,lower.tail=FALSE)
## [1] 0.1586553
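To make the link to \(p\)-values concrete, here is a minimal sketch of a two-sided \(p\)-value for a standard Normal test statistic. The observed value `z = 1.96` is a hypothetical number picked for the example, not a result from any data in this practical.

```r
# Hypothetical observed test statistic (an assumption for this sketch)
z <- 1.96

# Upper-tail probability, computed two equivalent ways
pnorm(z, lower.tail = FALSE)
1 - pnorm(z)

# Two-sided p-value: probability of a value at least as extreme in either tail
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_value   # approximately 0.05
```

Using `abs(z)` means the same line works whether the observed statistic is positive or negative.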
The next type of function is `qnorm`, which is the inverse of `pnorm`, so `qnorm` is \(F^{-1}(q)\). You give it a probability value \(q\), and it returns the number \(x\) such that \(F(x) = P[X \le x] = q\). This is particularly useful for finding the critical values of a distribution associated with a particular significance level:
qnorm(0.975) ## should be about 1.96
## [1] 1.959964
qnorm(0.5) ## should be 0
## [1] 0
qnorm(0.25,mean=2,sd=2)
## [1] 0.6510205
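The inverse relationship between the “q” and “p” functions can be checked directly, and the same idea gives critical values for other distributions. The sketch below assumes a two-sided test at the 5% level; the choice of `df = 10` for the \(t\) distribution is purely illustrative.

```r
# Applying pnorm to the output of qnorm recovers the original probability
q <- 0.975
pnorm(qnorm(q))            # 0.975

# Lower and upper critical values for a two-sided 5% test (standard Normal)
qnorm(c(0.025, 0.975))

# Upper critical value from a t distribution
# (df = 10 is an illustrative assumption)
qt(0.975, df = 10)
```

The \(t\) critical value is larger than the Normal one because the \(t\) distribution has heavier tails.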
The last type of function we examine is `rnorm`, which generates normally distributed random numbers. Its first argument is the number of random numbers to generate, and it has optional arguments to specify the mean and standard deviation:
rnorm(4)
## [1] 0.2167549 -0.5424926 0.8911446 0.5959806
rnorm(4,mean=3,sd=10)
## [1] 19.3561800 9.8927544 -9.8124663 0.8685548
mean(rnorm(1000,mean=3))
## [1] 2.96563
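Because these numbers are random, the output shown above will differ each time the commands are run. If reproducible results are needed, one option is to fix the random number generator's seed with `set.seed` first. The seed value `42` below is arbitrary and chosen only for this sketch.

```r
# Fix the seed so the random draws can be reproduced
# (42 is an arbitrary value chosen for this sketch)
set.seed(42)
a <- rnorm(4)

# Resetting the same seed produces exactly the same draws
set.seed(42)
b <- rnorm(4)

identical(a, b)   # TRUE
```

Any integer can be used as the seed; what matters is using the same one each time the analysis is rerun.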