In this practical we use R to investigate the conjugate Bayesian analysis for Poisson data. We will also investigate using the effects of the prior distribution on the posterior, and the situation where our sample size grows large.
seq
plot
, and with lines
col
), line type (lty
), axis range (xlim
and ylim
) etcabline
function
sapply
to evaluate a function on every element of a vectordgamma
and dnorm
qgamma
, to produce exact credible intervals using standard distributionsOur dataset concerns the number of volcanoes that erupted each year from 2000 to 2018, taken from here. The data are given in the table below, where each volcanic eruption is counted in the year that the eruption started. Volcanic eruptions are split into groups depending on the severity of the eruption, as measured on the logarithmically-scaled Volcanic Explosivity Index (VEI). For example, the 2010 eruption of Eyjafjallajökull in Iceland had a VEI of 4, whereas the eruption of Krakatoa in 1883 was VEI6.
The eruptions are classified as Small
(VEI\(\leq 2\)), Medium
(VEI \(=3\)), and Large
(VEI \(\geq 4\)).
Year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
Small | 30 | 25 | 30 | 32 | 29 | 29 | 34 | 29 | 25 | 28 |
Medium | 6 | 6 | 5 | 5 | 8 | 4 | 3 | 3 | 7 | 2 |
Large | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 3 | 2 | 1 |
Year | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
Small | 39 | 30 | 40 | 40 | 41 | 26 | 37 | 30 | 34 |
Medium | 6 | 6 | 4 | 7 | 7 | 5 | 4 | 6 | 4 |
Large | 3 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 1 |
R
code below to input the data, and create a new data frame called volcano
, with columns Year
, Small
, Medium
, and Large
.plot
the number of Small
eruptions against the year using a vertical axis range (ylim
) of \([0,50]\). Add lines
of different colours to the same plot for the other two groups of eruptions. Do you see any apparent relationship between time and number of eruptions?Medium
eruptions - does it look like it could follow a Poisson distribution? How else could you quickly assess this?
volcano <- data.frame(Year = 2000:2018,
Small = c(30, 25, 30, 32, 29, 29, 34, 29, 25, 28, 39, 30,
40, 40, 41, 26, 37, 30, 34),
Medium = c(6, 6, 5, 5, 8, 4, 3, 3, 7, 2, 6, 6, 4, 7, 7,
5, 4, 6, 4),
Large = c(0, 0, 0, 2, 1, 1, 1, 3, 2, 1, 3, 0, 1, 0, 1,
0, 2, 0, 1))
A Poisson distribution is often used to model the counts of random events occurring within a fixed period of time at some rate \(\lambda\). For such a random variable \(X\) where \(X\sim\text{Po}(\lambda)\) , a Bayesian analysis will require a prior distribution for the unknown Poisson parameter \(\lambda\). We have seen/will see (Lecture 16) that the Gamma distribution provides a conjugate prior for this particular problem. Therefore, if our data \(X_1,\dots,X_n\) are \(\text{Po}(\lambda)\), and our prior distribution for \(\lambda\) is \(\text{Ga}(a,b)\). Then the posterior distribution for \(\lambda ~|~ X_1,\dots,X_n\) is given by: \[ \lambda ~|~ X_1,\dots,X_n \sim \text{Ga}(a+T, b+n), \] where \(T=\sum_{i=1}^n X_i= n\overline{X}\).
First, let’s use the Gamma distribution to identify an appropriate prior for our Poisson count data.
R
provides built-in functions to evaluate the PDF of standard distributions. The density function for a Gamma distribution \(\text{Ga}(a,b)\) is used as follows
dgamma(x, shape=a, rate=b)
where x
is the point (or vector of points) where we want to evaluate the PDF, and a
and b
are the usual Gamma parameter values.
seq
uence of 500 equally-space values over the interval \([0,10]\). Call this lambdavals
.lambdavals
using the gamma PDF dgamma
. Make a plot
of the pdf against \(\lambda\) as a solid black line.
An expert vulcanologist believes that the unknown rate of Medium
eruptions, \(\lambda\), is such that a range of \([1,6]\) would be plausible.
lambdavals
, and save it as cnjPrior
.cnjPrior
so that its values sum to 1.abline
using the v
argument) at the location of the prior expectation for \(\lambda\).
In addition to built-in functions for PDFs of standard distributions, R
also provides the quantile function for a pdf. The quantile function evaluates the inverse of the cumulative distribution function, \(F_X^{-1}(u)\). Given a probability value \(u\in[0,1]\), the quantile function returns the value \(x\) of \(X\) for which \(P[X\leq x]=u\), and so the quantile functions are particularly useful for finding critical values of distributions and for finding exact credible intervals.
The quantile function for a Gamma density \(X\sim \text{Ga}(\alpha,\beta)\) is used as follows
qgamma(alpha, shape=a, rate=b)
where alpha
is the lower tail probability (or vector of lower-tail probabilities) for which we want the corresponding value(s) of \(X\), and a
and b
are the usual Gamma parameter values.
qgamma
function to find a 95% equal-tailed prior credible interval for \(\lambda\) using the expert’s prior distribution above. Hint: find the values of \(\lambda\) with lower and upper tail probabilities of \(2.5\%\).The next ingredient in the Bayesian calculation requires us to capture the information contained in the data via the likelihood. In a Bayesian context, the likelihood is the conditional distribution of the data \(X_1,\dots, X_n\) given the parameter \(\lambda.\)
Let’s compute the likelihood given our data on Medium
volcanic eruptions, and add it to the plot.
poisLike
hat computes the Poisson likelihood for the sample of Medium
volcano eruptions. Your function should:
lambda
volcano$Medium
given the value of lambda
. Hint: the dpois
function evaluates Poisson probabilities for a vector of values x
and parameter lambda
.return
it.sapply
to evaluate the Poisson likelihood (poisLike
) for each of the values of \(\lambda\) in lambdavals
. Save this as like
.like
so that its values sum to \(1\).As we’re using the conjugate Gamma prior with a Poisson likelihood, then our posterior distribution for \(\lambda\) will also be a Gamma with parameter values as given in Section 2.1. Let’s verify that this is the case, and add the resulting density function to our plot.
postDirect
. Normalise the postDirect
so that it sums to \(1\). Hint: \(\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\).aPost
and bPost
.dgamma
with the posterior parameter values to evaluate the posterior Gamma density at each of lambdavals
. Save these values to postConjugate
and normalise to sum to \(1\).purple
curve.To investigate how sensitive our results are to our choices for \(a\) and \(b\) in the prior Gamma distribution, we’re going to want to repeat our previous calculations for different values of the prior parameters. To make this easier, let’s wrap those calculations and plots in a new function that finds and draws the Gamma posterior for a given prior and data set:
gammaPoisson
which:
a
and b
, and the summary statistics T
and n
.lambdavals
.a
,b
,T
, and n
we used for the Medium
volcano data above.
a
and b
.a
and b
? How does this affect your results?
Theory in Lecture 18 will show that as the sample size grows large, the posterior distribution tends towards a Normal distribution and the posterior distribution becomes progressively less affected by the choice of prior distribution.
Let’s expand our data with the records of volcanic eruptions for the entire 20th century. As records are slightly incomplete, only the counts of Large
eruptions over this period can be considered trustworthy. For the entire 100 years, a total of 64 Large
(VEI\(\geq 4\)) eruptions were recorded
Large
volcanic eruptions from 1900-2018. Save these as Tbig
and Nbig
.The large sample of data is clearly very informative for \(\lambda\), resulting in a posterior distribution that is concentrated over a small range of possible \(\lambda\) values. Let’s refocus our plot on that sub-interval:
xlim
argument to plot
(or add it as an argument to your function).From theory seen in Lecture 18, if we suppose that \(X=(X_1,\dots,X_n)\) are an i.i.d. sample of size \(n\) from a distribution with pdf \(f(x~|~\theta)\) where \(X_i \perp X_j ~|~ \theta\) and \(f(x~|~\theta)\) is twice differentiable, then for \(n\) ‘large enough’ the posterior distribution for \(\theta ~|~ x\) is approximately \[ f(\theta ~|~ x) \approx \text{N}\left(\widehat{\theta}, \frac{1}{n\text{I}\left(\widehat{\theta}\right)}\right), \] where \(\widehat{\theta}\) is the MLE of the parameter \(\theta\), and \(\text{I}(\widehat{\theta})\) is the observed Fisher information for a sample of size 1.
For a Poisson distribution, we have:
Large
volcano eruption data.dnorm
, evaluate the approximating pdf for \(\lambda\) over lambdavals
. Normalise it, and add it to your plot using a thick dashed line.