In this practical we use R to perform a Bayesian statistical analysis to learn about an underlying population quantity using a sample and a discrete prior distribution. We will also investigate using the conjugate Beta prior distribution, and the effects of a large sample on the posterior distribution.
seq
and rep
plot
, and adding curves with lines
col
), line type (lty
), etcplot
function draw curvesxlim
and ylim
, and adding a legend to the plot with legend
function
sapply
to evaluate a function on every element of a vectorThe key idea behind Bayesian statistics is that we use probability to describe anything that’s uncertain, in particular both data, \(X\), and parameters, \(\theta\), are represented as random variables. Most Bayesian statistical questions about \(\theta\) can be answered by finding the conditional density (or probability): \[ f(\theta ~|~ x) = \frac{f(x~|~\theta)~f(\theta)}{f(x)} \]
where \(f(\theta)\) is our prior distribution for that describes our uncertainty in \(\theta\) before we see any data, \(f(x~|~\theta)\) is the likelihood of seeing the data given the model or parameter value \(\theta\), \(f(x)\) is the probability of seeing the data, and \(f(\theta~|~x)\) is the posterior distribution of the parameter \(\theta\) given what we have seen and learned from the data. In words, \[ \text{Posterior} = \frac{\text{Likelihood}\times\text{Prior}}{\text{Data probability}} \]
No matter how large the data set or complicated the distributions, all Bayesian approaches to statistical problems reduce to a calculation of this form.
In this practical, we will consider the hypothetical problem of assessing the proportion of people who vote for their current MP in an election, compared to those who vote for another candidate. Suppose we take a sample of \(n\) independent (and truthful) voters, and count the number, \(X\), who say they will vote for the MP. In this practical, our parameter of interest is the proportion of people, \(\theta\), in a group who will vote for the current MP.
Suppose we can only afford to make a small sample, and that 6 people from a sample of 10 support for the current MP. Our goal is to obtain the posterior distribution for the probability of support parameter, \(\theta\), given the sample, and investigate the effects of different prior distributions.
Let’s begin with a simple discrete distribution for \(\theta\). Suppose the proportion of supporters, \(\theta\), can only take one of the discrete values \(\theta_i\), where \[\theta_1 =\frac{0}{20}=0, \theta_2=\frac{1}{20}, \dots, \theta_{21}=\frac{20}{20}=1.\] Our Bayesian calculations will evaluate the prior probability, likelihood, and posterior probability for each of the possibile values of the parameter \(\theta\).
n
and x
and assign them the values of \(n\) and \(x\) for this problem.thetavals
. (Hint: use the seq
function with the by
argument to create a sequence of numbers with the same increment)
To begin with, let us suppose our prior beliefs consider it more likely that a current MP is re-elected than otherwise. One way of representing this, is to assign the following probability for the possible values \(\theta_i\) where \(i=1,\dots,21\): \[ P[\theta=\theta_i] = \frac{i}{231} \]
priorA
which contains the above prior probabilities for \(\theta\).priorA
vertically against thetavals
using plot
, connecting the values with red lines and setting the range of the vertical axis to be \([0,0.16]\). Hint: see the Technique below.When plotting data with plot
, we can change how plot draws the data by passing a type
argument into the function. The type argument can take a number of different values to produce different types of plot
type="p"
- draws a standard scatterplot with a “p”oint for every (x,y) pairtype="l"
- connects adjacent (x,y) pairs with straight “l”ines, does not draw points. Note: l
is a lowercase L, not the number 1.type="b"
- draws “b”oth points and connecting line segmentstype="s"
- connects points with “s”teps, rather than straight linesWe can further customize the appearance of the plot by supplying further arguments to plot
:
col
sets the colour, e.g. col="red"
lwd
sets the line width, e.g. lwd=2
doubles the line width.lty
sets the line type, e.g. lty=2
draws a dashed line.xlim
and ylim
set the range of the \(x\) and \(y\) axes respectively, e.g. xlim=c(0,10)
If \(X\) is the number of people who support the current MP in a sample of size \(n\) then \[ X \sim \text{Bin}(n,\theta) \] where \(\theta\) is the probability of a voter voting for the MP, and \[ P[X=x~|~\theta]~=~\binom{n}{x} \theta^x (1-\theta)^{n-x} \] This is our likelihood for \(x\) supporters from a sample of size \(n\).
Recall from last term, we can create our own functions using the function
syntax.
A function needs to have a name, usually at least one argument (although it doesn’t have to), a body of code that performs some computation, and usually returns an object or value. For example, a simple function that takes one argument \(x\) and computes and returns \(x^2\) is:
squareit <- function(x) {
sq <- x^2 # square x
return(sq) # return value
}
Note that in R, you need to enclose the blocks of your code (a function body, the content of loops and if statements, etc) in curly braces {
and }
, whereas in Python you use
We can use sapply
to apply a given function to every element of a vector. For example, we can compute the squares of the first 10 integers using the above squareit
function by evaluating the command
sapply(1:10, squareit)
If you’re familiar with Python’s list comprehensions, you’ll find that sapply
performs a similar role in R
.
like
with one argument, theta
, that computes and returns the binomial likelihood for a given value of \(\theta\) using the values of \(x\) and \(n\) for the data in our example. Hint: You can use the function choose
to calculate \(\binom{a}{b}\).sapply
to evaluate your likelihood function for all the possible values thetavals
, and save this as likeTheta
.likeTheta
so they sum to 1 (as a valid probability distribution should). Replace likeTheta
by these normalised values.likeTheta
) to your plot using thick black lines.
pData
. Hint: use the sum
function.postThetaA
. \[
P[\theta~|~x] = \frac{P[X~|~\theta] P[\theta]}{P[X]}
\]
The simple prior for \(\theta\) has been transformed into something more interesting after updating by the data. It’s no longer a simple straight line, but has a similar shape to the likelihood albeit shifted slightly to the right. The posterior is a compromise between the prior information and the data information in the likelihood.
Suppose now we consider two alternative prior distributions for \(\theta\):
priorB
and priorC
which contains the prior probabilities for \(\theta\) under both of these cases.lines
function to add priorB
and priorC
to your plot using blue and green lines.priorB
and priorC
to look like? How will they differ to the posterior obtained above?
Now, let’s compute the posteriors under these new priors:
priorB
and priorC
as the prior distribution of \(\theta\). You will need to re-calculate \(P[X]\) and \(P[\theta~|~X]\) in both cases. Save the new posterior distributions as postThetaB
and postThetaC
.
The legend
function adds a legend to a plot. The syntax is legend(location, legend, ...)
, and the required arguments are:
location
: Where in the plot to draw the legend. The easier way to do this is using a keyword: e.g. "topleft"
, "top"
, "topright"
, "right"
, etc.legend
: A vector of strings with the labels for each item in the legend.We can then draw a legend with labels to explain the different lines, points, and colours we are using in the plot, as appropriate. For example, the command
legend("topright",legend=c('Red Squares','Blue Circles'), pch=c(15,16),col=c('red','blue'))
will draw a legend on an existing plot on the top right of the plot with red squares appearing next to the text ‘Red Squares’ and blue circles next to the text ‘Blue Squares’.
postThetaB
and postThetaC
to your plot as blue and green dashed lines respectively.legend
to your plot to label the likelihood and the three different cases of prior and posterior. You will need to set lty=1
to show lines in the legend, and use the col
argument to specify the colours of the lines.