We have seen in lectures that the MLEs for the parameters of several common distributions can be found in closed form. In general, however, there is no guarantee of such a simple closed-form result, and in those situations we must find and maximise the likelihood numerically.

For this practical, we will focus on the Poisson distribution, which is particularly appropriate for counts of events occurring randomly in time or space. Suppose we have a simple problem with one piece of data, \(X\), and we know that \(X\sim\text{Po}(\lambda)\) for some unknown parameter value \(\lambda\).

  • You will need the following skills from previous practicals:
    • Basic R skills with arithmetic, functions, etc.
    • Manipulating and creating vectors: c, seq
    • Calculating data summaries: mean, sum, length
    • Plotting with plot and hist, and adding lines to plots with lines and abline
    • Repeatedly applying a function to a vector using sapply
  • New R techniques:
    • Optimising a function with optim
    • Extracting named elements of a list using the $ operator

1 The Poisson Likelihood

Our likelihood calculations begin with writing down the likelihood function of our data. For a single observation \(X=x\) from a Poisson distribution, the likelihood is the probability of observing \(X=x\) written as a function of the parameter \(\lambda\): \[ \ell(\lambda) = \mathbb{P}[X=x~|~\lambda] = \frac{e^{-\lambda}\lambda^x}{x!}. \]

To begin with, suppose we observe a count of \(X=5\).

  • Write a simple function called poisson that takes a single argument lambda and evaluates the Poisson probability \(\mathbb{P}[X=5~|~\lambda]\). Hint: You will need the exp and factorial functions.
  • Evaluate the Poisson probability for \(X=5\) when \(\lambda=7\); you should get a value of 0.1277167.
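One possible version of this function is sketched below (the observation \(x=5\) is hard-coded, as in the worksheet):

```r
# Poisson likelihood for a single observation X = 5,
# viewed as a function of the unknown rate lambda
poisson <- function(lambda) {
  exp(-lambda) * lambda^5 / factorial(5)
}

poisson(7)  # approximately 0.1277167
```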

What we have just created is the likelihood function, \(\ell(\lambda)\), for the Poisson parameter \(\lambda\) given a single observation of \(X=5\).

Before we consider a larger data problem, let’s briefly explore this simplest case of a Poisson likelihood.

  • Create a sequence of length 200 from 0.1 to 20. Save this as lambdavals.
  • Use sapply to evaluate the Poisson likelihood function for every value of \(\lambda\) in lambdavals. Save this as like5.
  • Draw a plot of the Poisson likelihood as a function of lambda. Use ty='l' to draw a line graph.
  • By eye, what is the value \(\hat{\lambda}\) of \(\lambda\) which maximises the likelihood?
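The three steps above might be put together as follows (a sketch; poisson is the likelihood function written earlier):

```r
# Likelihood of a single observation X = 5, as before
poisson <- function(lambda) exp(-lambda) * lambda^5 / factorial(5)

# Grid of 200 candidate values for lambda
lambdavals <- seq(0.1, 20, length.out = 200)

# Evaluate the likelihood at each candidate value
like5 <- sapply(lambdavals, poisson)

# Line plot of the likelihood against lambda
plot(lambdavals, like5, ty = 'l',
     xlab = expression(lambda), ylab = 'Likelihood')
```

The curve should peak at \(\widehat{\lambda}=5\), matching the observed count.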

In practice, we usually work with the log-likelihood \(\mathcal{L}(\lambda)=\log\ell(\lambda)\), rather than the likelihood itself. This often makes the mathematical maximisation simpler and the numerical computations more stable, and it is also intrinsically connected to the idea of information (which we will consider later).

  • Modify your poisson function to compute the logarithm of the Poisson probability, and call this logPoisson. You will need to use the log function.
  • Evaluate the log likelihood function for all \(\lambda\) in lambdavals and plot the log-likelihood against \(\lambda\). Do you find the same maximum value \(\widehat{\lambda}\)?
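One way to write the modified function is to expand the log of the Poisson probability by hand (a sketch; this algebraic form is a choice, and log(poisson(lambda)) would also work):

```r
# Log-likelihood for a single observation X = 5:
# log P[X = 5 | lambda] = -lambda + 5*log(lambda) - log(5!)
logPoisson <- function(lambda) {
  -lambda + 5 * log(lambda) - log(factorial(5))
}

lambdavals <- seq(0.1, 20, length.out = 200)
loglike5 <- sapply(lambdavals, logPoisson)
plot(lambdavals, loglike5, ty = 'l',
     xlab = expression(lambda), ylab = 'Log-likelihood')
```

Because log is strictly increasing, the log-likelihood is maximised at the same \(\widehat{\lambda}\) as the likelihood.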

2 Likelihood methods for real data

The Poisson distribution has been used by traffic engineers as a model for light traffic. This assumption is based on the rationale that if the rate is approximately constant and the traffic is light (so the individual cars move independently of each other), then the distribution of counts of cars in a given interval of time or region of space should be nearly Poisson (Gerlough and Schuhl 1955).

The following table shows the number of right turns during 300 3-minute intervals at a specific junction.

N          0   1   2   3   4   5   6   7   8   9  10  11  12
Frequency  14  30  36  68  43  43  30  14  10   6   4   1   1

traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
  • Evaluate the piece of code above to create the vector traffic corresponding to the above data set.
  • Plot a histogram of the data. Does an assumption of a Poisson distribution appear reasonable?
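For example (the breaks argument below is an assumption, chosen so that each bar covers exactly one count):

```r
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))

# One bar per count, centred on the integers 0..12
hist(traffic, breaks = seq(-0.5, 12.5, by = 1),
     main = 'Right turns per 3-minute interval', xlab = 'Count')
```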

If we suppose a Poisson distribution might be a good choice of distribution for this dataset, we still need to find out which particular Poisson distribution is best by estimating the parameter \(\lambda\).

Let’s begin by constructing the likelihood and log-likelihood functions for this data set of 300 observations. Assuming that the observations \(x_1, \dots, x_{300}\) can be treated as 300 i.i.d. values from \(\text{Po}(\lambda)\), the log-likelihood function for the entire sample is: \[ \mathcal{L}(\lambda) = -n\lambda +\left(\sum_{i=1}^n x_i \right) \log \lambda, \] up to an additive constant from the factorial terms (which we can ignore).

  • Write a new function logLike with a single argument lambda, which evaluates the log-likelihood expression above for the traffic data and the supplied value of lambda. Hint: Use length and sum to compute the summary statistics.
  • Evaluate the data log-likelihood for all \(\lambda\) in lambdavals and plot the log-likelihood against \(\lambda\). Without further calculations, use the plot to have a guess at the maximum likelihood estimate \(\widehat{\lambda}\) – we’ll check this later!
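A sketch of one possible implementation, precomputing the summary statistics once outside the function:

```r
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
n <- length(traffic)  # number of observations
s <- sum(traffic)     # total number of right turns observed

# Log-likelihood for the full sample, up to an additive constant
logLike <- function(lambda) -n * lambda + s * log(lambda)

lambdavals <- seq(0.1, 20, length.out = 200)
plot(lambdavals, sapply(lambdavals, logLike), ty = 'l',
     xlab = expression(lambda), ylab = 'Log-likelihood')
```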

2.1 Maximising the likelihood

The maximum likelihood estimate (MLE) \(\widehat{\lambda}\) for the \(\lambda\) parameter of the Poisson distribution is the value of \(\lambda\) which maximises the likelihood \[ \widehat{\lambda}=\operatorname{argmax}_{\lambda\in\Omega} \ell(\lambda)= \operatorname{argmax}_{\lambda\in\Omega} \mathcal{L}(\lambda). \]

Therefore, to find the MLE, we must maximise \(\ell\) or \(\mathcal{L}\). We know we can do this analytically for certain common distributions, but in general we’ll have to optimise the function numerically instead. For such problems, R provides various functions to perform a numerical optimisation of a given function (even one we define ourselves). The function we will focus on is optim:

R offers several optimisation functions; the one we shall use today, optim, is among the most general. This generality comes at the price of a large number of optional parameters that need to be specified.

For our purposes, we’re doing a 1-dimensional optimisation of a given function named logLike. So the syntax we will use to maximise this function is as follows:

optim(start, logLike, lower=lwr, upper=upr, method='Brent', control=list(fnscale=-1))

The arguments are:

  • start : an initial value for \(\lambda\) at which to start the optimisation. Replace this with a sensible value.
  • logLike : this is the function we’re optimising. In general, we can replace this with any other function.
  • lower=lwr, upper=upr : lower and upper bounds to our search for \(\widehat{\lambda}\). Replace lwr and upr with appropriate numerical values; an obvious choice for this problem is the min and max of our data.
  • method='Brent' : this selects an appropriate optimisation algorithm for a 1-D optimisation.
  • control=list(fnscale=-1) : this looks rather strange, but is simply telling optim to maximise the given function, rather than minimise (which is the default).

optim returns a number of values in the form of a list. The relevant components are:

  • par : the optimal value of the parameter
  • value : the optimum of the function being optimised.

  • Use optim to maximise the logLike function. You should choose an appropriate initial value for \(\lambda\), as well as sensible upper and lower bounds.
  • What value of \(\widehat{\lambda}\) do you obtain? How does this compare to your guess at the MLE?
  • We know from lectures that \(\widehat{\lambda}=\bar{x}\) – evaluate the sample mean of the traffic data. Does this agree with your results from directly maximising the log-likelihood?
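The whole fit might look like the sketch below (the starting value 3 and the bounds 0 and 12 are one reasonable choice, not the only one):

```r
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
n <- length(traffic); s <- sum(traffic)
logLike <- function(lambda) -n * lambda + s * log(lambda)

# Maximise (fnscale = -1) over 0 <= lambda <= 12 using Brent's method
results <- optim(3, logLike, lower = 0, upper = 12, method = 'Brent',
                 control = list(fnscale = -1))

results$par     # the MLE; should agree with the sample mean
mean(traffic)
```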

The output of optim is in the form of a list. Unlike vectors, lists can combine many different variables together (numbers, strings, matrices, etc) and each element of a list can be given a name to make it easier to access.

For example, we can create our own lists using the list function:

test <- list(Name='Donald', IQ=-10, Vegetable='Potato')

Once created, we can access the named elements of the list using the $ operator. For example,

test$IQ

will return the value saved to IQ in the list, i.e. -10. RStudio will show auto-complete suggestions for the elements of the list after you press $, which can help make life easier.

We can use this to extract the MLE value from optim by first saving the results to a variable, called results say, and then extracting the list element named par.

  • Extract the optimal value of \(\lambda\) from the output of optim and save it to a new variable called mle.
  • Re-draw your plot of the log-likelihood function, and use abline to:
    • Add a red solid vertical line at \(\lambda=\widehat{\lambda}\)
    • Add a blue dashed vertical line at \(\lambda=\bar{x}\).
  • Do the results agree with each other, and with your guess at \(\widehat{\lambda}\)?
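A sketch of the extraction and the annotated plot, repeating the earlier optimisation so the code is self-contained:

```r
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
n <- length(traffic); s <- sum(traffic)
logLike <- function(lambda) -n * lambda + s * log(lambda)
results <- optim(3, logLike, lower = 0, upper = 12, method = 'Brent',
                 control = list(fnscale = -1))

# Extract the MLE from the named list returned by optim
mle <- results$par

lambdavals <- seq(0.1, 20, length.out = 200)
plot(lambdavals, sapply(lambdavals, logLike), ty = 'l',
     xlab = expression(lambda), ylab = 'Log-likelihood')
abline(v = mle, col = 'red')                      # MLE
abline(v = mean(traffic), col = 'blue', lty = 2)  # sample mean
```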

Let’s return to the original distribution of the data. The results of our log-likelihood analysis suggest that a Poisson distribution with parameter \(\widehat{\lambda}\) is the best choice of Poisson distributions for this data set.

  • Re-draw the histogram of the traffic data.
  • For \(x=0,1,\dots,16\) evaluate the Poisson probability \(\mathbb{P}[X=x]\) when \(X\sim\text{Po}(\widehat{\lambda})\). Save these probabilities as probs.
  • Multiply probs by the sample size to obtain the expected Poisson counts under this \(\text{Po}(\widehat{\lambda})\) distribution.
  • Use the lines function to overlay the expected \(\text{Po}(\widehat{\lambda})\) counts as a curve on top of your histogram. Does this look like a good fit to the data?
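One way to do this uses R's built-in Poisson pmf dpois, which is equivalent to the exp/factorial formula written earlier (the histogram breaks below are again an assumption, chosen so each bar covers one count and the counts are directly comparable to the expected values):

```r
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
mle <- mean(traffic)  # the MLE, as found above

# Fitted Poisson probabilities and expected counts for x = 0, ..., 16
probs <- dpois(0:16, mle)
expected <- length(traffic) * probs

hist(traffic, breaks = seq(-0.5, 16.5, by = 1),
     main = 'Observed counts vs fitted Poisson', xlab = 'Count')
lines(0:16, expected, col = 'red')
```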

We will return to the question of fitting distributions to data and assessing “goodness of fit” next term.

2.2 Information and inference

For an iid sample \(X_1,\dots,X_n\) from a sufficiently well-behaved density function \(f(x_i~|~\theta)\) with unknown parameter \(\theta\), the sampling distribution of the MLE \(\widehat{\theta}\) is such that \[ \widehat{\theta} \rightsquigarrow \text{N}\left(\theta, \frac{1}{\mathcal{I}_n(\theta)}\right), \] as \(n \rightarrow\infty\). Whilst we may be used to finding Normal distributions in unusual places, this result is noteworthy as we never assumed any particular distribution for our data here. It therefore provides us with a general method for making inferences about parameters of distributions and their MLEs for problems involving large samples!

In particular, we can apply this to our Poisson data problem above to construct a large-sample approximate \(100(1-\alpha)\%\) confidence interval for \(\lambda\) in the form: \[ \widehat{\lambda} \pm z^*_{\alpha/2} \frac{1}{\sqrt{-\mathcal{L}''(\widehat{\lambda})}}, \] where \(z^*_{\alpha/2}\) is the \(\alpha/2\) critical value of a standard Normal distribution and we have estimated the expected Fisher information \(\mathcal{I}_n(\theta)\) by the observed information \(I(\widehat{\lambda})=-\mathcal{L}''(\widehat{\lambda})\).

  • We can request that R computes the second derivative at the maximum, \(\mathcal{L}''(\widehat{\lambda})\), as part of the maximisation process. To do this, re-run your optimisation, but add the additional argument hessian=TRUE.
  • The output from optim will now contain an additional element named hessian. Extract this element, multiply by -1, and save it as obsInfo.
  • Irritatingly, obsInfo will be in the form of a \(1\times 1\) matrix rather than a scalar (as the Hessian matrix is a matrix of the second order partial derivatives). Run the following code to flatten it into a scalar so we can use it in our confidence interval: obsInfo <- c(obsInfo)
  • Use qnorm or your standard Normal tables to find the critical \(z^*_{\alpha/2}\) value for a 95% confidence interval. Save this as zcrit.
  • Combine mle, zcrit, and obsInfo to evaluate an approximate 95% confidence interval for \(\lambda\).
  • Return to (and possibly re-draw) the plot of your log-likelihood function, with the MLE indicated as a vertical line. Add the limits of the confidence interval to your plot as vertical green lines.
  • Comment on the reliability of your estimate.
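The whole interval construction might be sketched as follows (the starting value and bounds are the same assumed choices as before, and your exact numbers will depend on your earlier results):

```r
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
n <- length(traffic); s <- sum(traffic)
logLike <- function(lambda) -n * lambda + s * log(lambda)

# Re-run the optimisation, also requesting the Hessian at the maximum
results <- optim(3, logLike, lower = 0, upper = 12, method = 'Brent',
                 control = list(fnscale = -1), hessian = TRUE)

mle <- results$par
obsInfo <- c(-results$hessian)  # observed information, flattened to a scalar

# Approximate 95% interval: mle +/- z* / sqrt(obsInfo)
zcrit <- qnorm(0.975)
ci <- mle + c(-1, 1) * zcrit / sqrt(obsInfo)
ci
```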