First, it is useful to recall that the datasets available in the packages installed in RStudio can be viewed via the command data():
data() # datasets in all installed packages
data(package = 'MASS') # datasets in package MASS
Today we will work with the iris dataset, which is a built-in dataset. First, let's load the dataset and view some of its characteristics.
data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150 5
head(iris, n=5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
?iris
The last command opens the help window where we can find information about these data. As we see, it contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
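As a quick check, we can tabulate the species variable to confirm that there are indeed 50 flowers per species:
table(iris$Species) # should show 50 observations for each of the three species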
Initially, let's store the numerical data in an object y, set the sample size n and produce some visualisations of the data.
y = iris[, 1:4]
n = nrow(y)
par(mfrow = c(2,2))
for (i in 1:4) {
hist(y[,i], col=i+1, main = '', xlab = names(y)[i])
}
pairs(y)
We see that the distributions of petal length and petal width appear to be bimodal. We also see a clear indication of the existence of two groups. Let's start by estimating non-parametrically the density of petal length.
Load the BNPmix library
library(BNPmix)
## Warning: package 'BNPmix' was built under R version 4.2.3
This will likely work immediately on a lab PC. If not, install the package first (as below) and then load the library (likewise if you are working on your own laptop and have not already installed the package):
install.packages('BNPmix')
The package BNPmix uses a very efficient C++ implementation for posterior sampling (Gibbs algorithms) from complex Dirichlet process mixtures. In fact, BNPmix uses the Pitman-Yor (PY) process, which is a generalization of the DP (the DP is a special case). We will not have time to learn about this process, but in practical terms it is sufficient to know that, in the notation we use in the lectures, we would assume that \[ G \sim \mathrm{PY}(\beta,\alpha,G_0), \] where \(\beta \in [0,1)\) is called the discount parameter, \(\alpha > -\beta\) the strength parameter and \(G_0\) is the usual centering measure. When the discount is zero, i.e. \(\beta=0\), we have a \(\mathrm{DP}(\alpha,G_0)\) as we know it from lectures.
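For intuition only, the stick-breaking construction of the PY process can be simulated directly. In the sketch below, sim_py_weights is an illustrative helper (not a BNPmix function) that draws the first K stick-breaking weights; with the discount set to zero it reduces to the DP stick-breaking we saw in lectures.
# Stick-breaking weights of PY(beta, alpha, G0):
# V_k ~ Beta(1 - beta, alpha + k * beta), pi_k = V_k * prod_{l < k} (1 - V_l)
# (only the first K weights of the infinite sequence, so they need not sum to 1)
sim_py_weights = function(K, alpha, beta = 0) {
  V = rbeta(K, shape1 = 1 - beta, shape2 = alpha + (1:K) * beta)
  V * cumprod(c(1, 1 - V[-K]))
}
set.seed(1)
round(sim_py_weights(K = 10, alpha = 1), 3)              # DP(1, G0): discount = 0
round(sim_py_weights(K = 10, alpha = 1, beta = 0.5), 3)  # PY with discount 0.5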
For univariate DPM models the package has the capability to fit infinite mixtures of normal densities via the stick-breaking representation \[\begin{equation} p(y_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2), \label{eq1} \end{equation}\] with prior distributions for \(\mu_k\) and \(\sigma_k^2\) based on a conditional normal/inverse-gamma centering measure \(G_0(\mu_k,\sigma_k^2)=G_0(\mu_k ~|~ \sigma_k^2)G_0(\sigma_k^2)\) as follows \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0),\] where \(a_0>0, b_0>0\) are the shape and scale parameters, respectively, so that \(\mathbb{E}[\sigma_k^2]=\displaystyle\frac{b_0}{a_0-1}\) for \(a_0>1\), and \(\mathbb{V}\mathrm{ar}[\sigma_k^2]=\displaystyle\frac{b_0^2}{(a_0-1)^2(a_0-2)}\) for \(a_0>2\). The prior on \(\mu_k\) is a little different from the prior we presented in lectures, as it is conditional on \(\sigma_k^2\), but this is also a common option.
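As a small sketch (with illustrative hyper-parameter values, not the package defaults), component-specific parameters can be drawn from this centering measure as follows:
set.seed(1)
K = 5; m0 = 0; k0 = 1; a0 = 2; b0 = 1                # illustrative values
sigma2 = 1 / rgamma(K, shape = a0, rate = b0)        # sigma_k^2 ~ InvGamma(a0, b0)
mu = rnorm(K, mean = m0, sd = sqrt(sigma2 / k0))     # mu_k | sigma_k^2 ~ N(m0, sigma_k^2 / k0)
cbind(mu, sigma2)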
BNPmix uses two basic functions. The first is PYcalibrate, which requires the specification of three basic arguments:
PYcalibrate(Ek, n, discount = 0)
- Ek: a prior expectation for the number of clusters. This must be specified; for instance, if we expect to have 4 clusters we can set it equal to 4 (or close to 4). If we have no expectation at all, we can set it equal to some large number.
- n: the sample size.
- discount: the discount parameter (default is 0, resulting in a DP).
A quick illustrative call is shown below this list.
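For example, with illustrative values for the expected number of clusters and the sample size, the call returns a list with elements strength and discount:
PYcalibrate(Ek = 4, n = 100) # returns the calibrated strength (and the discount, here 0)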
The function for density estimation via posterior sampling is PYdensity. Typing ?PYdensity will open the help window. This function requires the specification of four arguments:
PYdensity(y, mcmc = list(), prior = list(), output = list())
Let's go through the above one by one and see which are the most important arguments we need to define.
- y: the data vector (for univariate estimation) or matrix (for multivariate estimation).
- mcmc: list with arguments for the MCMC implementation:
  - niter: number of MCMC iterations.
  - nburn: number of initial iterations to discard (the “burn-in” period).
- prior: list with arguments for the prior specification:
  - discount: the discount parameter (here it must be set to 0 for a DP).
  - strength: the strength parameter of the PY process or the corresponding precision of the DP.
  - hyper: logical; when TRUE (which is the default option) a hierarchical DPM model (seen later) is implemented, when FALSE a standard DPM model is implemented.
- output: list with arguments for the output returned by the function:
  - grid: important as it defines the grid of points at which to evaluate the estimated posterior mean density. For instance, if \(y\) is univariate, we can define the grid to range from \(\min(y)\) to \(\max(y)\) and to have a specific length (e.g., 100).
  - out_param: logical with default equal to FALSE. Recommended to set it to TRUE: this returns the full set of parameters in the model, e.g., also the \(\mu_k\), \(\sigma_k^2\), etc.
Let's apply a DPM model to the variable Petal Length from the iris dataset, whose density certainly does not appear unimodal. So, in R we are going to use the vector y[,3].
Tasks:
1. In PYcalibrate set the number of expected clusters equal to 3 and also the sample size.
2. Define an object mcmc as a list with elements niter equal to 11000 and nburn equal to 1000.
3. Define an object prior as a list with elements strength equal to 1, discount equal to 0 and hyper equal to FALSE.
4. Define an object output as a list with element grid defined as a sequence from the min of y[,3] to the max of y[,3] with length equal to 100 (type ?seq for help) and element out_param set equal to TRUE.
5. Then set.seed(1) (for reproducibility) and execute fit1 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output).
If everything is done correctly, the command summary() will return the following output for fit1:
summary(fit1)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 5.23 seconds
## Average number of groups: 3.5399
## Min number of groups: 2 ; Max number of groups: 12
This provides information about the runtime, the number of iterations and the burn-in period used. Importantly, it tells us that over the 10000 retained MCMC iterations the minimum number of clusters formed was 2 and the maximum was 12, with the average number of clusters being about 3.5 (4 when rounded).
If we want to overlay the posterior mean density on top of the histogram and also show the 95% posterior credible bands for the density, we can use the line of code below. Overall, we see that the 95% credible bands capture the uncertainty in the data almost everywhere.
plot(fit1, band = TRUE, show_hist = TRUE, xlab = "Petal length", ylab = 'Density')
Note that BNPmix transfers arguments from the package function plot.BNPdens to the generic R function plot. Standard arguments within plot such as col, xlab, etc. can be modified (but not all). Typing ?plot.BNPdens reveals the list of arguments. We can see, for instance, that we can suppress the credible bands or change them (the default option is conf_level = c(0.025, 0.975)). We can also add the average posterior cluster assignment of the observations, shown as coloured points on the \(x\)-axis, as follows.
plot(fit1, band = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density')
We saw previously that the command summary(fit1) provides the average number of clusters. However, what about the most frequently visited number of clusters? We are more interested in this quantity as it is the maximum a posteriori (MAP) estimate of the number of clusters. Unfortunately, BNPmix does not provide an immediate command for that, but there is a workaround with some additional coding. First, let's have a look at the names of the outputs that fit1 contains.
names(fit1)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
We are interested in the output clust, which contains the cluster label of each observation for each of the 10000 posterior samples. Let's have a closer look.
cluster.labels = fit1$clust
class(cluster.labels)
## [1] "matrix" "array"
dim(cluster.labels)
## [1] 10000 150
cluster.labels[1,] # labels from 1st MCMC sample; 6 distinct cluster labels appear
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 3 0 3 0 5 3 0 3 0 3 0
## [112] 0 0 0 0 3 0 3 3 0 0 0 3 0 4 0 0 0 0 0 0 3 0 0 3 0 0 0 0 0 0 0 0 4 0 0 3 0
## [149] 0 0
Below, n.cluster is built in three steps: applying unique to each row of cluster.labels gives, for every MCMC sample, the distinct cluster labels visited; lapply(..., length) then returns the number of clusters per sample; and unlist just transforms the class of n.cluster from list to vector of integers (as verified in the last line of code below).
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
length(n.cluster); class(n.cluster)
## [1] 10000
## [1] "integer"
Almost there! Now we can just use the commands table() and plot() to create a (relatively) nice figure of the posterior probabilities of the number of clusters.
counts = table(n.cluster)
counts
## n.cluster
## 2 3 4 5 6 7 8 9 10 11 12
## 2368 3143 2364 1339 523 191 52 15 1 1 3
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters', ylab = 'Posterior probability', main = 'alpha = 1')
As seen, the most visited model is the one with \(K=3\), although \(K=2\) also has a relatively high posterior probability (but so does the model with \(K=4\)). This is not exactly what we expected based on the exploratory visualisations of the dataset at the beginning.
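As a side note, the MAP estimate of the number of clusters can also be extracted programmatically from the table of counts:
as.integer(names(which.max(counts))) # most frequently visited number of clusters (MAP)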
We may wish to improve this aspect of the model (although that is not very important for density estimation). Could it be that the results are affected by setting the expected number of clusters equal to 3? We did this by setting Ek = 3 in the function PYcalibrate. You may consider changing this option to Ek = 2 (left as an exercise), but as you will see it will not affect things (at least under set.seed(1)).
What we can consider, however, is changing the prior precision of the DP (the strength in terms of the PY process). Previously, we defined the prior through
prior = list(strength = 1, discount = 0, hyper = 'FALSE')
BNPmix has an internal procedure for automatically selecting the strength parameter based on the sample size and the discount parameter. We can find this value as follows:
DPprior = PYcalibrate(Ek = 3, n = n, discount = 0)
DPprior
## $strength
## [1] 0.3938936
##
## $discount
## [1] 0
Task:
6. Refit the model (with the same seed) using the above value for strength and plot the resulting posterior probabilities of the number of clusters for this model.
If this is done correctly, the results should be as shown below. Now the MAP estimate for the number of clusters is, indeed, \(K=2\).
counts
## n.cluster
## 2 3 4 5 6 7
## 5630 3219 899 207 42 3
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters', ylab = 'Posterior probability', main = 'alpha = 0.3938936')
Above, in 3.1 we estimated the posterior density and its related uncertainty, and in 3.2 we inferred the most likely a posteriori number of clusters. These two are typically the main goals of nonparametric analyses. However, what about posterior inference for the parameters: is that still possible, and does BNPmix provide this extra inferential option?
The answer is yes, but of course the only way posterior summaries of parameters would make sense is when we work conditional upon a given \(K\). A sensible option is to select the MAP estimate, which is \(K=2\) in this application. So we will basically infer about the 2-component mixture of the form
\[ p(y_i) = \sum_{k=1}^2\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2). \] Clarification note: The results below are not equivalent to the results obtained from using a 2-component Bayesian normal mixture model as a starting point (even if the prior specification is similar in form). If the main aim is inference about parameters, then a more disciplined approach would be to first determine the number of clusters (as we did above) and then infer (using MCMC) from a finite 2-component mixture. That being said, in practical terms the results will likely look quite similar!
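As an aside, if one wanted to condition strictly on the MAP value, a rough workaround is to recompute n.cluster for the refitted model (as in Section 3.2) and keep only the MCMC draws that visited exactly two clusters; the summaries below instead simply use the first two components of every draw.
keep = which(n.cluster == 2) # indices of draws with exactly two occupied clusters
length(keep)                 # how many such draws are retained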
Let's see how we can extract posterior summaries for the means \(\mu_1\) and \(\mu_2\). The key thing is to remember that we can see the objects contained in fit1 via names(fit1), as we have seen above. The object of interest in this case is mean. Below we store this information in a new object means; this is a list: means[[1]] and means[[10000]] show the resulting means from the first and last MCMC sample (the burn-in samples are removed automatically).
means = fit1$mean
means[[1]]; means[[10000]]
## [,1]
## [1,] 4.985267
## [2,] 1.453260
## [3,] 1.871344
## [4,] 7.050652
## [5,] 3.817440
## [6,] 14.606507
## [,1]
## [1,] 5.065751
## [2,] 1.467006
## [3,] 1.333473
Now, we are only interested in the posterior samples of the first two components, so let's initially create a matrix of NAs with 10000 (the MCMC sample size) rows and \(K=2\) columns. Next, we fill the matrix post.means row by row, with the first column containing the samples of \(\mu_1\) and the second the samples of \(\mu_2\).
post.means = matrix(NA, nrow = 10000, ncol = 2)
for(i in 1:10000){
post.means[i,1] = means[[i]][1]
post.means[i,2] = means[[i]][2]
}
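The same matrix can equivalently be built in one line (a compact alternative to the loop above):
post.means = t(sapply(means, function(m) m[1:2])) # 10000 x 2 matrix of (mu_1, mu_2) draws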
We now have everything we require and can evaluate posterior summaries of \((\mu_1,\mu_2)\).
summary(post.means)
## V1 V2
## Min. :3.825 Min. :-0.9906
## 1st Qu.:4.846 1st Qu.: 1.4433
## Median :4.906 Median : 1.4627
## Mean :4.909 Mean : 1.4697
## 3rd Qu.:4.968 3rd Qu.: 1.4816
## Max. :6.743 Max. : 6.6708
We see, for instance, that the posterior means are \(\hat{\mu}_1 = 4.909\) and \(\hat{\mu}_2 = 1.4697\). Does this generally make sense? It seems so: if we look again at the histogram of the variable Petal length, one group mode appears to be around 5 while the other is somewhere around 1.5 (approximately). We can also produce relatively nice-looking density plots.
par(mfrow = c(1,2))
plot(density(post.means[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3)
We observe that the posterior density of \(\mu_2\) is certainly peaked around the posterior mode, but it is also quite spread out, taking values in extreme regions with very low probability density. We can improve this plot by controlling the \(x\)-axis range via the argument xlim within plot, as done below.
par(mfrow = c(1,2))
plot(density(post.means[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3, xlim = c(1,2))
Tasks:
7. Produce similar posterior summaries and density plots for \(\sigma_1^2\) and \(\sigma_2^2\) (hint: for the plot of \(\sigma_2^2\) use xlim = c(0,0.2)).
8. Produce similar posterior summaries and density plots for \(\pi_1\) and \(\pi_2\).
If done correctly, you should get the following results and figures:
## V1 V2
## Min. :0.07962 Min. :0.006171
## 1st Qu.:0.59904 1st Qu.:0.031904
## Median :0.66386 Median :0.037288
## Mean :0.66643 Mean :0.038879
## 3rd Qu.:0.73513 3rd Qu.:0.043799
## Max. :1.27629 Max. :0.523315
## V1 V2
## Min. :0.009637 Min. :0.001542
## 1st Qu.:0.625312 1st Qu.:0.301021
## Median :0.656984 Median :0.327965
## Mean :0.642240 Mean :0.327050
## 3rd Qu.:0.685708 3rd Qu.:0.355353
## Max. :0.809523 Max. :0.485146
Finally, let us discuss the aspect of setting the hyper-parameters of the model. As mentioned above, the prior assumptions under BNPmix are the following: \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0).\] The default options used by BNPmix are as follows: \[m_0 = \bar{y}= \frac{1}{n}\sum_iy_i, ~k_0 =1, ~a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] So, as we see, BNPmix actually uses an empirical Bayes procedure where the prior hyper-parameters \(m_0\) and \(b_0\) are estimated from the data! Practically, this is fine and quite understandable from the side of the developers, as it is a generally “safe” option ensuring that the algorithmic procedures implemented “under the hood” provide stable estimates (or at least, do not crash completely) across different types of datasets.
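To see what these defaults amount to for the variable we analysed (Petal length), we can compute them directly:
mean(y[,3]) # default m0 for Petal length
var(y[,3])  # default b0 for Petal length (with k0 = 1 and a0 = 2)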
That being acknowledged, it should be mentioned that hard-core Bayesians would not like this approach at all! Even more moderate Bayesian statisticians would like to have some control over this and potentially apply a fully Bayesian approach. So the question is: can we change the prior specification?
The answer is yes. The setting of the prior hyper-parameters can be controlled via the prior list, which, apart from the arguments listed above in Section 2.3, can take the following additional arguments:
- prior: list with arguments for the prior specification:
  - discount: the discount parameter (here it must be set to 0 for a DP).
  - strength: the strength parameter of the PY process or the corresponding precision of the DP.
  - hyper: provides control for choosing a DPM or a hierarchical DPM.
  - m0: the prior mean of each \(\mu_k\).
  - k0: the scalar \(k_0>0\) controlling the variance of each \(\mu_k\).
  - a0: the shape \(a_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\).
  - b0: the scale \(b_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\).
Suppose we want to adopt a non-informative prior such that \[\mu_k ~|~\sigma_k^2\sim\mathrm{N}(0,\sigma^2_k) ~\mbox{and}~\sigma_k^2\sim\mathrm{InvGamma}(1,1000).\] Note that for \(a_0=1\) the mean and the variance are not finite (the formulas above require \(a_0>1\) and \(a_0>2\), respectively), but we can still check empirically that draws from this inverse-gamma are typically very large; for instance:
set.seed(1)
mean(1/rgamma(1000, shape = 1, rate = 1000))
## [1] 450650.4
var(1/rgamma(1000, shape = 1, rate = 1000))
## [1] 694001710
Above we made use of the following relationship: if \(X\sim \mathrm{Gamma}(a,b)\) (where \(b\) is the rate parameter), then \(1/X\sim \mathrm{InvGamma}(a,b)\).
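As a quick sanity check of this relationship and of the mean formula \(b_0/(a_0-1)\) given earlier, we can compare a Monte Carlo average with the theoretical value in a case where the mean exists, e.g. \(a_0=3\) and \(b_0=2\):
set.seed(1)
mean(1/rgamma(100000, shape = 3, rate = 2)) # should be close to 2/(3-1) = 1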
Tasks:
9. Using the automatic BNPmix option for setting strength, while discount should always be equal to 0 and hyper equal to FALSE, define the prior list so that \(m_0=0\), \(k_0=1\) (this is actually the default) and \(a_0=1, ~ b_0=1000\). Then run the new model (without changing the rest of the previous specifications) using again set.seed(1) and call this fit2.
10. Produce the two density plots we saw above, now for fit2 (change the colour; e.g., use col = 2 within the plot).
11. Visualise, as done before, the posterior probabilities for the number of clusters under model fit2.
12. Calculate posterior summaries and produce posterior density plots for the means under the MAP model resulting from fit2 (use again xlim = c(1,2) for the second mean component).
Results should look as follows:
## n.cluster
## 2 3 4 5 6 7
## 5530 3309 950 168 42 1
Did results change under the modified prior specification?
In this application the related inference was basically unaffected. So, we avoided using empirical Bayes and went with a fully Bayesian approach which, in the end, provided more or less the same results.
1) The suggestion above was to use \(a_0=1\) and \(b_0=1000\); these result in a diffuse prior with a large mean and very large variance. This choice came after performing some experiments of my own using a more common approach, namely setting \(a_0=b_0\) to be very small. It turned out that values smaller than 1 either provide unstable estimates for the parameters (e.g., for \(a_0=b_0=0.1\)) or result in a complete crash (e.g., for \(a_0=b_0=0.001\))! So, with this package it seems better to create a diffuse inverse-gamma prior by setting \(a_0\) small and \(b_0\) large (this is something you can explore further in other applications).
2) Sometimes things can go wrong for reasons that are obscure. For instance, let's say that we want to analyse Petal width (the 4th column of y). In this case:
prior = list(strength = DPprior$strength, discount = 0, hyper = 'FALSE')
output = list(grid = seq(min(y[,4]), max(y[,4]), length.out = 100), out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y[,4], mcmc = mcmc, prior = prior, output = output)
If you execute this code you will receive the following error message:
Error in cICS(y, grid_use, niter, nburn, m0, k0, a0, b0, m1, s21, tau1, : randg(): incorrect distribution parameters; a and b must be greater than zero
Now, the fact that the error starts with cICS gives reason to suspect that something is going wrong with the MCMC sampler used. BNPmix can implement three algorithms: an Importance Conditional Sampler (ICS), a Marginal Sampler (MAR) and a Slice Sampler (SLI). These can be controlled via the mcmc list through the extra argument method. The default option is method = 'ICS'.
However, before changing the algorithm, let's first consider a practical solution that might help. In these types of applications it is sometimes better to scale the data first (for each column, subtract its mean and divide by its standard deviation). So, let's try this:
y.scaled = scale(y)
apply(y.scaled,2,mean); apply(y.scaled,2,var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## -4.484318e-16 2.034094e-16 -2.895326e-17 -3.663049e-17
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
output = list(grid = seq(min(y.scaled[,4]), max(y.scaled[,4]), length.out = 100), out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y.scaled[,4], mcmc = mcmc, prior = prior, output = output)
summary(fit3)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 2.51 seconds
## Average number of groups: 4
## Min number of groups: 4 ; Max number of groups: 4
plot(fit3, band = TRUE, show_hist = TRUE, xlab = "Petal width", ylab = 'Density', col = 3)
Well, in this case we managed to run the sampler, but the results are awful! So, let's change the sampler and use MAR instead.
mcmc = list(niter = 11000, nburn = 1000, method = 'MAR')
output = list(grid = seq(min(y[,4]), max(y[,4]), length.out = 100), out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y[,4], mcmc = mcmc, prior = prior, output = output)
## Completed: 1100/11000 - in 0.468 sec
## Completed: 2200/11000 - in 0.967 sec
## Completed: 3300/11000 - in 1.496 sec
## Completed: 4400/11000 - in 1.975 sec
## Completed: 5500/11000 - in 2.503 sec
## Completed: 6600/11000 - in 2.966 sec
## Completed: 7700/11000 - in 3.585 sec
## Completed: 8800/11000 - in 4.055 sec
## Completed: 9900/11000 - in 4.534 sec
## Completed: 11000/11000 - in 5.05 sec
##
## Estimation done in 5.05 seconds
summary(fit3)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 5.05 seconds
## Average number of groups: 3.2712
## Min number of groups: 2 ; Max number of groups: 30
plot(fit3, band = TRUE, show_hist = TRUE, xlab = "Petal width", ylab = 'Density', col = 3)
It works! Plus, it will also work without scaling the data.
Warning note: If you try to run this with method = 'SLI', RStudio will crash for some reason!