1 Iris dataset

Iris flower from the sibirica species

We will continue working with the iris dataset, using the package BNPmix. So, first let's load the data and the library again.

data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150   5
head(iris, n=5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
y = iris[, 1:4]
n = nrow(y)
library(BNPmix)

General note about random seeds: even when using set.seed(), the results can differ across operating systems (e.g., Linux, Windows, Mac). Based on some (minimal) research on this, it is recommended to use the following custom function, which fixes the generator options in set.seed():

my.set.seed = function(n) set.seed(n, kind = "Mersenne-Twister", normal.kind = "Inversion")

2 Hierarchical univariate DPM model in BNPmix

We have seen in the first practical class that the package exploits the stick-breaking representation for fitting DPM models. Specifically, we have
\[\begin{equation} p(y_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2), \end{equation}\] with prior distributions for \(\mu_k\) and \(\sigma_k^2\) based on a conditionally conjugate normal/inverse-gamma centering measure, such that \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0),\] where \(a_0\) and \(b_0\) are the shape and scale parameters of the inverse-gamma distribution, respectively. We have also seen that the default options in BNPmix are: \[m_0 = \bar{y}= \frac{1}{n}\sum_iy_i, ~k_0 =1, ~a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] As discussed, this is an empirical Bayes approach, which can work well in practice. Of course, as we have demonstrated, the above can be customised by changing the arguments m0, k0, a0 and b0 in the prior list.
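
For concreteness, here is a minimal sketch (not part of the original analysis) of how these defaults could be computed from Petal length (y[,3]) and passed explicitly through the prior list; strength = 1 is just a placeholder value.

m0.default = mean(y[, 3])    # sample mean of Petal length
b0.default = var(y[, 3])     # equals sum((y - ybar)^2) / (n - 1)
prior.default = list(strength = 1, discount = 0, hyper = FALSE,   # hyper = FALSE keeps m0, k0, b0 fixed (see below)
                     m0 = m0.default, k0 = 1, a0 = 2, b0 = b0.default)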

Now in hierarchical DPMs the above parameters are random, so we have a further layer of hierarchy. The hyper-priors used by BNPmix are

\[ m_0 \sim \mathrm{N}(m_1,\sigma_1^2), ~ k_0 \sim \mathrm{Gamma}(\tau_1,\zeta_1), ~ b_0 \sim \mathrm{Gamma}(a_1,b_1), \] where \(\tau_1\) and \(a_1\) are shape parameters, and \(\zeta_1\) and \(b_1\) are rate parameters of the Gamma distributions. As we can see, BNPmix does not allow placing a hyper-prior on \(a_0\) (reason unknown). Apart from that, the hierarchical model formulation is the same as the one presented in the lectures, albeit with a different notation for the parameters.

Under the hierarchical DPM it is the hyper-parameters \(m_1, ~ \sigma_1^2, ~ \tau_1, ~ \zeta_1, ~ a_1\) and \(b_1\) that are fixed. The default options in BNPmix are \[m_1 = \bar{y}, ~ \sigma_1^2 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2, ~\tau_1=\zeta_1 =1, ~a_1 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2, ~b_1 = 1. \] As seen, in this case the empirical approach is transferred to some of the hyper-parameters. Once again the model can be customised, as the above hyper-parameters can be changed within the prior list. It will be useful to have another look at this list and all of its relevant arguments:
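
For reference, a fully customised hierarchical prior list could look like the sketch below. The names m1, s21, a1 and b1 appear later in this practical; tau1 and zeta1 are our assumption for the Gamma hyper-prior on k0, so check ?PYdensity for the exact argument names and defaults.

prior.custom = list(strength = 1, discount = 0, hyper = TRUE,
                    m1 = 0, s21 = 100,    # N(m1, s21) hyper-prior for m0
                    tau1 = 1, zeta1 = 1,  # assumed names: Gamma(tau1, zeta1) hyper-prior for k0
                    a0 = 2,               # a0 stays fixed (no hyper-prior available)
                    a1 = 1, b1 = 1)       # Gamma(a1, b1) hyper-prior for b0
?PYdensity                                # full documentation of the prior list and its defaults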

3 A HDPM model for variable petal length in Iris dataset

We will continue working with the variable Petal length from this dataset. The code is presented below. We set Ek = 2 (our prior expectation for the number of clusters) in PYcalibrate and use the automatic procedure of BNPmix to set the strength of the PY process (the precision of the DP); this is done by setting strength = DPprior$strength in the prior list (the fourth code line below). In order to fit the HDPM, the only thing we need to do is to set hyper = TRUE in the same prior list (this is actually the default option, so it is not strictly required, but it is better to always state this argument explicitly).

DPprior = PYcalibrate(Ek = 2, n = n, discount = 0)
DPprior
## $strength
## [1] 0.1882315
## 
## $discount
## [1] 0
mcmc = list(niter = 11000, nburn = 1000)
prior = list(strength =  DPprior$strength, discount = 0, hyper = TRUE)
output = list(grid = seq(min(y[,3]), max(y[,3]), length.out = 100), out_param = TRUE)
my.set.seed(1)
fit4 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output)
## Completed:   1100/11000 - in 0.207 sec
## Completed:   2200/11000 - in 0.407 sec
## Completed:   3300/11000 - in 0.626 sec
## Completed:   4400/11000 - in 0.832 sec
## Completed:   5500/11000 - in 1.035 sec
## Completed:   6600/11000 - in 1.25 sec
## Completed:   7700/11000 - in 1.46 sec
## Completed:   8800/11000 - in 1.661 sec
## Completed:   9900/11000 - in 1.868 sec
## Completed:   11000/11000 - in 2.072 sec
## 
## Estimation done in 2.072 seconds
summary(fit4)
## PYdensity function call:
##  1000    burn-in iterations
##  11000   iterations 
##  Global estimation time: 2.07 seconds
##  Average number of groups:  2.2217 
##  Min number of groups:  2 ; Max number of groups:  5

In this case, the HDPM generally visits fewer models (in comparison to the DPM used in Practical 1), as the minimum number of clusters is 2 and the maximum is 5 (with the average being around 2.2). Below we see the posterior mean and the 95% credible bands of the density, and the average posterior cluster assignments of the observations overlaid on the histogram.

plot(fit4, band = TRUE, show_hist = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density')

Assume now that we want to avoid using empirical Bayes and prefer to define our model based on the following hyper-priors \[ m_0 \sim \mathrm{N}(0,1000), ~ b_0 \sim \mathrm{Gamma}(0.001,0.001),\] keeping the rest fixed at \(k_0 = 1\) and \(a_0 = 3\). Above we have a diffuse non-informative hyper-prior for \(m_0\) (with large variance) and a non-informative hyper-prior for \(b_0\), since for the given values the mean (\(\mathbb{E}[b_0]=a_1/b_1\)) is 1 and the variance (\(\mathbb{V}\mathrm{ar}[b_0]=a_1/b_1^2\)) is 1000. Note that the "non-informativeness" is transferred to the inverse-gamma prior of \(\sigma_k^2\), since we have that \[\mathbb{E}[\sigma_k^2] = \frac{b_0}{a_0-1} = \frac{b_0}{2} ~ \mbox{and} ~ \mathbb{V}\mathrm{ar}[\sigma_k^2] = \frac{b_0^2}{(a_0-1)^2(a_0-2)} = \frac{b_0^2}{4},\] where \(b_0\) can take very large values. The specification in BNPmix is as follows:

prior = list(strength =  DPprior$strength, discount = 0, hyper = TRUE, m1 = 0, s21 = 1000, k0 = 1, a0 = 3, a1 = 0.001, b1 = 0.001)
mcmc = list(niter = 11000, nburn = 1000, print_message = FALSE)
my.set.seed(1)
fit5 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output)
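
As a quick side check of the claim about the hyper-prior on \(b_0\) (a small computation, not part of the model fit):

a1 = 0.001; b1 = 0.001
c(mean = a1 / b1, variance = a1 / b1^2)   # 1 and 1000, as stated above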

The print_message = FALSE argument in mcmc simply suppresses the printing of the MCMC progress messages. As usual, we can look at the summary and the posterior density plot.

summary(fit5)
## PYdensity function call:
##  1000    burn-in iterations
##  11000   iterations 
##  Global estimation time: 2.18 seconds
##  Average number of groups:  2.3001 
##  Min number of groups:  2 ; Max number of groups:  6
plot(fit5, band = TRUE, show_hist = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density', col = 7)

Finally, as demonstrated in the first practical class, we can obtain the MAP estimate of the number of clusters and also examine the posterior distributions of all relevant parameters. For instance, concerning the former, we have the following result:

cluster.labels = fit5$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
##    2    3    4    5    6 
## 7407 2224  334   31    4
plot(counts/sum(counts), col = 1:length(counts),  bty = 'l', xlab = 'Number of clusters', ylab = 'Distribution of clusters')
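
As a small addition, the MAP estimate of the number of clusters can be read off directly from the counts table:

names(which.max(counts))   # most frequently visited number of clusters; here "2"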

4 Multivariate DPM and HDPM models with BNPmix

We have seen the setting of multivariate DPMs and HDPMs in lectures; now let's see how these are handled in BNPmix and the corresponding notation. In this setting, we will assume that we have a data matrix \(\mathbf{Y}\) of dimensionality \(n\times q\). The stick-breaking representation is
\[\begin{equation} p(\mathbf{y}_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}_q(\mathbf{y}_i ~|~ \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \end{equation}\] for \(i\in \{1,\ldots,n\}\) and \(\mathbf{y}_i\in \mathbb{R}^q\), with prior distributions for \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}_k\) based on the multivariate conditionally conjugate normal/inverse-Wishart centering measure. Specifically, \[\boldsymbol{\mu}_k ~|~ \boldsymbol{\Sigma}_k\sim G_0(\boldsymbol{\mu}_k ~|~ \boldsymbol{\Sigma}_k)\equiv \mathrm{N}_q(\mathbf{m}_0, \boldsymbol{\Sigma}_k/k_0) \mbox{ and } \boldsymbol{\Sigma}_k\sim G_0(\boldsymbol{\Sigma}_k) \equiv \mathrm{InvWishart}(\boldsymbol{\Sigma}_0,\nu_0),\] where \(\boldsymbol{\Sigma}_0\) and \(\nu_0\) are the scale matrix and degrees-of-freedom parameter of the inverse-Wishart distribution, respectively. The default options in BNPmix are: \[\mathbf{m}_0 = \bar{\mathbf{y}}= (\bar{y}_1,\ldots,\bar{y}_q)^T, ~k_0 =1, ~\nu_0 =q+2, ~\boldsymbol{\Sigma}_0 = \hat{\boldsymbol{\Sigma}}_\mathbf{y},\] where \(\hat{\boldsymbol{\Sigma}}_\mathbf{y}\) is the sample variance-covariance estimate of \(\mathbf{y}\). Customisation, as we will see, is again possible; a small sketch of these empirical defaults is given right below, and we then turn to the corresponding hierarchical setting.
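
A minimal sketch, assuming one wanted to compute these empirical defaults and pass them explicitly (m0 and Sigma0 are prior arguments that we also use later; strength = 1 is just a placeholder value):

m0.default.mv = as.numeric(colMeans(y))   # sample mean vector, ybar
Sigma0.default.mv = cov(y)                # sample variance-covariance matrix of y
prior.mv.default = list(strength = 1, discount = 0, hyper = FALSE,
                        m0 = m0.default.mv, Sigma0 = Sigma0.default.mv)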

Similarly to the univariate case, here we can also have deeper layers of hierarchy. The hyper-priors used by BNPmix for multivariate HDPMs are as follows

\[ \mathbf{m}_0 \sim \mathrm{N}_q(\mathbf{m}_1, \mathbf{S}_1), ~ k_0 \sim \mathrm{Gamma}(\tau_1,\zeta_1), ~ \boldsymbol{\Sigma}_0 \sim \mathrm{InvWishart}(\boldsymbol{\Sigma}_1,\nu_1).\] As seen, similarly to the univariate case for parameter \(a_0\), BNPmix does not allow a hyper-prior for parameter \(\nu_0\) in the multivariate case. The default BNPmix options for the hyper-parameters \(\mathbf{m}_1, ~ \mathbf{S}_1, ~ \tau_1, ~ \zeta_1, ~ \nu_1\) and \(\boldsymbol{\Sigma}_1\) are: \[\mathbf{m}_1 = \bar{\mathbf{y}}= (\bar{y}_1,\ldots,\bar{y}_q)^T, ~ \mathbf{S}_1 = \hat{\boldsymbol{\Sigma}}_\mathbf{y}, ~ \tau_1=\zeta_1 =1, ~\nu_1 = q+2, ~ \boldsymbol{\Sigma}_1 = (q+2)^{-1}\hat{\boldsymbol{\Sigma}}_\mathbf{y}. \]

Now let us have a look at the prior list and the relevant arguments for the multivariate case:
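
For instance, the help page of PYdensity documents the prior argument, including the multivariate options and their defaults:

?PYdensity   # see the description of the 'prior' argument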

5 Multivariate application to Iris data

Now we will consider all four variables (Sepal Length, Sepal Width, Petal Length, Petal Width) in the Iris dataset for our Bayesian nonparametric analysis. In general, multivariate data analysis comes at a much higher computational cost, so in practice we must reduce the “demands” in terms of the input to BNPmix (at least for initial exploratory purposes).

Important points:

(1) The length of the grid support for evaluating the posterior multivariate density should be reduced drastically. Below we use a grid of length equal to 20 for each column of y:

grid.length = 20
grid1 = seq(min(y[,1]), max(y[,1]), length.out = grid.length)
grid2 = seq(min(y[,2]), max(y[,2]), length.out = grid.length)
grid3 = seq(min(y[,3]), max(y[,3]), length.out = grid.length)
grid4 = seq(min(y[,4]), max(y[,4]), length.out = grid.length)

Then, we must create a data frame from all possible combinations of the vectors grid1, grid2, grid3 and grid4 via the command expand.grid (type ?expand.grid for more information):

grid = expand.grid(grid1, grid2, grid3, grid4)
dim(grid)
## [1] 160000      4
20^4 # this checks!
## [1] 160000

(2) Another realistic “sacrifice” we need to make for practical purposes is to reduce drastically the total number of MCMC iterations (again, at least for initial exploratory purposes). Below we set the total MCMC sample size equal to 2000 and the burn-in sample size equal to 1000.

mcmc.total.n = 2000
mcmc.burnin = 1000

5.1 Fitting multivariate DPMs

With all of that being said, let's fit a DPM to the entire Iris dataset! Initially, we set the prior expected number of clusters equal to 2 (based on the previous analyses). Below we make explicit that the Gibbs sampler used is the importance conditional sampler, through the argument method = 'ICS' in mcmc (although this is the default option).

DPprior = PYcalibrate(Ek = 2, n = n, discount = 0)
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'ICS', print_message = FALSE)
prior = list(strength =  DPprior$strength, discount = 0, hyper = FALSE)
output = list(grid = grid, out_param = TRUE)
my.set.seed(1)
mult.fit1 = PYdensity(y = y, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit1)
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 46.76 seconds
##  Average number of groups:  1.001 
##  Min number of groups:  1 ; Max number of groups:  2

The result above is not very promising, given our previous findings. Under the current configuration only models with \(K=1\) and \(K=2\) clusters are visited, with the average being \(\hat{K}\approx 1\)! This indicates that it would be better to increase the strength (the DP’s precision); currently, we are using \(\alpha = 0.1882315\), as shown below, which is probably too small.

DPprior
## $strength
## [1] 0.1882315
## 
## $discount
## [1] 0

There are two ways of increasing the strength: (i) manually override it by setting, for example, strength = 0.5 in the prior list, or (ii) increase the prior expected number of clusters controlled by Ek. A sketch of option (i) is given right below; we then have a go at the second option and set Ek equal to 4 instead of 2.
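
For completeness, option (i) would amount to a one-line change in the prior list (a sketch only; we do not run it here):

prior.manual = list(strength = 0.5, discount = 0, hyper = FALSE)

Turning now to option (ii):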

DPprior = PYcalibrate(Ek = 4, n = n, discount = 0)
DPprior
## $strength
## [1] 0.6157496
## 
## $discount
## [1] 0
prior = list(strength =  DPprior$strength, discount = 0, hyper = FALSE)
my.set.seed(1)
mult.fit1 = PYdensity(y = y, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit1)
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 76.54 seconds
##  Average number of groups:  2.004 
##  Min number of groups:  2 ; Max number of groups:  3
cluster.labels = mult.fit1$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
##   2   3 
## 996   4

This looks more reasonable (based on what we already know), with the MAP estimate of \(K\) being equal to 2. We could proceed to model and parameter inference with posterior visualisations based on this model. However, let's first investigate whether we can abandon the data-dependent prior assumptions without affecting inference about the number of clusters.

Specifically, let's consider the model with \[\mathbf{m}_0 = (0,\ldots,0)^T, ~k_0 =1, ~\nu_0 =q+2, ~\boldsymbol{\Sigma}_0 = \mathrm{diag}(1000,\ldots, 1000),\] which is clearly not data-dependent. The change we have to make in order to run this model is simple and is as follows:

prior = list(strength =  DPprior$strength, discount = 0, hyper = FALSE, m0 = rep(0,4), Sigma0 = diag(1000,4))
my.set.seed(1)
mult.fit2 = PYdensity(y = y, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit2)
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 76.01 seconds
##  Average number of groups:  2.004 
##  Min number of groups:  2 ; Max number of groups:  3

5.2 Fitting multivariate HDPMs

We will not delve into much detail about applying multivariate HDPMs in this application. It suffices to note that a default HDPM model can be fit by simply using prior = list(strength = DPprior$strength, discount = 0, hyper = TRUE); a sketch is given below. If you run this model, you will see that the results from the default HDPM are essentially the same as those from the default DPM model (which is simpler). In addition, creating custom non-informative multivariate HDPMs is more complicated due to the added layers of hierarchy.
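
For reference, the sketch below shows what fitting the default multivariate HDPM would look like; mult.fit.hdpm is a hypothetical object name and the call is left commented out (not run here).

prior.hdpm = list(strength = DPprior$strength, discount = 0, hyper = TRUE)
my.set.seed(1)
# mult.fit.hdpm = PYdensity(y = y, mcmc = mcmc, prior = prior.hdpm, output = output)   # not run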

5.3 Multivariate data visualisations for the Iris dataset

Now let's use the output from the customised fully Bayesian model stored in mult.fit2 to explore and visualise the various inferences that can be drawn from the model.

Density estimation and clustering

The Iris dataset contains four variables; therefore, we can potentially plot four marginal univariate densities and twelve bivariate density panels (six unique pairs of variables, each shown in both orientations). For instance, if we want to plot the marginal density estimate of Sepal length and the bivariate density estimate of (Sepal length, Sepal width), we can use the following code:

p1 = plot(mult.fit2, dim = c(1, 1), xlab = "Sepal length", ylab = "Density")
p12 = plot(mult.fit2, dim = c(1, 2), show_clust = TRUE, xlab = "Sepal length", ylab = "Sepal width")
p1

p12

When we are dealing with many variables it is sometimes more convenient to combine the various plots and present them in a single figure. First, let's define all the plots that we want.

p1 = plot(mult.fit2, dim = c(1, 1), xlab = "Sepal length", ylab = "Density", col = 2)
p2 = plot(mult.fit2, dim = c(2, 2), xlab = "Sepal width", ylab = "Density", col = 3)
p3 = plot(mult.fit2, dim = c(3, 3), xlab = "Petal length", ylab = "Density", col = 4)
p4 = plot(mult.fit2, dim = c(4, 4), xlab = "Petal width", ylab = "Density", col = 5)

p12 = plot(mult.fit2, dim = c(1, 2), show_clust = TRUE, xlab = "Sepal length", ylab = "Sepal width", col = 2)
p13 = plot(mult.fit2, dim = c(1, 3), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal length", col = 2)
p14 = plot(mult.fit2, dim = c(1, 4), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal width", col = 2)

p21 = plot(mult.fit2, dim = c(2, 1), show_clust = TRUE, xlab = "Sepal width", ylab = "Sepal length", col = 3)
p23 = plot(mult.fit2, dim = c(2, 3), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal length", col = 3)
p24 = plot(mult.fit2, dim = c(2, 4), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal width", col = 3)

p31 = plot(mult.fit2, dim = c(3, 1), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal length", col = 4)
p32 = plot(mult.fit2, dim = c(3, 2), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal width", col = 4)
p34 = plot(mult.fit2, dim = c(3, 4), show_clust = TRUE, xlab = "Petal length", ylab = "Petal width", col = 4)

p41 = plot(mult.fit2, dim = c(4, 1), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal length", col = 5)
p42 = plot(mult.fit2, dim = c(4, 2), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal width", col = 5)
p43 = plot(mult.fit2, dim = c(4, 3), show_clust = TRUE, xlab = "Petal width", ylab = "Petal length", col = 5)

Now we can combine them using the command grid.arrange from the R package gridExtra (which we may need to install first and then load).

#install.packages('gridExtra')
library(gridExtra)
grid.arrange(p1, p12, p13, p14, p21, p2, p23, p24, p31, p32, p3, p34, p41, p42, p43, p4, layout_matrix = matrix(1:16, 4, 4))

The above code created a \(4\times4\) grid on the graphical device and placed the marginal density estimates on the main diagonal and the pairs of the bivariate density estimates on the off-diagonal panels. Note that the bivariate plots also include the average posterior cluster assignment of the observations distinguished by different colours. Of course, we may choose to present results in different ways; for instance the following would produce a plot of the marginal univariate distributions first, and a plot of the pairs of bivariate distributions second.

grid.arrange(p1, p2, p3, p4, layout_matrix = matrix(1:4, 1, 4))
grid.arrange(p12, p13, p14, p21, p23, p24, p31, p32, p34, p41, p42, p43, layout_matrix = matrix(1:12, 3, 4))

Note that for multivariate models we cannot apply the extra options of overlaying densities on histograms, credible bands and so forth. However, basic generic plot arguments like xlab, ylab, col etc. can be changed.

Posterior inference

We already know from the above that the maximum a posteriori (MAP) estimate of the number of clusters is \(\hat{K}=2\). Below we find its posterior probability to be \(\mathbb{P}[K =2~|~\mathbf{Y}] = 0.996\).

cluster.labels = mult.fit2$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts/sum(counts)
## n.cluster
##     2     3 
## 0.996 0.004

So, we will proceed to posterior inference conditional on \(K=2\). Similarly to what we have done in Practical 1, we can extract posterior samples of the cluster means and calculate posterior summaries and/or plot the posterior distributions. The only difference is that we now have \(q = 4\) variables.

names(mult.fit2)
##  [1] "density"    "data"       "grideval"   "grid_x"     "grid_y"    
##  [6] "clust"      "mean"       "beta"       "sigma2"     "probs"     
## [11] "niter"      "nburn"      "tot_time"   "univariate" "regression"
## [16] "dep"        "group_log"  "group"      "wvals"
means = mult.fit2$mean
length(means)          # 1000 posterior samples
## [1] 1000
means[[1]]             # The 1st sample for the 4 variables
##           [,1]      [,2]       [,3]       [,4]
## [1,]  5.044807 3.4904806   1.512368  0.2594124
## [2,]  6.219309 2.8771094   4.864570  1.6685257
## [3,] -3.409511 0.4874954 -11.463251 -2.5813150
means[[1]][,1][1:2]    # The 1st samples of mu1 and mu2 for the 1st variable  
## [1] 5.044807 6.219309

Below we create 4 matrices of dimensionality \(1000\times2\) with initial entries NA in order to store the relevant samples. We then calculate posterior location summaries and variances for the two cluster means \(\mu_1\) and \(\mu_2\) of the first variable (Sepal length), and proceed to produce density plots of their marginal posterior distributions.

post.means1 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
post.means2 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
post.means3 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
post.means4 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)

for(i in 1:(mcmc.total.n-mcmc.burnin)){
  post.means1[i,] = means[[i]][,1][1:2]
  post.means2[i,] = means[[i]][,2][1:2]
  post.means3[i,] = means[[i]][,3][1:2]
  post.means4[i,] = means[[i]][,4][1:2]
}

summary(post.means1)
##        V1              V2       
##  Min.   :4.859   Min.   :6.056  
##  1st Qu.:4.968   1st Qu.:6.217  
##  Median :5.007   Median :6.263  
##  Mean   :5.006   Mean   :6.264  
##  3rd Qu.:5.042   3rd Qu.:6.312  
##  Max.   :5.195   Max.   :6.463
apply(post.means1,2,var)
## [1] 0.002823882 0.004743266
par(mfrow = c(1,2))
plot(density(post.means1[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means1[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Group means for Sepal length', side = 3, line = -2, outer = TRUE, cex = 1.5)

The last command mtext produces a common main title for the two plots; this is slightly tricky as we need to experiment with argument line inside mtext to place the main title appropriately.

Alternatively, we can produce more “fancy” plots with the R package ggplot2. Let's see a demonstration below. First we need to load the package and then we create a data.frame of dimensionality \((2\times1000)\times 2\), with the first column containing a factor variable with cluster indicators and the second column containing the posterior samples for each cluster (we are still working with the 1st variable, Sepal length).

library(ggplot2)
all.means1 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means1[,1],post.means1[,2]))
head(all.means1)
##    cluster  samples
## 1 Cluster1 5.044807
## 2 Cluster1 5.008073
## 3 Cluster1 5.059295
## 4 Cluster1 5.007089
## 5 Cluster1 4.874071
## 6 Cluster1 5.066373

We can then plot the two densities on the same plot as follows (controlling some extra arguments about titles, font sizes etc.):

plot.m1 = ggplot(all.means1, aes(x=samples, color = cluster)) +
  geom_density(size = 1.3) +
  theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
  xlab(expression(mu)) + ylab('Posterior density') + 
  ggtitle('Sepal length') 
plot.m1 

Now let's do the same for the remaining three variables and combine all four plots into one, once again using grid.arrange.

all.means2 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means2[,1],post.means2[,2]))
all.means3 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means3[,1],post.means3[,2]))
all.means4 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means4[,1],post.means4[,2]))

plot.m2 = ggplot(all.means2, aes(x=samples, color = cluster)) +
  geom_density(size = 1.3) +
  theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
  xlab(expression(mu)) + ylab('Posterior density') + 
  ggtitle('Sepal width') 

plot.m3 = ggplot(all.means3, aes(x=samples, color = cluster)) +
  geom_density(size = 1.3) +
  theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
  xlab(expression(mu)) + ylab('Posterior density') + 
  ggtitle('Petal length') 

plot.m4 = ggplot(all.means4, aes(x=samples, color = cluster)) +
  geom_density(size = 1.3) +
  theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
  xlab(expression(mu)) + ylab('Posterior density') + 
  ggtitle('Petal width')

grid.arrange(plot.m1, plot.m2, plot.m3, plot.m4, layout_matrix = matrix(1:4, 2, 2))

The posterior samples of the cluster-specific covariance matrices are also available in the output of BNPmix; they are included in the element sigma2 of mult.fit2. Let's see how these look and store the covariance matrices of the first two clusters in two list objects named Covs1 and Covs2.

covariances = mult.fit2$sigma2
covariances[[1]]              # The covariances from the first MCMC iteration  
## , , 1
## 
##            [,1]        [,2]       [,3]        [,4]
## [1,] 0.17145706 0.117678500 0.02550586 0.011923931
## [2,] 0.11767850 0.137602437 0.01172007 0.007278956
## [3,] 0.02550586 0.011720067 0.04075428 0.011942423
## [4,] 0.01192393 0.007278956 0.01194242 0.012536570
## 
## , , 2
## 
##           [,1]       [,2]      [,3]       [,4]
## [1,] 0.3660926 0.11377478 0.3784312 0.15499073
## [2,] 0.1137748 0.11243277 0.1418176 0.08740352
## [3,] 0.3784312 0.14181755 0.5679394 0.24556245
## [4,] 0.1549907 0.08740352 0.2455624 0.16349126
## 
## , , 3
## 
##            [,1]       [,2]       [,3]       [,4]
## [1,] 0.22527375 0.09720058 0.07532264 0.04596721
## [2,] 0.09720058 0.10131210 0.05659000 0.03946564
## [3,] 0.07532264 0.05659000 0.21359707 0.07664245
## [4,] 0.04596721 0.03946564 0.07664245 0.04072829
Covs1 = list()
Covs2 = list()
for(i in 1:(mcmc.total.n - mcmc.burnin)){
  Covs1[[i]] = covariances[[i]][,,1]
  Covs2[[i]] = covariances[[i]][,,2]
}

There are many things we could do with those. For instance, we could focus on the main diagonal elements (i.e., the variance vectors \(\boldsymbol{\sigma}_k^2=(\sigma_{k1}^2,\ldots,\sigma_{k4}^2)^T\) for \(k\in\{1,2\}\)), store them in variable-specific matrices of dimensionality \(1000\times 2\) and proceed to produce numerical summaries and visualisations, just as we did with the posterior means above; a small sketch is given below.
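
A minimal sketch of that idea for the first variable (Sepal length); post.vars1 is a hypothetical name holding, for each retained iteration, the variance of Sepal length in each of the two clusters.

post.vars1 = matrix(NA, nrow = mcmc.total.n - mcmc.burnin, ncol = 2)
for(i in 1:(mcmc.total.n - mcmc.burnin)){
  post.vars1[i, 1] = Covs1[[i]][1, 1]   # variance of Sepal length in Cluster 1
  post.vars1[i, 2] = Covs2[[i]][1, 1]   # variance of Sepal length in Cluster 2
}
summary(post.vars1)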

Here, we will do something different. Specifically, we will first compute the average (over the posterior samples) covariance matrices of the two clusters, using the R command Reduce (very useful for R lists), and assign appropriate column/row names.

post.mean.cov1 = Reduce('+', Covs1)/(mcmc.total.n - mcmc.burnin)
post.mean.cov2 = Reduce('+', Covs2)/(mcmc.total.n - mcmc.burnin)
colnames(post.mean.cov1) = c('Sepal length', 'Sepal width', 'Petal length', 'Petal width')
colnames(post.mean.cov2) = c('Sepal length', 'Sepal width', 'Petal length', 'Petal width')
rownames(post.mean.cov1) = colnames(post.mean.cov1)
rownames(post.mean.cov2) = colnames(post.mean.cov2)
post.mean.cov1
##              Sepal length Sepal width Petal length Petal width
## Sepal length   0.13842434  0.10401508   0.02324501  0.01200375
## Sepal width    0.10401508  0.15026208   0.01320414  0.01036686
## Petal length   0.02324501  0.01320414   0.04278083  0.01011754
## Petal width    0.01200375  0.01036686   0.01011754  0.01466866
post.mean.cov2
##              Sepal length Sepal width Petal length Petal width
## Sepal length    0.4395680  0.12441492    0.4481754  0.16516684
## Sepal width     0.1244149  0.11529567    0.1411903  0.07958529
## Petal length    0.4481754  0.14119032    0.6741754  0.28467213
## Petal width     0.1651668  0.07958529    0.2846721  0.17866540

They look quite different, which is interesting; however, covariance matrices are difficult to interpret, so it is better to work with the corresponding average posterior correlation matrices, obtained via the command cov2cor:

post.mean.cor1 = cov2cor(post.mean.cov1)
post.mean.cor2 = cov2cor(post.mean.cov2)
post.mean.cor1
##              Sepal length Sepal width Petal length Petal width
## Sepal length    1.0000000   0.7212160    0.3020639   0.2663886
## Sepal width     0.7212160   1.0000000    0.1646876   0.2208143
## Petal length    0.3020639   0.1646876    1.0000000   0.4038824
## Petal width     0.2663886   0.2208143    0.4038824   1.0000000
post.mean.cor2
##              Sepal length Sepal width Petal length Petal width
## Sepal length    1.0000000   0.5526535    0.8232817   0.5893721
## Sepal width     0.5526535   1.0000000    0.5064213   0.5545058
## Petal length    0.8232817   0.5064213    1.0000000   0.8202349
## Petal width     0.5893721   0.5545058    0.8202349   1.0000000

This posterior output is indeed informative. Interestingly, in Cluster 1 the correlation matrix has a “block-type” structure, with stronger correlations within the Sepal and Petal blocks of variables (especially for the former pair). On the other hand, in Cluster 2 all correlations generally increase, and Sepal length is more strongly correlated with Petal length than with Sepal width. We can visualise these average posterior correlation matrices using various R commands from different packages. Below we consider two options.

First, we use R package pheatmap:

#install.packages('pheatmap') 
library(pheatmap)
pheatmap(post.mean.cor1, display_numbers = TRUE, cluster_rows = FALSE, cluster_cols = FALSE, fontsize_number = 15, main = 'Average posterior correlations of means in Cluster 1')

pheatmap(post.mean.cor2, display_numbers = TRUE, cluster_rows = FALSE, cluster_cols = FALSE, fontsize_number = 15, main = 'Average posterior correlations of means in Cluster 2')

Plots from pheatmap are generally not compatible with base R graphics configurations such as par(mfrow = c(,)) or split.screen().

Second, we can use R package corrplot, for instance:

#install.packages('corrplot')
library(corrplot)
corrplot.mixed(post.mean.cor1, lower = "number", upper = "ellipse")
mtext('Average posterior correlations of means in Cluster 1', side = 2, line = -4, outer = TRUE)

corrplot.mixed(post.mean.cor2, lower = "number", upper = "ellipse")
mtext('Average posterior correlations of means in Cluster 2', side = 2, line = -4, outer = TRUE)

Plot functions from corrplot are generally compatible with base graphics configurations such as par(mfrow = c(,)) or split.screen(), but the result is not always optimal! Also, it is better to place a main title via the extra command mtext, as done above (instead of using the main argument inside corrplot.mixed).

Informative vignettes are available online for both pheatmap and corrplot. Of course, there is also the standard R heatmap function (type ?heatmap), as well as the options offered by ggplot2 and ggcorrplot. So, there are many available options!

Finally, we can have a look at the posterior summaries and densities of \(\pi_1\) and \(\pi_2\). This is quite easy and exactly the same as we have already seen in Practical 1:

pi = mult.fit2$probs
post.pi = matrix(NA, nrow = mcmc.total.n - mcmc.burnin, ncol = 2)
for(i in 1:(mcmc.total.n - mcmc.burnin)){
  post.pi[i,1] = pi[[i]][1]
  post.pi[i,2] = pi[[i]][2]
}
summary(post.pi)
##        V1               V2        
##  Min.   :0.2263   Min.   :0.5105  
##  1st Qu.:0.3062   1st Qu.:0.6398  
##  Median :0.3307   Median :0.6652  
##  Mean   :0.3322   Mean   :0.6641  
##  3rd Qu.:0.3574   3rd Qu.:0.6902  
##  Max.   :0.4879   Max.   :0.7736
par(mfrow = c(1,2))
plot(density(post.pi[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(pi[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.pi[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(pi[2]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('2-component mixture weights', side = 3, line = -2, outer = TRUE, cex = 1.5)

6 Last considerations and conclusions

So far, based on all the results from Practical 1 and from this session, we have essentially established the existence of two clusters in the Iris dataset, and we have carried out inference on density estimation, clustering and the posterior distributions of the parameters under both univariate and multivariate DPM models.

We could go a bit further and try to find the underlying reason for the apparent existence of two clusters, in an attempt to explain the results. First, we know that in this dataset the measurements come from three different species of Iris flower (setosa, versicolor, virginica). So, it would be useful to first see whether different clusters are apparent within each species. The code and results below are for the setosa species (note that we now use method = 'MAR' in the mcmc list, as the ICS sampler tends to fail here).

y.setosa = y[iris[,5] == 'setosa', ]
n.setosa = nrow(y.setosa)
n.setosa
## [1] 50
grid1.setosa = seq(min(y.setosa[,1]), max(y.setosa[,1]), length.out = grid.length)
grid2.setosa = seq(min(y.setosa[,2]), max(y.setosa[,2]), length.out = grid.length)
grid3.setosa = seq(min(y.setosa[,3]), max(y.setosa[,3]), length.out = grid.length)
grid4.setosa = seq(min(y.setosa[,4]), max(y.setosa[,4]), length.out = grid.length)
grid.setosa = expand.grid(grid1.setosa, grid2.setosa, grid3.setosa, grid4.setosa)

DPprior = PYcalibrate(Ek = 4, n = n.setosa, discount = 0)
DPprior
## $strength
## [1] 0.830749
## 
## $discount
## [1] 0
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'MAR', print_message = FALSE)
prior = list(strength =  DPprior$strength, discount = 0, hyper = FALSE, m0 = rep(0,4), Sigma0 = diag(1000,4))
output = list(grid = grid.setosa, out_param = TRUE)
my.set.seed(1)
mult.fit1.setosa = PYdensity(y = y.setosa, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit1.setosa)
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 30.49 seconds
##  Average number of groups:  1.002 
##  Min number of groups:  1 ; Max number of groups:  2
cluster.labels = mult.fit1.setosa$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
##   1   2 
## 998   2
p1 = plot(mult.fit1.setosa, dim = c(1, 1), xlab = "Sepal length", ylab = "Density", col = 2)
p2 = plot(mult.fit1.setosa, dim = c(2, 2), xlab = "Sepal width", ylab = "Density", col = 3)
p3 = plot(mult.fit1.setosa, dim = c(3, 3), xlab = "Petal length", ylab = "Density", col = 4)
p4 = plot(mult.fit1.setosa, dim = c(4, 4), xlab = "Petal width", ylab = "Density", col = 5)

p12 = plot(mult.fit1.setosa, dim = c(1, 2), show_clust = TRUE, xlab = "Sepal length", ylab = "Sepal width", col = 2)
p13 = plot(mult.fit1.setosa, dim = c(1, 3), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal length", col = 2)
p14 = plot(mult.fit1.setosa, dim = c(1, 4), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal width", col = 2)

p21 = plot(mult.fit1.setosa, dim = c(2, 1), show_clust = TRUE, xlab = "Sepal width", ylab = "Sepal length", col = 3)
p23 = plot(mult.fit1.setosa, dim = c(2, 3), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal length", col = 3)
p24 = plot(mult.fit1.setosa, dim = c(2, 4), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal width", col = 3)

p31 = plot(mult.fit1.setosa, dim = c(3, 1), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal length", col = 4)
p32 = plot(mult.fit1.setosa, dim = c(3, 2), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal width", col = 4)
p34 = plot(mult.fit1.setosa, dim = c(3, 4), show_clust = TRUE, xlab = "Petal length", ylab = "Petal width", col = 4)

p41 = plot(mult.fit1.setosa, dim = c(4, 1), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal length", col = 5)
p42 = plot(mult.fit1.setosa, dim = c(4, 2), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal width", col = 5)
p43 = plot(mult.fit1.setosa, dim = c(4, 3), show_clust = TRUE, xlab = "Petal width", ylab = "Petal length", col = 5)

grid.arrange(p1, p12, p13, p14, p21, p2, p23, p24, p31, p32, p3, p34, p41, p42, p43, p4, layout_matrix = matrix(1:16, 4, 4))

Based on the above results, it seems that there are no sub-clusters within the setosa species. If we perform the same analysis for versicolor and virginica (left as an exercise), the results look much the same; thus, there is no indication of separate clusters within species.

So, whatever is driving the formation of the two clusters must be a trait occurring across species. Let's get back to the data and plot the sample means per species. The plot looks as follows.

y.versicolor = y[iris[,5] == 'versicolor', ]
y.virginica = y[iris[,5] == 'virginica', ]

par(mfrow = c(1,3))
plot(as.table(apply(y.setosa,2, mean)), ylim = c(0,7), col = c(2:5), lwd = 5, ylab = 'Centimeters', main = 'Setosa')
plot(as.table(apply(y.versicolor,2, mean)), ylim = c(0,7), col = c(2:5), lwd = 5, ylab = 'Centimeters', main = 'Versicolor')
plot(as.table(apply(y.virginica,2, mean)), ylim = c(0,7), col = c(2:5), lwd = 5, ylab = 'Centimeters', main = 'Virginica')

Interestingly, we observe that the mean profile of setosa differs clearly from those of versicolor and virginica, mainly because of its much smaller petal length and petal width, whereas the mean profiles of versicolor and virginica are relatively similar to each other.

So, it seems that what the multivariate DPM model is identifying is one cluster primarily including the data points from the setosa species and another cluster formed by the versicolor and virginica species. This also explains why, in the correlation matrix of the second cluster presented above, the average posterior correlation between Sepal length and Petal length is so high (equal to 0.82). In fact, based on the above results for the probabilities \(\pi_1\) and \(\pi_2\), we see that the posterior means are \(\hat{\pi}_1\approx 0.33\) and \(\hat{\pi}_2\approx 0.66\), which closely match the proportions of the sample sizes \(n_{\mathrm{setosa}}/n = 50/150\) and \((n_{\mathrm{versicolor}}+n_{\mathrm{virginica}})/n = (50+50)/150\).

apply(post.pi, 2, mean)
## [1] 0.3321509 0.6641143
nrow(y.setosa)/n
## [1] 0.3333333
(nrow(y.versicolor)+nrow(y.virginica))/n
## [1] 0.6666667