We will continue working with the iris dataset, using the package BNPmix. So, first let's load the data and the library again.
data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150 5
head(iris, n=5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
y = iris[, 1:4]
n = nrow(y)
library(BNPmix)
General note about random seeds: apparently, even when using set.seed() the results can be different for different operating systems (e.g., Linux, Windows, Mac). Based on some (minimal) research on this, it is recommended to use the following custom function with fixed default options in set.seed():
my.set.seed = function(n) set.seed(n, kind = "Mersenne-Twister", normal.kind = "Inversion")
We have seen in the first practical class that the package exploits
the stick-breaking representation for fitting DPM models. Specifically,
we have
\[\begin{equation}
p(y_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2),
\end{equation}\] with prior distributions for \(\mu_k\) and \(\sigma_k^2\) based on a conditional
conjugate normal/inverse-gamma centering measure, such that
\[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~
\sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and }
\sigma_k^2\sim G_0(\sigma_k^2) \equiv
\mathrm{InvGamma}(a_0,b_0),\] where \(a_0\) and \(b_0\) are the shape and scale
parameters of the inverse-gamma distribution, respectively. We have also
seen that the default options in BNPmix are: \[m_0 = \bar{y}= \frac{1}{n}\sum_iy_i, ~k_0 =1,
~a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] As
discussed, this is an empirical Bayes approach, which can work well in practice. Of course, as we have demonstrated, the above can be customised by changing the arguments m0, k0, a0 and b0 in the prior list.
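For illustration only, a hypothetical custom specification of the centering measure could look like the following (the numerical values are arbitrary and serve only to show the syntax); such a list would then be passed to PYdensity through its prior argument, as we do below.
# hypothetical custom values for the univariate DPM centering measure (arbitrary choices)
prior.custom = list(strength = 1, discount = 0, hyper = FALSE,
                    m0 = 4, k0 = 0.5, a0 = 2, b0 = 1)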
Now in hierarchical DPMs the above parameters are random, so we have a further layer of hierarchy. The hyper-priors used by BNPmix are
\[ m_0 \sim \mathrm{N}(m_1,\sigma_1^2), ~ k_0 \sim \mathrm{Gamma}(\tau_1,\zeta_1), ~ b_0 \sim \mathrm{Gamma}(a_1,b_1), \] where \(\tau_1\) and \(a_1\) are the shape parameters and \(\zeta_1\) and \(b_1\) the corresponding rate parameters of the Gamma distributions. Note that BNPmix does not allow placing a hyper-prior on \(a_0\) (the reason for this is unclear). Apart from that, the hierarchical model formulation is the same as the one presented in lectures, albeit with a different notation for the parameters.
Under the hierarchical DPM it is the hyper-parameters \(m_1, ~ \sigma_1^2, ~ \tau_1, ~ \zeta_1, ~ a_1\) and \(b_1\) that are fixed. The default options in BNPmix are \[m_1 = \bar{y}, ~ \sigma_1^2 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2, ~\tau_1=\zeta_1 =1, ~a_1 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2, ~b_1 = 1. \] As we can see, in this case the empirical Bayes approach is transferred to some of the hyper-parameters. Once again the model can be customised, as the above hyper-parameters can be changed within the prior list. It will be useful to have another look at this list and all of its relevant arguments:
prior: a list with arguments for prior specification:
- discount: the discount parameter (here it must be set to 0 for a DP).
- strength: the strength parameter of the PY process, or the corresponding precision of the DP.
- hyper: provides control for choosing a DPM or a hierarchical DPM.
- m0: the prior mean of each \(\mu_k\) (DPM).
- k0: the scalar \(k_0>0\) controlling the variance of each \(\mu_k\) (DPM).
- a0: the shape \(a_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\) (DPM and HDPM).
- b0: the scale \(b_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\) (DPM).
- m1: the hyper-prior mean of \(m_0\) (HDPM).
- s21: the hyper-prior variance of \(m_0\) (HDPM).
- tau1: the shape \(\tau_1>0\) of the gamma distribution on \(k_0\) (HDPM).
- zeta1: the rate \(\zeta_1>0\) of the gamma distribution on \(k_0\) (HDPM).
- a1: the shape \(a_1>0\) of the gamma distribution on \(b_0\) (HDPM).
- b1: the rate \(b_1>0\) of the gamma distribution on \(b_0\) (HDPM).
We will continue working with the variable Petal Length from this dataset. The code is presented below. We set Ek = 2 (our prior expectation for the number of clusters) in PYcalibrate and use the automatic procedure of BNPmix to set the strength of the PY process (the precision of the DP); this is done by setting strength = DPprior$strength in the fourth line of code. In order to fit the HDPM the only thing we need to do is to set hyper = TRUE in the same line (this is actually the default option, so it is not strictly required, but it is better to always state this argument explicitly).
DPprior = PYcalibrate(Ek = 2, n = n, discount = 0)
DPprior
## $strength
## [1] 0.1882315
##
## $discount
## [1] 0
mcmc = list(niter = 11000, nburn = 1000)
prior = list(strength = DPprior$strength, discount = 0, hyper = TRUE)
output = list(grid = seq(min(y[,3]), max(y[,3]), length.out = 100), out_param = TRUE)
my.set.seed(1)
fit4 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output)
## Completed: 1100/11000 - in 0.207 sec
## Completed: 2200/11000 - in 0.407 sec
## Completed: 3300/11000 - in 0.626 sec
## Completed: 4400/11000 - in 0.832 sec
## Completed: 5500/11000 - in 1.035 sec
## Completed: 6600/11000 - in 1.25 sec
## Completed: 7700/11000 - in 1.46 sec
## Completed: 8800/11000 - in 1.661 sec
## Completed: 9900/11000 - in 1.868 sec
## Completed: 11000/11000 - in 2.072 sec
##
## Estimation done in 2.072 seconds
summary(fit4)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 2.07 seconds
## Average number of groups: 2.2217
## Min number of groups: 2 ; Max number of groups: 5
In this case, the HDPM in general visits fewer models (in comparison to the DPM used in Practical 1), as the minimum number of clusters is 2 and the maximum is 5 (with the average being around 2.2). Below we see the posterior mean and the 95% credible bands of the density, together with the average posterior cluster assignments of the observations overlaid on the histogram.
plot(fit4, band = TRUE, show_hist = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density')
Assume now that we want to avoid using empirical Bayes and prefer to define our model based on the following hyper-priors \[ m_0 \sim \mathrm{N}(0,1000), ~ b_0 \sim \mathrm{Gamma}(0.001,0.001),\] keeping the rest fixed at \(k_0 = 1\) and \(a_0 = 3\). Above we have a diffuse non-informative hyper-prior for \(m_0\) (with large variance) and a non-informative hyper-prior for \(b_0\), since for the given values the mean (\(\mathbb{E}[b_0]=a_1/b_1\)) is 1 and the variance (\(\mathbb{V}\mathrm{ar}[b_0]=a_1/b_1^2\)) is 1000. Note that the “non-informativeness” is transferred to the inverse-gamma prior of \(\sigma_k^2\), since \[\mathbb{E}[\sigma_k^2] = \frac{b_0}{a_0-1} = \frac{b_0}{2} ~ \mbox{and} ~ \mathbb{V}\mathrm{ar}[\sigma_k^2] = \frac{b_0^2}{(a_0-1)^2(a_0-2)} = \frac{b_0^2}{4},\] where \(b_0\) can take very large values. The specification in BNPmix is as follows:
prior = list(strength = DPprior$strength, discount = 0, hyper = TRUE, m1 = 0, s21 = 1000, k0 = 1, a0 = 3, a1 = 0.001, b1 = 0.001)
mcmc = list(niter = 11000, nburn = 1000, print_message = FALSE)
my.set.seed(1)
fit5 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output)
The print_message = FALSE argument in mcmc just suppresses the MCMC progress messages from being printed out. As usual, we can look at the summary and the posterior density plot.
summary(fit5)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 2.18 seconds
## Average number of groups: 2.3001
## Min number of groups: 2 ; Max number of groups: 6
plot(fit5, band = TRUE, show_hist = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density', col = 7)
Finally, as demonstrated in the 1st practical class, we can obtain the MAP estimate of the number of clusters and also examine the posterior distributions of all relevant parameters. For instance, concerning the former, we have the following result:
cluster.labels = fit5$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
## 2 3 4 5 6
## 7407 2224 334 31 4
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters', ylab = 'Distribution of clusters')
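If we want to extract the MAP estimate of the number of clusters programmatically (rather than reading it off the table), a one-liner such as the following does the job:
# MAP estimate of K: the most frequent number of clusters across iterations
as.numeric(names(counts)[which.max(counts)])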
We have seen the setting of multivariate DPMs and HDPMs in lectures; now let's see how these are treated in BNPmix and the corresponding notation used. In this setting, we will assume that we have a data
matrix \(\mathbf{Y}\) of dimensionality
\(n\times q\). The stick-breaking
representation is
\[\begin{equation}
p(\mathbf{y}_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}_q(\mathbf{y}_i ~|~
\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\end{equation}\] for \(i\in
\{1,\ldots,n\}\) and \(\mathbf{y}_i\in
\mathbb{R}^q\), with prior distributions for \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}_k\) based on the
multivariate conditional conjugate
normal/inverse-Wishart centering measure. Specifically,
\[\boldsymbol{\mu}_k ~|~
\boldsymbol{\Sigma}_k\sim G_0(\boldsymbol{\mu}_k ~|~
\boldsymbol{\Sigma}_k)\equiv \mathrm{N}(\mathbf{m}_0,
\boldsymbol{\Sigma}_k/k_0) \mbox{ and }
\boldsymbol{\Sigma}_k\sim G_0(\boldsymbol{\Sigma}_k) \equiv
\mathrm{InvWishart}(\boldsymbol{\Sigma_0},\nu_0),\] where \(\boldsymbol{\Sigma_0}\) and \(\nu_0\) are the respective scale matrix and
degrees-of-freedom parameter of the inverse-Wishart distribution. The
default options in BNPmix are: \[\mathbf{m}_0
= \bar{\mathbf{y}}= (\bar{y}_1,\ldots,\bar{y}_q)^T, ~k_0 =1, ~\nu_0
=q+2, ~\boldsymbol{\Sigma}_0 =
\hat{\boldsymbol{\Sigma}}_\mathbf{y},\] where \(\hat{\boldsymbol{\Sigma}}_\mathbf{y}\) is
the sample variance-covariance estimate of \(\mathbf{y}\). Customisation, as we will see, is again possible, but first let's look at the corresponding hierarchical setting.
Similarly to the univariate case, here we can also have deeper layers of hierarchy. The hyper-priors used by BNPmix for multivariate HDPMs are as follows
\[ \mathbf{m}_0 \sim \mathrm{N}_q(\mathbf{m}_1, \mathbf{S}_1), ~ k_0 \sim \mathrm{Gamma}(\tau_1,\zeta_1), ~ \boldsymbol{\Sigma}_0 \sim \mathrm{InvWishart}(\boldsymbol{\Sigma}_1,\nu_1).\] As with parameter \(a_0\) in the univariate case, BNPmix does not allow a hyper-prior for the parameter \(\nu_0\) in the multivariate case. The default BNPmix options for the hyper-parameters \(\mathbf{m}_1, ~ \mathbf{S}_1, ~ \tau_1, ~ \zeta_1, ~ \nu_1\) and \(\boldsymbol{\Sigma}_1\) are: \[\mathbf{m}_1 = \bar{\mathbf{y}}= (\bar{y}_1,\ldots,\bar{y}_q)^T, ~ \mathbf{S}_1 = \hat{\boldsymbol{\Sigma}}_\mathbf{y}, ~ \tau_1=\zeta_1 =1, ~\nu_1 = q+2, ~ \boldsymbol{\Sigma}_1 = (q+2)^{-1}\hat{\boldsymbol{\Sigma}}_\mathbf{y}. \]
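For illustration, a hypothetical custom multivariate HDPM specification (with arbitrary values, using the hyper-parameter argument names described in the prior list below, and assuming \(q = 4\)) might look like:
# hypothetical hyper-prior values for a multivariate HDPM with q = 4 variables (arbitrary choices)
prior.mv.hdpm = list(strength = 1, discount = 0, hyper = TRUE,
                     m1 = rep(0, 4), S1 = diag(100, 4),   # hyper-prior on m0
                     tau1 = 1, zeta1 = 1,                 # hyper-prior on k0
                     n1 = 6, Sigma1 = diag(1, 4))         # hyper-prior on Sigma0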
Now let us have a look at the prior list and the relevant arguments for the multivariate case:
prior: a list with arguments for prior specification:
- discount: the discount parameter (here it must be set to 0 for a DP).
- strength: the strength parameter of the PY process, or the corresponding precision of the DP.
- hyper: provides control for choosing a DPM or a hierarchical DPM.
- m0: the prior mean vector of each \(\boldsymbol{\mu}_k\) (DPM).
- k0: the scalar \(k_0>0\) controlling the variance-covariance of each \(\boldsymbol{\mu}_k\) (DPM).
- n0: the degrees-of-freedom parameter \(\nu_0>q-1\) of the inverse-Wishart distribution of each \(\boldsymbol{\Sigma}_k\) (DPM and HDPM).
- Sigma0: the scale matrix \(\boldsymbol{\Sigma}_0\) of the inverse-Wishart distribution of each \(\boldsymbol{\Sigma}_k\) (DPM).
- m1: the hyper-prior mean vector of \(\mathbf{m}_0\) (HDPM).
- S1: the hyper-prior variance-covariance of \(\mathbf{m}_0\) (HDPM).
- tau1: the shape \(\tau_1>0\) of the gamma distribution on \(k_0\) (HDPM).
- zeta1: the rate \(\zeta_1>0\) of the gamma distribution on \(k_0\) (HDPM).
- n1: the degrees-of-freedom parameter \(\nu_1>q-1\) of the inverse-Wishart distribution on \(\boldsymbol{\Sigma}_0\) (HDPM).
- Sigma1: the scale matrix \(\boldsymbol{\Sigma}_1\) of the inverse-Wishart distribution on \(\boldsymbol{\Sigma}_0\) (HDPM).
Now we will consider all four variables (Sepal Length, Sepal Width, Petal Length, Petal Width) in the Iris dataset for our Bayesian nonparametric analysis. In general, multivariate data analysis carries a much higher computational cost, so in practice we must reduce the “demands” of the input to BNPmix (at least for initial exploratory purposes).
Important points:
(1) The length of the grid support for evaluating the posterior multivariate density should be reduced drastically. Below we use a grid of length equal to 20 for each column of y:
grid.length = 20
grid1 = seq(min(y[,1]), max(y[,1]), length.out = grid.length)
grid2 = seq(min(y[,2]), max(y[,2]), length.out = grid.length)
grid3 = seq(min(y[,3]), max(y[,3]), length.out = grid.length)
grid4 = seq(min(y[,4]), max(y[,4]), length.out = grid.length)
Then, we must create a data frame from all possible combinations of the vectors grid1, grid2, grid3 and grid4 via the command expand.grid (type ?expand.grid for more information):
grid = expand.grid(grid1, grid2, grid3, grid4)
dim(grid)
## [1] 160000 4
20^4 # this checks!
## [1] 160000
(2) Another realistic “sacrifice” we need to make for practical purposes is to reduce drastically the total number of MCMC iterations (again, at least for initial exploratory purposes). Below we set the total MCMC sample size equal to 2000 and the burn-in sample size equal to 1000.
mcmc.total.n = 2000
mcmc.burnin = 1000
With all of that being said, let's fit a DPM to the entire Iris dataset! Initially, we set the prior expected number of clusters equal to 2 (based on the previous analyses). Below we make it explicit that the Gibbs sampler used is the importance conditional sampler, through the argument method = 'ICS' in mcmc (although this is the default option).
DPprior = PYcalibrate(Ek = 2, n = n, discount = 0)
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'ICS', print_message = FALSE)
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
output = list(grid = grid, out_param = TRUE)
my.set.seed(1)
mult.fit1 = PYdensity(y = y, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit1)
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 46.76 seconds
## Average number of groups: 1.001
## Min number of groups: 1 ; Max number of groups: 2
The result above is not very promising, given our previous findings. Under the current configuration only models with \(K=1\) and \(K=2\) clusters are visited, with the average being \(\hat{K}\approx 1\)! This indicates that it would be better to increase the strength (the DP's precision); currently, we are using \(\alpha = 0.1882315\) as shown below, which is probably too small.
DPprior
## $strength
## [1] 0.1882315
##
## $discount
## [1] 0
There are two ways of increasing the strength: (i) manually override it and set, for example, strength = 0.5 in the prior list, or (ii) increase the prior expected number of clusters controlled by Ek. A sketch of the first option is given below; we then have a go at the second option and set Ek equal to 4 instead of 2.
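For reference, option (i) would amount to something like the following (a hypothetical choice, not run here; the value 0.5 is arbitrary):
prior.manual = list(strength = 0.5, discount = 0, hyper = FALSE)  # manual strength override
# mult.fit.manual = PYdensity(y = y, mcmc = mcmc, prior = prior.manual, output = output)
We now proceed with the second option.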
DPprior = PYcalibrate(Ek = 4, n = n, discount = 0)
DPprior
## $strength
## [1] 0.6157496
##
## $discount
## [1] 0
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
my.set.seed(1)
mult.fit1 = PYdensity(y = y, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit1)
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 76.54 seconds
## Average number of groups: 2.004
## Min number of groups: 2 ; Max number of groups: 3
cluster.labels = mult.fit1$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
## 2 3
## 996 4
This looks more reasonable (based on what we already know), with the MAP estimate of \(K\) being equal to 2. We could proceed to model and parameter inference with posterior visualisations based on this model. However, let's first investigate whether we can abandon the data-dependent prior assumptions without affecting inference about the number of clusters.
Specifically, let's consider the model with \[\mathbf{m}_0 = (0,\ldots,0)^T, ~k_0 =1, ~\nu_0 =q+2, ~\boldsymbol{\Sigma}_0 = \mathrm{diag}(1000,\ldots, 1000),\] which is clearly not data-dependent. The change we have to make in order to run this model is simple:
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE, m0 = rep(0,4), Sigma0 = diag(1000,4))
my.set.seed(1)
mult.fit2 = PYdensity(y = y, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit2)
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 76.01 seconds
## Average number of groups: 2.004
## Min number of groups: 2 ; Max number of groups: 3
We will not delve into much detail about applying multivariate HDPMs in this application. It suffices to note that a default HDPM model can be fit by simply using prior = list(strength = DPprior$strength, discount = 0, hyper = TRUE). If you run this model, you will see that the results from the default HDPM model are essentially the same as those from the default DPM model (which is simpler). In addition, creating custom non-informative multivariate HDPMs is more complicated due to the added layers of hierarchy.
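For completeness, a sketch of such a default multivariate HDPM fit would be along the following lines (not run here; the object names are ours):
prior.hdpm = list(strength = DPprior$strength, discount = 0, hyper = TRUE)
# my.set.seed(1)
# mult.fit.hdpm = PYdensity(y = y, mcmc = mcmc, prior = prior.hdpm, output = output)
# summary(mult.fit.hdpm)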
Now let's use the output from the customised fully Bayesian model stored in mult.fit2 to explore and visualise the various inferences that can be drawn from the model.
Density estimation and clustering
The Iris dataset contains four variables; therefore, we can potentially plot four marginal univariate densities and twelve bivariate densities. For instance, if we want to plot the marginal density estimate of Sepal length and the bivariate density estimate of (Sepal length, Sepal width) we can use the following code:
p1 = plot(mult.fit2, dim = c(1, 1), xlab = "Sepal length", ylab = "Density")
p12 = plot(mult.fit2, dim = c(1, 2), show_clust = TRUE, xlab = "Sepal length", ylab = "Sepal width")
p1
p12
When we are dealing with many variables it is sometimes more convenient to combine the various plots and present them in a single figure. First, let's define all the plots that we want.
p1 = plot(mult.fit2, dim = c(1, 1), xlab = "Sepal length", ylab = "Density", col = 2)
p2 = plot(mult.fit2, dim = c(2, 2), xlab = "Sepal width", ylab = "Density", col = 3)
p3 = plot(mult.fit2, dim = c(3, 3), xlab = "Petal length", ylab = "Density", col = 4)
p4 = plot(mult.fit2, dim = c(4, 4), xlab = "Petal width", ylab = "Density", col = 5)
p12 = plot(mult.fit2, dim = c(1, 2), show_clust = TRUE, xlab = "Sepal length", ylab = "Sepal width", col = 2)
p13 = plot(mult.fit2, dim = c(1, 3), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal length", col = 2)
p14 = plot(mult.fit2, dim = c(1, 4), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal width", col = 2)
p21 = plot(mult.fit2, dim = c(2, 1), show_clust = TRUE, xlab = "Sepal width", ylab = "Sepal length", col = 3)
p23 = plot(mult.fit2, dim = c(2, 3), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal length", col = 3)
p24 = plot(mult.fit2, dim = c(2, 4), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal width", col = 3)
p31 = plot(mult.fit2, dim = c(3, 1), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal length", col = 4)
p32 = plot(mult.fit2, dim = c(3, 2), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal width", col = 4)
p34 = plot(mult.fit2, dim = c(3, 4), show_clust = TRUE, xlab = "Petal length", ylab = "Petal width", col = 4)
p41 = plot(mult.fit2, dim = c(4, 1), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal length", col = 5)
p42 = plot(mult.fit2, dim = c(4, 2), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal width", col = 5)
p43 = plot(mult.fit2, dim = c(4, 3), show_clust = TRUE, xlab = "Petal width", ylab = "Petal length", col = 5)
Now we can combine them using the command grid.arrange from the R package gridExtra (which we may need to install first and then load).
#install.packages('gridExtra')
library(gridExtra)
grid.arrange(p1, p12, p13, p14, p21, p2, p23, p24, p31, p32, p3, p34, p41, p42, p43, p4, layout_matrix = matrix(1:16, 4, 4))
The above code created a \(4\times4\) grid on the graphical device and placed the marginal density estimates on the main diagonal and the pairs of the bivariate density estimates on the off-diagonal panels. Note that the bivariate plots also include the average posterior cluster assignment of the observations distinguished by different colours. Of course, we may choose to present results in different ways; for instance the following would produce a plot of the marginal univariate distributions first, and a plot of the pairs of bivariate distributions second.
grid.arrange(p1, p2, p3, p4, layout_matrix = matrix(1:4, 1, 4))
grid.arrange(p12, p13, p14, p21, p23, p24, p31, p32, p34, p41, p42, p43, layout_matrix = matrix(1:12, 3, 4))
Note that for multivariate models we cannot apply the extra options of overlaying densities on histograms, credible bands and so forth. However, basic generic plot arguments like xlab, ylab, col etc. can be changed.
Posterior inference
We already know from the analysis above that the most likely a-posteriori estimate of the number of clusters is \(\hat{K}=2\). Below we find its posterior probability to be \(\mathbb{P}[K =2~|~\mathbf{Y}] = 0.996\).
cluster.labels = mult.fit2$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts/sum(counts)
## n.cluster
## 2 3
## 0.996 0.004
So, we will proceed to posterior inference conditional on \(K=2\). Similarly to what we have done in Practical 1, we can extract posterior samples of the cluster means and calculate posterior summaries and/or plot the posterior distributions. The only difference is that we now have \(q = 4\) variables.
names(mult.fit2)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
means = mult.fit2$mean
length(means) # 1000 posterior samples
## [1] 1000
means[[1]] # The 1st sample for the 4 variables
## [,1] [,2] [,3] [,4]
## [1,] 5.044807 3.4904806 1.512368 0.2594124
## [2,] 6.219309 2.8771094 4.864570 1.6685257
## [3,] -3.409511 0.4874954 -11.463251 -2.5813150
means[[1]][,1][1:2] # The 1st samples of mu1 and mu2 for the 1st variable
## [1] 5.044807 6.219309
Below we create 4 matrices of dimensionality \(1000\times2\) with initial entries NA in order to store the samples that we want in these matrices. We then calculate location posterior summaries and the variances for the cluster means \(\boldsymbol{\mu}_1=(\mu_{11},\mu_{12})^T\) of the first variable (Sepal Length) and proceed to produce density plots of their marginal posterior distributions.
post.means1 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
post.means2 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
post.means3 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
post.means4 = matrix(NA, nrow = mcmc.total.n-mcmc.burnin, ncol = 2)
for(i in 1:(mcmc.total.n-mcmc.burnin)){
post.means1[i,] = means[[i]][,1][1:2]
post.means2[i,] = means[[i]][,2][1:2]
post.means3[i,] = means[[i]][,3][1:2]
post.means4[i,] = means[[i]][,4][1:2]
}
summary(post.means1)
## V1 V2
## Min. :4.859 Min. :6.056
## 1st Qu.:4.968 1st Qu.:6.217
## Median :5.007 Median :6.263
## Mean :5.006 Mean :6.264
## 3rd Qu.:5.042 3rd Qu.:6.312
## Max. :5.195 Max. :6.463
apply(post.means1,2,var)
## [1] 0.002823882 0.004743266
par(mfrow = c(1,2))
plot(density(post.means1[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means1[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Group means for Sepal length', side = 3, line = -2, outer = TRUE, cex = 1.5)
The last command, mtext, produces a common main title for the two plots; this is slightly tricky as we need to experiment with the argument line inside mtext to place the main title appropriately.
Alternatively, we can produce fancier plots with the R package ggplot2. Let's see a demonstration below. First we need to load the package, and then we create a data.frame of dimensionality \((2\times1000)\times 2\), with the first column containing a factor variable with cluster indicators and the second column containing the posterior samples for each cluster (we are still working with the 1st variable, Sepal length).
library(ggplot2)
all.means1 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means1[,1],post.means1[,2]))
head(all.means1)
## cluster samples
## 1 Cluster1 5.044807
## 2 Cluster1 5.008073
## 3 Cluster1 5.059295
## 4 Cluster1 5.007089
## 5 Cluster1 4.874071
## 6 Cluster1 5.066373
We can then plot the two densities on the same plot as follows (controlling some extra arguments about titles, font sizes etc.):
plot.m1 = ggplot(all.means1, aes(x=samples, color = cluster)) +
geom_density(size = 1.3) +
theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
xlab(expression(mu)) + ylab('Posterior density') +
ggtitle('Sepal length')
plot.m1
Now let's do the same for the remaining 3 variables and combine all four plots in one figure, using once again grid.arrange.
all.means2 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means2[,1],post.means2[,2]))
all.means3 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means3[,1],post.means3[,2]))
all.means4 = data.frame(cluster = factor(rep(c('Cluster1','Cluster2'), each = mcmc.total.n-mcmc.burnin)), samples = c(post.means4[,1],post.means4[,2]))
plot.m2 = ggplot(all.means2, aes(x=samples, color = cluster)) +
geom_density(size = 1.3) +
theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
xlab(expression(mu)) + ylab('Posterior density') +
ggtitle('Sepal width')
plot.m3 = ggplot(all.means3, aes(x=samples, color = cluster)) +
geom_density(size = 1.3) +
theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
xlab(expression(mu)) + ylab('Posterior density') +
ggtitle('Petal length')
plot.m4 = ggplot(all.means4, aes(x=samples, color = cluster)) +
geom_density(size = 1.3) +
theme(plot.title = element_text(size=18), axis.title=element_text(size=15), axis.text=element_text(size=10)) +
xlab(expression(mu)) + ylab('Posterior density') +
ggtitle('Petal width')
grid.arrange(plot.m1, plot.m2, plot.m3, plot.m4, layout_matrix = matrix(1:4, 2, 2))
The posterior samples of the cluster-specific covariance matrices are also available in the output of BNPmix. These are included in the mult.fit2 element sigma2. Let's see how these look and store the covariance matrices of the first two clusters in two list objects named Covs1 and Covs2.
covariances = mult.fit2$sigma2
covariances[[1]] # The covariances from the first MCMC iteration
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 0.17145706 0.117678500 0.02550586 0.011923931
## [2,] 0.11767850 0.137602437 0.01172007 0.007278956
## [3,] 0.02550586 0.011720067 0.04075428 0.011942423
## [4,] 0.01192393 0.007278956 0.01194242 0.012536570
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 0.3660926 0.11377478 0.3784312 0.15499073
## [2,] 0.1137748 0.11243277 0.1418176 0.08740352
## [3,] 0.3784312 0.14181755 0.5679394 0.24556245
## [4,] 0.1549907 0.08740352 0.2455624 0.16349126
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 0.22527375 0.09720058 0.07532264 0.04596721
## [2,] 0.09720058 0.10131210 0.05659000 0.03946564
## [3,] 0.07532264 0.05659000 0.21359707 0.07664245
## [4,] 0.04596721 0.03946564 0.07664245 0.04072829
Covs1 = list()
Covs2 = list()
for(i in 1:(mcmc.total.n - mcmc.burnin)){
Covs1[[i]] = covariances[[i]][,,1]
Covs2[[i]] = covariances[[i]][,,2]
}
There are many things we could do with these. For instance, we could focus on the main diagonal elements (i.e., the variance vectors \(\boldsymbol{\sigma}_k^2=(\sigma_{k1}^2,\ldots,\sigma_{k4}^2)^T\) for \(k\in\{1,2\}\)), store them in variable-specific matrices of dimensionality \(1000\times 2\) (one column per cluster) and proceed to produce numerical summaries and visualisations, just as we did with the posterior means above.
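A minimal sketch of this idea for the first variable (Sepal length), using object names of our own choosing, could be:
# variances of Sepal length in clusters 1 and 2, across the 1000 retained iterations
post.vars1 = cbind(sapply(Covs1, function(S) S[1, 1]),
                   sapply(Covs2, function(S) S[1, 1]))
colnames(post.vars1) = c('Cluster1', 'Cluster2')
summary(post.vars1)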
Here, we will do something different. Specifically, we will first compute the average (posterior mean) covariance matrices of the two groups, using the R command Reduce (very useful for R lists), and assign appropriate column/row names.
post.mean.cov1 = Reduce('+', Covs1)/(mcmc.total.n - mcmc.burnin)
post.mean.cov2 = Reduce('+', Covs2)/(mcmc.total.n - mcmc.burnin)
colnames(post.mean.cov1) = c('Sepal length', 'Sepal width', 'Petal length', 'Petal width')
colnames(post.mean.cov2) = c('Sepal length', 'Sepal width', 'Petal length', 'Petal width')
rownames(post.mean.cov1) = colnames(post.mean.cov1)
rownames(post.mean.cov2) = colnames(post.mean.cov2)
post.mean.cov1
## Sepal length Sepal width Petal length Petal width
## Sepal length 0.13842434 0.10401508 0.02324501 0.01200375
## Sepal width 0.10401508 0.15026208 0.01320414 0.01036686
## Petal length 0.02324501 0.01320414 0.04278083 0.01011754
## Petal width 0.01200375 0.01036686 0.01011754 0.01466866
post.mean.cov2
## Sepal length Sepal width Petal length Petal width
## Sepal length 0.4395680 0.12441492 0.4481754 0.16516684
## Sepal width 0.1244149 0.11529567 0.1411903 0.07958529
## Petal length 0.4481754 0.14119032 0.6741754 0.28467213
## Petal width 0.1651668 0.07958529 0.2846721 0.17866540
They look quite different, which is interesting; however, covariance matrices are difficult to interpret. It is better to work with the average posterior correlation matrices, obtained with the command cov2cor:
post.mean.cor1 = cov2cor(post.mean.cov1)
post.mean.cor2 = cov2cor(post.mean.cov2)
post.mean.cor1
## Sepal length Sepal width Petal length Petal width
## Sepal length 1.0000000 0.7212160 0.3020639 0.2663886
## Sepal width 0.7212160 1.0000000 0.1646876 0.2208143
## Petal length 0.3020639 0.1646876 1.0000000 0.4038824
## Petal width 0.2663886 0.2208143 0.4038824 1.0000000
post.mean.cor2
## Sepal length Sepal width Petal length Petal width
## Sepal length 1.0000000 0.5526535 0.8232817 0.5893721
## Sepal width 0.5526535 1.0000000 0.5064213 0.5545058
## Petal length 0.8232817 0.5064213 1.0000000 0.8202349
## Petal width 0.5893721 0.5545058 0.8202349 1.0000000
This posterior output is indeed informative. Interestingly, in Cluster 1 the correlation matrix has a diagonal “block-type” structure, with stronger correlations within the blocks of the Sepal and Petal variables (especially for the former pair). On the other hand, in Cluster 2 all correlations generally increase, and Sepal length is more strongly correlated with Petal length than with Sepal width. We can visualise the cluster-specific average posterior correlation matrices using different R commands from different packages. Below we consider two options.
First, we use the R package pheatmap:
#install.packages('pheatmap')
library(pheatmap)
pheatmap(post.mean.cor1, display_numbers = TRUE, cluster_rows = FALSE, cluster_cols = FALSE, fontsize_number = 15, main = 'Average posterior correlations in Cluster 1')
pheatmap(post.mean.cor2, display_numbers = TRUE, cluster_rows = FALSE, cluster_cols = FALSE, fontsize_number = 15, main = 'Average posterior correlations in Cluster 2')
Plots from pheatmap are generally not compatible with generic R plot configurations such as par(mfrow = c(,)) or split.screen().
Second, we can use the R package corrplot, for instance:
#install.packages('corrplot')
library(corrplot)
corrplot.mixed(post.mean.cor1, lower = "number", upper = "ellipse")
mtext('Average posterior correlations in Cluster 1', side = 2, line = -4, outer = TRUE)
corrplot.mixed(post.mean.cor2, lower = "number", upper = "ellipse")
mtext('Average posterior correlations in Cluster 2', side = 2, line = -4, outer = TRUE)
Plot functions of corrplot are generally compatible with generic plot configurations such as par(mfrow = c(,)) or split.screen(), but the result is not always optimal! Also, it is better to place a main title via the extra command mtext, as done above, instead of using the main argument inside corrplot.mixed.
Informative vignettes on pheatmap and corrplot are available online. Of course, there is also the standard R heatmap function (type ?heatmap), as well as the options offered by ggplot2 and ggcorrplot. So, there are many available options!
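For instance, a minimal base-R alternative (assuming we simply want a quick heatmap without dendrograms) would be:
# base-R heatmap of the Cluster 1 average posterior correlation matrix
heatmap(post.mean.cor1, Rowv = NA, Colv = NA, symm = TRUE, scale = 'none',
        main = 'Average posterior correlations in Cluster 1')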
Finally, we can have a look at the posterior summaries and densities of \(\pi_1\) and \(\pi_2\). This is quite easy and exactly the same as we have already seen in Practical 1:
pi = mult.fit2$probs
post.pi = matrix(NA, nrow = mcmc.total.n - mcmc.burnin, ncol = 2)
for(i in 1:(mcmc.total.n - mcmc.burnin)){
post.pi[i,1] = pi[[i]][1]
post.pi[i,2] = pi[[i]][2]
}
summary(post.pi)
## V1 V2
## Min. :0.2263 Min. :0.5105
## 1st Qu.:0.3062 1st Qu.:0.6398
## Median :0.3307 Median :0.6652
## Mean :0.3322 Mean :0.6641
## 3rd Qu.:0.3574 3rd Qu.:0.6902
## Max. :0.4879 Max. :0.7736
par(mfrow = c(1,2))
plot(density(post.pi[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(pi[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.pi[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(pi[2]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('2-component mixture weights', side = 3, line = -2, outer = TRUE, cex = 1.5)
So far, based on all the results from Practical 1 as well as from this session, we have pretty much established the existence of 2 clusters in the Iris dataset, and we have carried out density estimation, clustering and posterior inference for the parameters under univariate as well as multivariate DPM models.
We could go a bit further and try to find the underlying reason for the apparent existence of two clusters, in an attempt to explain the results. First, we know that in this dataset the measurements come from 3 different species of the Iris flower (setosa, versicolor, virginica). So, it would perhaps be useful to first check whether distinct clusters are apparent within each species. The code and results below are for the setosa species (note that we now use method = 'MAR' in the mcmc list, as ICS tends to fail here).
y.setosa = y[iris[,5] == 'setosa', ]
n.setosa = nrow(y.setosa)
n.setosa
## [1] 50
grid1.setosa = seq(min(y.setosa[,1]), max(y.setosa[,1]), length.out = grid.length)
grid2.setosa = seq(min(y.setosa[,2]), max(y.setosa[,2]), length.out = grid.length)
grid3.setosa = seq(min(y.setosa[,3]), max(y.setosa[,3]), length.out = grid.length)
grid4.setosa = seq(min(y.setosa[,4]), max(y.setosa[,4]), length.out = grid.length)
grid.setosa = expand.grid(grid1.setosa, grid2.setosa, grid3.setosa, grid4.setosa)
DPprior = PYcalibrate(Ek = 4, n = n.setosa, discount = 0)
DPprior
## $strength
## [1] 0.830749
##
## $discount
## [1] 0
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'MAR', print_message = FALSE)
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE, m0 = rep(0,4), Sigma0 = diag(1000,4))
output = list(grid = grid.setosa, out_param = TRUE)
my.set.seed(1)
mult.fit1.setosa = PYdensity(y = y.setosa, mcmc = mcmc, prior = prior, output = output)
summary(mult.fit1.setosa)
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 30.49 seconds
## Average number of groups: 1.002
## Min number of groups: 1 ; Max number of groups: 2
cluster.labels = mult.fit1.setosa$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
## 1 2
## 998 2
p1 = plot(mult.fit1.setosa, dim = c(1, 1), xlab = "Sepal length", ylab = "Density", col = 2)
p2 = plot(mult.fit1.setosa, dim = c(2, 2), xlab = "Sepal width", ylab = "Density", col = 3)
p3 = plot(mult.fit1.setosa, dim = c(3, 3), xlab = "Petal length", ylab = "Density", col = 4)
p4 = plot(mult.fit1.setosa, dim = c(4, 4), xlab = "Petal width", ylab = "Density", col = 5)
p12 = plot(mult.fit1.setosa, dim = c(1, 2), show_clust = TRUE, xlab = "Sepal length", ylab = "Sepal width", col = 2)
p13 = plot(mult.fit1.setosa, dim = c(1, 3), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal length", col = 2)
p14 = plot(mult.fit1.setosa, dim = c(1, 4), show_clust = TRUE, xlab = "Sepal length", ylab = "Petal width", col = 2)
p21 = plot(mult.fit1.setosa, dim = c(2, 1), show_clust = TRUE, xlab = "Sepal width", ylab = "Sepal length", col = 3)
p23 = plot(mult.fit1.setosa, dim = c(2, 3), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal length", col = 3)
p24 = plot(mult.fit1.setosa, dim = c(2, 4), show_clust = TRUE, xlab = "Sepal width", ylab = "Petal width", col = 3)
p31 = plot(mult.fit1.setosa, dim = c(3, 1), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal length", col = 4)
p32 = plot(mult.fit1.setosa, dim = c(3, 2), show_clust = TRUE, xlab = "Petal length", ylab = "Sepal width", col = 4)
p34 = plot(mult.fit1.setosa, dim = c(3, 4), show_clust = TRUE, xlab = "Petal length", ylab = "Petal width", col = 4)
p41 = plot(mult.fit1.setosa, dim = c(4, 1), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal length", col = 5)
p42 = plot(mult.fit1.setosa, dim = c(4, 2), show_clust = TRUE, xlab = "Petal width", ylab = "Sepal width", col = 5)
p43 = plot(mult.fit1.setosa, dim = c(4, 3), show_clust = TRUE, xlab = "Petal width", ylab = "Petal length", col = 5)
grid.arrange(p1, p12, p13, p14, p21, p2, p23, p24, p31, p32, p3, p34, p41, p42, p43, p4, layout_matrix = matrix(1:16, 4, 4))
Based on the above results, it seems that there are no separate clusters within the species Setosa. If we perform the same analysis for the species Versicolor and Virginica (left as an exercise) the results look more or less the same; thus, there is no indication of separate clusters within species.
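As a starting point for that exercise, a sketch for Versicolor (not run here, reusing mcmc and grid.length from above) would be:
y.versicolor = y[iris[,5] == 'versicolor', ]
n.versicolor = nrow(y.versicolor)
# build grid.versicolor with expand.grid() exactly as was done for setosa, then:
DPprior.versicolor = PYcalibrate(Ek = 4, n = n.versicolor, discount = 0)
prior.versicolor = list(strength = DPprior.versicolor$strength, discount = 0, hyper = FALSE,
                        m0 = rep(0,4), Sigma0 = diag(1000,4))
# mult.fit1.versicolor = PYdensity(y = y.versicolor, mcmc = mcmc,
#                                  prior = prior.versicolor,
#                                  output = list(grid = grid.versicolor, out_param = TRUE))
# summary(mult.fit1.versicolor)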
So, whatever is driving the formation of the two clusters must be related to differences across species. Let's get back to the data and plot the sample means per species. The plot looks as follows.
y.versicolor = y[iris[,5] == 'versicolor', ]
y.virginica = y[iris[,5] == 'virginica', ]
par(mfrow = c(1,3))
plot(as.table(apply(y.setosa,2, mean)), ylim = c(0,7), col = c(2:5), lwd = 5, ylab = 'Centimeters', main = 'Setosa')
plot(as.table(apply(y.versicolor,2, mean)), ylim = c(0,7), col = c(2:5), lwd = 5, ylab = 'Centimeters', main = 'Versicolor')
plot(as.table(apply(y.virginica,2, mean)), ylim = c(0,7), col = c(2:5), lwd = 5, ylab = 'Centimeters', main = 'Virginica')
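As a numerical companion to these plots, we can also tabulate the sample means per species (output not shown here):
# sample means of the four measurements for each species
round(rbind(Setosa = colMeans(y.setosa),
            Versicolor = colMeans(y.versicolor),
            Virginica = colMeans(y.virginica)), 2)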
Interestingly, we observe that the mean profiles of Versicolor and Virginica are quite similar to each other, whereas Setosa clearly stands apart, mainly because of its much smaller Petal length and Petal width.
So, it seems that what the multivariate DPM model is identifying is one cluster primarily comprising the data points from the Setosa species and another cluster formed by the Versicolor and Virginica species. This also explains why, in the correlation matrix of the second cluster presented above, the average posterior correlation between Sepal length and Petal length is so high (equal to 0.82). In fact, based on the above results for the probabilities \(\pi_1\) and \(\pi_2\), we see that the posterior means are \(\hat{\pi}_1\approx 0.33\) and \(\hat{\pi}_2\approx 0.66\), matching the proportions of the sample sizes \(n_{\mathrm{setosa}}/n = 50/150\) and \((n_{\mathrm{versicolor}}+n_{\mathrm{virginica}})/n = (50+50)/150\).
apply(post.pi, 2, mean)
## [1] 0.3321509 0.6641143
nrow(y.setosa)/n
## [1] 0.3333333
(nrow(y.versicolor)+nrow(y.virginica))/n
## [1] 0.6666667