First, it is useful to recall that the datasets available in the packages installed in RStudio can be viewed via the command data():
data() # datasets in all installed packages
data(package = 'MASS') # datasets in package MASS
Today we will work with the iris dataset, which is a built-in dataset. First, let's load the dataset and view some of its characteristics.
data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150 5
head(iris, n=5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
?iris
The last command opens the help window where we can find information about these data. As we see, it contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
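As a quick check, we can tabulate the species variable to confirm that there are indeed 50 flowers per species:
table(iris$Species) # should show 50 observations for each of the three species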
Initially, let's store the numerical data in an object y, set the sample size n and produce some visualisations of the data.
y = iris[, 1:4]
n = nrow(y)
par(mfrow = c(2,2))
for (i in 1:4) {
hist(y[,i], col=i+1, main = '', xlab = names(y)[i])
}
pairs(y)
We see that the distributions of petal length and petal width appear to be bimodal. We also see a clear indication of the existence of two groups. Let's start by estimating non-parametrically the density of petal length.
Load the BNPmix library
library(BNPmix)
## Warning: package 'BNPmix' was built under R version 4.2.3
This will likely work immediately on a lab PC. If not, install the package first (as below) and then load the library (likewise if you are working on your own laptop and have not already installed the package):
install.packages('BNPmix')
The package BNPmix uses a very efficient C++ implementation for posterior sampling (Gibbs algorithms) from complex Dirichlet process mixtures. In fact, BNPmix uses the Pitman-Yor (PY) process, which is a generalization of the DP (the DP is a special case). We will not have time to learn about this process, but in practical terms it is sufficient to know that, in the notation we use in the lectures, we would assume that \[ G \sim \mathrm{PY}(\beta,\alpha,G_0), \] where \(\beta \in [0,1)\) is called the discount parameter, \(\alpha > -\beta\) the strength parameter and \(G_0\) is the usual centering measure. When the discount is zero, i.e. \(\beta=0\), we have a \(\mathrm{DP}(\alpha,G_0)\) as we know it from lectures.
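For intuition only, the stick-breaking construction of the PY process can be simulated directly. In the sketch below, sim_py_weights is an illustrative helper (not a BNPmix function) that draws the first K stick-breaking weights; with the discount set to zero it reduces to the DP stick-breaking we saw in lectures.
# Stick-breaking weights of PY(beta, alpha, G0):
# V_k ~ Beta(1 - beta, alpha + k * beta), pi_k = V_k * prod_{l < k} (1 - V_l)
# (only the first K weights of the infinite sequence, so they need not sum to 1)
sim_py_weights = function(K, alpha, beta = 0) {
  V = rbeta(K, shape1 = 1 - beta, shape2 = alpha + (1:K) * beta)
  V * cumprod(c(1, 1 - V[-K]))
}
set.seed(1)
round(sim_py_weights(K = 10, alpha = 1), 3)              # DP(1, G0): discount = 0
round(sim_py_weights(K = 10, alpha = 1, beta = 0.5), 3)  # PY with discount 0.5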
For univariate DPM models the package has the capability to fit infinite mixtures of normal densities via the stick-breaking representation \[\begin{equation} p(y_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2), \label{eq1} \end{equation}\] with prior distributions for \(\mu_k\) and \(\sigma_k^2\) based on a conditional normal/inverse-gamma centering measure \(G_0(\mu_k,\sigma_k^2)=G_0(\mu_k ~|~ \sigma_k^2)G_0(\sigma_k^2)\) as follows \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0),\] where \(a_0>0, b_0>0\) are the shape and scale parameters, respectively, so that \(\mathbb{E}[\sigma_k^2]=\displaystyle\frac{b_0}{a_0-1}\) for \(a_0>1\), and \(\mathbb{V}\mathrm{ar}[\sigma_k^2]=\displaystyle\frac{b_0^2}{(a_0-1)^2(a_0-2)}\) for \(a_0>2\). The prior on \(\mu_k\) is a little different from the prior we presented in lectures, as it is conditional on \(\sigma_k^2\), but this is also a common option.
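As a small sketch (with illustrative hyper-parameter values, not the package defaults), component-specific parameters can be drawn from this centering measure as follows:
set.seed(1)
K = 5; m0 = 0; k0 = 1; a0 = 2; b0 = 1                # illustrative values
sigma2 = 1 / rgamma(K, shape = a0, rate = b0)        # sigma_k^2 ~ InvGamma(a0, b0)
mu = rnorm(K, mean = m0, sd = sqrt(sigma2 / k0))     # mu_k | sigma_k^2 ~ N(m0, sigma_k^2 / k0)
cbind(mu, sigma2)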
BNPmix uses two basic functions. The first is PYcalibrate, which requires the specification of three basic arguments:
PYcalibrate(Ek, n, discount = 0)
- Ek: a prior expectation for the number of clusters. This must be specified; for instance, if we expect to have 4 clusters we can set it equal to 4 (or close to 4). If we have no expectation at all, we can set it equal to some large number.
- n: the sample size.
- discount: the discount parameter (default is 0, resulting in a DP).
A quick illustrative call is shown below this list.
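For example, with illustrative values for the expected number of clusters and the sample size, the call returns a list with elements strength and discount:
PYcalibrate(Ek = 4, n = 100) # returns the calibrated strength (and the discount, here 0)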
The function for density estimation via posterior sampling is PYdensity. Typing ?PYdensity will open the help window. This function requires the specification of four arguments:
PYdensity(y, mcmc = list(), prior = list(), output = list())
Let's go through the above one by one and see which are the most important arguments we need to define.
- y: the data vector (for univariate estimation) or matrix (for multivariate estimation).
- mcmc: list with arguments for the MCMC implementation:
  - niter: number of MCMC iterations.
  - nburn: number of initial iterations to discard (the “burn-in” period).
- prior: list with arguments for the prior specification:
  - discount: the discount parameter (here it must be set to 0 for a DP).
  - strength: the strength parameter of the PY process or the corresponding precision of the DP.
  - hyper: logical; when TRUE (which is the default option) a hierarchical DPM model (seen later) is implemented, when FALSE a standard DPM model is implemented.
- output: list with arguments for the output returned by the function:
  - grid: important as it defines the grid of points at which to evaluate the estimated posterior mean density. For instance, if \(y\) is univariate, we can define the grid to range from \(\min(y)\) to \(\max(y)\) and to have a specific length (e.g., 100).
  - out_param: logical with default equal to FALSE. Recommended to set it to TRUE: this returns the full set of parameters in the model, e.g., also the \(\mu_k\), \(\sigma_k^2\), etc.
Let's apply a DPM model to the variable Petal Length from the iris dataset, whose density certainly does not appear unimodal. So, in R we are going to use the vector y[,3].
Tasks:
1. In PYcalibrate set the number of expected clusters equal to 3 and also the sample size.
2. Define an object mcmc as a list with elements niter equal to 11000 and nburn equal to 1000.
3. Define an object prior as a list with elements strength equal to 1, discount equal to 0 and hyper equal to FALSE.
4. Define an object output as a list with element grid defined as a sequence from the min of y[,3] to the max of y[,3] with length equal to 100 (type ?seq for help) and element out_param set equal to TRUE.
5. Then set.seed(1) (for reproducibility) and execute fit1 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output).
If everything is done correctly, the command summary() will return the following output for fit1:
summary(fit1)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 5.23 seconds
## Average number of groups: 3.5399
## Min number of groups: 2 ; Max number of groups: 12
This provides information about the runtime, the number of iterations and the burn-in period used. Importantly, it tells us that over the 10000 retained MCMC iterations the minimum number of clusters formed was 2 and the maximum was 12, with the average number of clusters being about 3.5 (4 when rounded).
If we want to overlay the posterior mean density on top of the histogram and also show the 95% posterior credible bands for the density, we can use the line of code below. Overall, we see that the 95% credible bands capture the uncertainty in the data almost everywhere.
plot(fit1, band = TRUE, show_hist = TRUE, xlab = "Petal length", ylab = 'Density')
Note that BNPmix transfers arguments from the package function plot.BNPdens to the generic R function plot. Standard arguments within plot such as col, xlab, etc. can be modified (but not all). Typing ?plot.BNPdens reveals the list of arguments. We can see, for instance, that we can suppress the credible bands or change them (the default option is conf_level = c(0.025, 0.975)). We can also add the average posterior cluster assignment of the observations, shown as coloured points on the \(x\)-axis, as follows.
plot(fit1, band = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density')
We saw previously that the command summary(fit1) provides the average number of clusters. However, what about the most frequently visited number of clusters? We are more interested in this quantity as it is the maximum a posteriori (MAP) estimate of the number of clusters. Unfortunately, BNPmix does not provide an immediate command for that, but there is a workaround with some additional coding. First, let's have a look at the names of the outputs that fit1 contains.
names(fit1)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
We are interested in the output clust, which contains the cluster label of each observation for each of the 10000 posterior samples. Let's have a closer look.
cluster.labels = fit1$clust
class(cluster.labels)
## [1] "matrix" "array"
dim(cluster.labels)
## [1] 10000 150
cluster.labels[1,] # labels from 1st MCMC sample; 6 distinct cluster labels appear
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 3 0 3 0 5 3 0 3 0 3 0
## [112] 0 0 0 0 3 0 3 3 0 0 0 3 0 4 0 0 0 0 0 0 3 0 0 3 0 0 0 0 0 0 0 0 4 0 0 3 0
## [149] 0 0
Below, n.cluster is built in three steps: applying unique to each row of cluster.labels gives, for every MCMC sample, the distinct cluster labels visited; lapply(..., length) then returns the number of clusters per sample; and unlist just transforms the class of n.cluster from list to vector of integers (as verified in the last line of code below).
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
length(n.cluster); class(n.cluster)
## [1] 10000
## [1] "integer"
Almost there! Now we can just use the commands table() and plot() to create a (relatively) nice figure of the posterior probabilities of the number of clusters.
counts = table(n.cluster)
counts
## n.cluster
## 2 3 4 5 6 7 8 9 10 11 12
## 2368 3143 2364 1339 523 191 52 15 1 1 3
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters', ylab = 'Posterior probability', main = 'alpha = 1')
As seen, the most visited model is the one with \(K=3\), although \(K=2\) also has a relatively high posterior probability (but so does the model with \(K=4\)). This is not exactly what we expected based on the exploratory visualisations of the dataset at the beginning.
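As a side note, the MAP estimate of the number of clusters can also be extracted programmatically from the table of counts:
as.integer(names(which.max(counts))) # most frequently visited number of clusters (MAP)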
We may wish to improve this aspect of the model (although that is not very important for density estimation). Could it be that the results are affected by setting the expected number of clusters equal to 3? We did this by setting Ek = 3 in the function PYcalibrate. You may consider changing this option to Ek = 2 (left as an exercise), but as you will see it will not affect things (at least under set.seed(1)).
What we can consider, however, is changing the prior precision of the DP (the strength in terms of the PY process). Previously, we defined the prior through
prior = list(strength = 1, discount = 0, hyper = 'FALSE')
BNPmix has an internal procedure for automatically selecting the strength parameter based on the sample size and the discount parameter. We can find this value as follows:
DPprior = PYcalibrate(Ek = 3, n = n, discount = 0)
DPprior
## $strength
## [1] 0.3938936
##
## $discount
## [1] 0
Task:
6. Refit the model (with the same seed) using the above value for strength and plot the resulting posterior probabilities of the number of clusters for this model.
If this is done correctly, the results should be as shown below. Now the MAP estimate for the number of clusters is, indeed, \(K=2\).
counts
## n.cluster
## 2 3 4 5 6 7
## 5630 3219 899 207 42 3
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters', ylab = 'Posterior probability', main = 'alpha = 0.3938936')
Above, in 3.1 we estimated the posterior density and its related uncertainty, and in 3.2 we inferred the most likely a posteriori number of clusters. These two are typically the main goals of nonparametric analyses. However, what about posterior inference for the parameters: is that still possible, and does BNPmix provide this extra inferential option?
The answer is yes, but of course the only way posterior summaries of parameters would make sense is when we work conditional upon a given \(K\). A sensible option is to select the MAP estimate, which is \(K=2\) in this application. So we will basically infer about the 2-component mixture of the form
\[ p(y_i) = \sum_{k=1}^2\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2). \] Clarification note: The results below are not equivalent to the results obtained from using a 2-component Bayesian normal mixture model as a starting point (even if the prior specification is similar in form). If the main aim is inference about parameters, then a more disciplined approach would be to first determine the number of clusters (as we did above) and then infer (using MCMC) from a finite 2-component mixture. That being said, in practical terms the results will likely look quite similar!
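As an aside, if one wanted to condition strictly on the MAP value, a rough workaround is to recompute n.cluster for the refitted model (as in Section 3.2) and keep only the MCMC draws that visited exactly two clusters; the summaries below instead simply use the first two components of every draw.
keep = which(n.cluster == 2) # indices of draws with exactly two occupied clusters
length(keep)                 # how many such draws are retained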
Let's see how we can extract posterior summaries for the means \(\mu_1\) and \(\mu_2\). The key thing is to remember that we can see the objects contained in fit1 via names(fit1), as we have seen above. The object of interest in this case is mean. Below we store this information in a new object means; this is a list: means[[1]] and means[[10000]] show the resulting means from the first and last MCMC sample (the burn-in samples are removed automatically).
means = fit1$mean
means[[1]]; means[[10000]]
## [,1]
## [1,] 4.985267
## [2,] 1.453260
## [3,] 1.871344
## [4,] 7.050652
## [5,] 3.817440
## [6,] 14.606507
## [,1]
## [1,] 5.065751
## [2,] 1.467006
## [3,] 1.333473
Now, we are only interested in the posterior samples of the first two components, so let's initially create a matrix of NAs with 10000 (the MCMC sample size) rows and \(K=2\) columns. Next, we fill the matrix post.means row by row, with the first column containing the samples of \(\mu_1\) and the second the samples of \(\mu_2\).
post.means = matrix(NA, nrow = 10000, ncol = 2)
for(i in 1:10000){
post.means[i,1] = means[[i]][1]
post.means[i,2] = means[[i]][2]
}
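The same matrix can equivalently be built in one line (a compact alternative to the loop above):
post.means = t(sapply(means, function(m) m[1:2])) # 10000 x 2 matrix of (mu_1, mu_2) draws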
We now have everything we require and can evaluate posterior summaries of \((\mu_1,\mu_2)\).
summary(post.means)
## V1 V2
## Min. :3.825 Min. :-0.9906
## 1st Qu.:4.846 1st Qu.: 1.4433
## Median :4.906 Median : 1.4627
## Mean :4.909 Mean : 1.4697
## 3rd Qu.:4.968 3rd Qu.: 1.4816
## Max. :6.743 Max. : 6.6708
We see, for instance, that the posterior means are \(\hat{\mu}_1 = 4.909\) and \(\hat{\mu}_2 = 1.4697\). Does this generally make sense? It seems so: if we look again at the histogram of the variable Petal length, one group mode appears to be around 5 while the other is somewhere around 1.5 (approximately). We can also produce relatively nice-looking density plots.
par(mfrow = c(1,2))
plot(density(post.means[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3)
We observe that the posterior density of \(\mu_2\) is certainly peaked around the posterior mode, but it is also quite spread out, taking values in extreme regions with very low probability density. We can improve this plot by controlling the \(x\)-axis range via the argument xlim within plot, as done below.
par(mfrow = c(1,2))
plot(density(post.means[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means[,2]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3, xlim = c(1,2))
Tasks:
7. Produce similar posterior summaries and density plots for \(\sigma_1^2\) and \(\sigma_2^2\) (hint: for the plot of \(\sigma_2^2\) use xlim = c(0,0.2)).
8. Produce similar posterior summaries and density plots for \(\pi_1\) and \(\pi_2\).
If done correctly, you should get the following results and figures:
## V1 V2
## Min. :0.07962 Min. :0.006171
## 1st Qu.:0.59904 1st Qu.:0.031904
## Median :0.66386 Median :0.037288
## Mean :0.66643 Mean :0.038879
## 3rd Qu.:0.73513 3rd Qu.:0.043799
## Max. :1.27629 Max. :0.523315
## V1 V2
## Min. :0.009637 Min. :0.001542
## 1st Qu.:0.625312 1st Qu.:0.301021
## Median :0.656984 Median :0.327965
## Mean :0.642240 Mean :0.327050
## 3rd Qu.:0.685708 3rd Qu.:0.355353
## Max. :0.809523 Max. :0.485146
Finally, let us discuss the aspect of setting the hyper-parameters of the model. As mentioned above, the prior assumptions under BNPmix are the following: \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0).\] The default options used by BNPmix are as follows: \[m_0 = \bar{y}= \frac{1}{n}\sum_iy_i, ~k_0 =1, ~a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] So, as we see, BNPmix actually uses an empirical Bayes procedure where the prior hyper-parameters \(m_0\) and \(b_0\) are estimated from the data! Practically, this is fine and quite understandable from the side of the developers, as it is a generally “safe” option ensuring that the algorithmic procedures implemented “under the hood” provide stable estimates (or at least, do not crash completely) across different types of datasets.
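To see what these defaults amount to for the variable we analysed (Petal length), we can compute them directly:
mean(y[,3]) # default m0 for Petal length
var(y[,3])  # default b0 for Petal length (with k0 = 1 and a0 = 2)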
That being acknowledged, it should be mentioned that hard-core Bayesians would not like this approach at all! Even more moderate Bayesian statisticians would like to have some control over this and potentially apply a fully Bayesian approach. So the question is: can we change the prior specification?
The answer is yes. The setting of the prior hyper-parameters can be controlled via the prior list, which, apart from the arguments listed above in Section 2.3, can take the following additional arguments:
- prior: list with arguments for the prior specification:
  - discount: the discount parameter (here it must be set to 0 for a DP).
  - strength: the strength parameter of the PY process or the corresponding precision of the DP.
  - hyper: provides control for choosing a DPM or a hierarchical DPM.
  - m0: the prior mean of each \(\mu_k\).
  - k0: the scalar \(k_0>0\) controlling the variance of each \(\mu_k\).
  - a0: the shape \(a_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\).
  - b0: the scale \(b_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\).
Suppose we want to adopt a non-informative prior such that \[\mu_k ~|~\sigma_k^2\sim\mathrm{N}(0,\sigma^2_k) ~\mbox{and}~\sigma_k^2\sim\mathrm{InvGamma}(1,1000).\] Note that for \(a_0=1\) the mean and the variance are not finite (the formulas above require \(a_0>1\) and \(a_0>2\), respectively), but we can still check empirically that draws from this inverse-gamma are typically very large; for instance:
set.seed(1)
mean(1/rgamma(1000, shape = 1, rate = 1000))
## [1] 450650.4
var(1/rgamma(1000, shape = 1, rate = 1000))
## [1] 694001710
Above we made use of the following relationship: if \(X\sim \mathrm{Gamma}(a,b)\) (where \(b\) is the rate parameter), then \(1/X\sim \mathrm{InvGamma}(a,b)\).
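As a quick sanity check of this relationship and of the mean formula \(b_0/(a_0-1)\) given earlier, we can compare a Monte Carlo average with the theoretical value in a case where the mean exists, e.g. \(a_0=3\) and \(b_0=2\):
set.seed(1)
mean(1/rgamma(100000, shape = 3, rate = 2)) # should be close to 2/(3-1) = 1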
Tasks:
9. Using the automatic BNPmix option for setting strength, while discount should always be equal to 0 and hyper equal to FALSE, define the prior list so that \(m_0=0\), \(k_0=1\) (this is actually the default) and \(a_0=1, ~ b_0=1000\). Then run the new model (without changing the rest of the previous specifications) using again set.seed(1) and call this fit2.
10. Produce the two density plots we saw above, now for fit2 (change the colour; e.g., use col = 2 within the plot).
11. Visualise, as done before, the posterior probabilities for the number of clusters under model fit2.
12. Calculate posterior summaries and produce posterior density plots for the means under the MAP model resulting from fit2 (use again xlim = c(1,2) for the second mean component).
Results should look as follows:
## n.cluster
## 2 3 4 5 6 7
## 5530 3309 950 168 42 1
Did results change under the modified prior specification?
In this application the related inference was basically unaffected. So, we avoided using empirical Bayes and went with a fully Bayesian approach which, in the end, provided more or less the same results.
1) The suggestion above was to use \(a_0=1\) and \(b_0=1000\); these result in a diffuse prior with a large mean and very large variance. This choice came after performing some experiments of my own using a more common approach, namely setting \(a_0=b_0\) to be very small. It turned out that values smaller than 1 either provide unstable estimates for the parameters (e.g., for \(a_0=b_0=0.1\)) or result in a complete crash (e.g., for \(a_0=b_0=0.001\))! So, with this package it seems better to create a diffuse inverse-gamma prior by setting \(a_0\) small and \(b_0\) large (this is something you can explore further in other applications).
2) Sometimes things can go wrong for reasons that are obscure. For instance, let's say that we want to analyse Petal width (the 4th column of y). In this case:
prior = list(strength = DPprior$strength, discount = 0, hyper = 'FALSE')
output = list(grid = seq(min(y[,4]), max(y[,4]), length.out = 100), out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y[,4], mcmc = mcmc, prior = prior, output = output)
If you execute this code you will receive the following error message:
Error in cICS(y, grid_use, niter, nburn, m0, k0, a0, b0, m1, s21, tau1, : randg(): incorrect distribution parameters; a and b must be greater than zero
Now, the fact that the error starts with cICS gives reason to suspect that something is going wrong with the MCMC sampler used. BNPmix can implement three algorithms: an Importance Conditional Sampler (ICS), a Marginal Sampler (MAR) and a Slice Sampler (SLI). These can be controlled via the mcmc list through the extra argument method. The default option is method = 'ICS'.
However, before changing the algorithm, let's first consider a practical solution that might help. In these types of applications it is sometimes better to scale the data first (for each column, subtract its mean and divide by its standard deviation). So, let's try this:
y.scaled = scale(y)
apply(y.scaled,2,mean); apply(y.scaled,2,var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## -4.484318e-16 2.034094e-16 -2.895326e-17 -3.663049e-17
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
output = list(grid = seq(min(y.scaled[,4]), max(y.scaled[,4]), length.out = 100), out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y.scaled[,4], mcmc = mcmc, prior = prior, output = output)
summary(fit3)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 2.51 seconds
## Average number of groups: 4
## Min number of groups: 4 ; Max number of groups: 4
plot(fit3, band = TRUE, show_hist = TRUE, xlab = "Petal width", ylab = 'Density', col = 3)
Well, in this case we managed to run the sampler, but the results are awful! So, let's change the sampler and use MAR instead.
mcmc = list(niter = 11000, nburn = 1000, method = 'MAR')
output = list(grid = seq(min(y[,4]), max(y[,4]), length.out = 100), out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y[,4], mcmc = mcmc, prior = prior, output = output)
## Completed: 1100/11000 - in 0.468 sec
## Completed: 2200/11000 - in 0.967 sec
## Completed: 3300/11000 - in 1.496 sec
## Completed: 4400/11000 - in 1.975 sec
## Completed: 5500/11000 - in 2.503 sec
## Completed: 6600/11000 - in 2.966 sec
## Completed: 7700/11000 - in 3.585 sec
## Completed: 8800/11000 - in 4.055 sec
## Completed: 9900/11000 - in 4.534 sec
## Completed: 11000/11000 - in 5.05 sec
##
## Estimation done in 5.05 seconds
summary(fit3)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 5.05 seconds
## Average number of groups: 3.2712
## Min number of groups: 2 ; Max number of groups: 30
plot(fit3, band = TRUE, show_hist = TRUE, xlab = "Petal width", ylab = 'Density', col = 3)
It works! Plus, it will also work without scaling the data.
Warning note: If you try to run this with method = 'SLI', RStudio will crash for some reason!