Today we will work with the abalone dataset from the R package AppliedPredictiveModeling. So, first let's install the package and load it, and also load the other packages we will require.
install.packages('AppliedPredictiveModeling') # no reason to execute this if already installed
library(AppliedPredictiveModeling)
library(BNPmix)
library(ggplot2)
library(gridExtra)
To see the datasets contained in the package AppliedPredictiveModeling we can use:
data(package = 'AppliedPredictiveModeling')
Now, let's import the abalone data.
data(abalone)
class(abalone)
## [1] "data.frame"
dim(abalone)
## [1] 4177 9
head(abalone)
## Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775
## ShellWeight Rings
## 1 0.150 15
## 2 0.070 7
## 3 0.210 9
## 4 0.155 10
## 5 0.055 7
## 6 0.120 8
n = nrow(abalone)
Before proceeding to any analysis, it is important to check whether a dataset contains missing values (NA entries). We can do this as follows.
sum(is.na(abalone))
## [1] 0
The result is zero, which is good: it means this dataset has no missing values. If the result were not zero, we would remove the rows that contain missing values with abalone = na.omit(abalone).
Typing ?abalone will open the help window. There we read that the data consist of measurements of the type (male, female and infant), the longest shell measurement, the diameter, the height and several weights (whole, shucked, viscera and shell). The outcome is the number of rings. The age of an abalone is the number of rings plus 1.5.
The command attach comes in very handy, as it lets us refer to the columns of a data.frame directly by their names. Let's consider the continuous measurements as potential predictors and call them x1, ..., x7 for simplicity.
attach(abalone)
x1 = LongestShell
x2 = Diameter
x3 = Height
x4 = WholeWeight
x5 = ShuckedWeight
x6 = VisceraWeight
x7 = ShellWeight
We will consider the age of the abalones as the response variable, which, as stated in the description, is the number of rings plus 1.5, so:
Age = Rings + 1.5
y = Age
hist(y, xlab = 'Abalone age', col = 51, main = '')
Next, we can have a look at plots of the predictors vs. the response. Here, we can take advantage of the additional information we have from the variable Type, which is a factor with 3 levels, F (female), I (infant) and M (male), and colour the data points according to Type.
par(mfrow = c(3,3))
plot(LongestShell, Age, col = Type, bty = 'l')
plot(Diameter,Age, col = Type, bty = 'l')
plot(Height,Age, col = Type, bty = 'l')
plot(WholeWeight,Age, col = Type, bty = 'l')
plot(ShuckedWeight,Age, col = Type, bty = 'l')
plot(VisceraWeight,Age, col = Type, bty = 'l')
plot(ShellWeight,Age, col = Type, bty = 'l')
To find out which categories the colours correspond to, we can type
levels(Type)
## [1] "F" "I" "M"
which means that the first category, F, is col = 1 (black), I is col = 2 (red) and M is col = 3 (green). Note that if this were a clustering problem without knowing Type, it would be quite complicated; the distinction between I and the rest is relatively obvious, but the distinction between F and M is pretty hard (only Age and Height would probably give some indication). Let's also plot the pairs of predictors, just out of curiosity:
pairs(abalone[ , -c(1, 9)], col = Type)
Difficult clustering problem indeed!
In the regression setting we have the response \(\mathbf{y} = (y_1,\ldots,y_n)^T\) and the \(n\times(p+1)\) matrix of predictors \(\mathbf{X}_0\), whose first column is a vector of 1’s. BNPmix exploits the stick-breaking formulation (and uses Gibbs samplers on top of that)
\[\begin{equation}
p(y_i ~|~ \mathbf{x}_{i}) = \sum_{k=1}^\infty \pi_k
\mathrm{N}(y_i~|~\mathbf{x}_{0i}^T\boldsymbol{\beta}_k, \sigma^{2}_k),
\end{equation}\] with prior distributions for \(\boldsymbol{\beta}_k\) and \(\sigma_k^2\) based on a conditional
conjugate multivariate normal/inverse-gamma centering measure,
such that \[\boldsymbol{\beta}_k\sim
G_0(\boldsymbol{\beta}_k)\equiv \mathrm{N}_{p+1}(\mathbf{m}_0, \mathbf{S}_0)
\mbox{ and }
\sigma_k^2\sim G_0(\sigma_k^2) \equiv
\mathrm{InvGamma}(a_0,b_0),\] where \(a_0\) and \(b_0\) are the shape and scale
parameters of the inverse-gamma distribution, respectively. The default
options in BNPmix are: \[\mathbf{m}_0 =
(\bar{y}, 0,\ldots,0)^T, ~\mathbf{S}_0 = \mathrm{diag}(100,\ldots,100),~
a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] As seen, we have a data-dependent prior for the intercept (\(\beta_0\)), as the mean of the normal is \(\bar{y}\), and also for the variance, since \(b_0\) is set equal to the sample variance of the response.
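For instance (a quick check of this default on our data), \(b_0\) is simply the sample variance of the response:
b0_default = sum((y - mean(y))^2) / (n - 1)   # illustrative name; equals var(y)
b0_default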
Exactly as before, we can also extend the model to a hierarchical DPM regression model. In this case BNPmix facilitates the use of the following hyper-priors \[ \mathbf{m}_0 ~|~\mathbf{S}_0\sim \mathrm{N}_{p+1}(\mathbf{m}_1,k_1^{-1}\mathbf{S}_0), ~ \mathbf{S}_0 \sim \mathrm{IW}(\mathbf{S}_1,\nu_1), ~ b_0 \sim \mathrm{Gamma}(\tau_1,\zeta_1).\]
Again, BNPmix does not allow placing a hyper-prior on \(a_0\). The default options for a hierarchical DPM (HDPM)
regression model are: \[\mathbf{m}_1 =
(\bar{y}, 0,\ldots,0)^T, ~ k_1 =1,~\mathbf{S}_1 =
\mathrm{diag}(100,\ldots,100), ~\nu_1 = 1, ~\tau_1 =\zeta_1 =
1.\] Let us examine the prior list object with all of its relevant arguments:
prior: list with arguments for the prior specification:
discount: the discount parameter (here it must be set to 0 for a DP).
strength: the strength parameter of the PY process or, equivalently, the precision of the DP.
hyper: controls whether a DPM or a hierarchical DPM is used.
m0: the prior mean vector \(\mathbf{m}_0\) of each \(\boldsymbol{\beta}_k\) (DPM).
S0: the prior variance-covariance matrix \(\mathbf{S}_0\) (DPM).
a0: the shape \(a_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\) (DPM and HDPM).
b0: the scale \(b_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\) (DPM).
m1: the hyper-prior mean vector of \(\mathbf{m}_0\) (HDPM).
k1: the scalar \(k_1>0\) controlling the variance-covariance of \(\mathbf{m}_0\) (HDPM).
tau1: the shape \(\tau_1>0\) of the gamma distribution on \(b_0\) (HDPM).
zeta1: the rate \(\zeta_1>0\) of the gamma distribution on \(b_0\) (HDPM).
n1: the degrees-of-freedom parameter \(\nu_1>q-1\) of the inverse-Wishart distribution on \(\mathbf{S}_0\) (HDPM).
S1: the variance-covariance matrix \(\mathbf{S}_1\) of the inverse-Wishart distribution on \(\mathbf{S}_0\) (HDPM).
It is important to understand that regression models essentially fit a different distribution to each \(y_i\) conditional on \(\mathbf{x}_{0i}\). In standard normal linear regression this distribution is always normal, symmetrically centered around \(\mathbf{x}_{0i}^T\hat{\boldsymbol{\beta}}\). DPM regressions essentially do the same, but the conditional distribution can be any distribution and it can take different forms/shapes across the data points (so these are very flexible regression models).
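As an illustration (a sketch only, using the argument names listed above with purely illustrative values, and assuming an intercept plus a single predictor so that the location parameters have dimension 2), a hierarchical specification could be passed as:
prior_hdpm = list(strength = 1, discount = 0,  # discount = 0 corresponds to a DP
                  hyper = TRUE,                # switch on the hierarchical (HDPM) specification
                  a0 = 2,                      # a0 stays fixed (no hyper-prior allowed)
                  m1 = c(mean(y), 0),          # hyper-prior mean of m0
                  k1 = 1,                      # scalar controlling the variance of m0
                  tau1 = 1, zeta1 = 1,         # gamma hyper-prior on b0
                  n1 = 3, S1 = diag(100, 2))   # inverse-Wishart hyper-prior on S0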
Let's start with one predictor variable, choosing the variable Height (x3). The function for DPM regression is PYregression, and its functionality is essentially the same as that of PYdensity, since it depends on PYcalibrate and the lists prior, mcmc and output. Below we set some initial specifications like Ek = 3, niter (it is recommended to keep niter small, at least in initial exploratory runs) and so forth.
DPprior = PYcalibrate(Ek = 3, n = length(y), discount = 0)
DPprior
## $strength
## [1] 0.2329552
##
## $discount
## [1] 0
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
mcmc = list(niter = 2000, nburn = 1000)
grid_y = seq(min(y), max(y), length.out = 100)
Above we set the length of grid_y to be equal to 100; this is okay in this case since \(\mathbf{y}\) is univariate (similarly to univariate DPM density estimation).
Now we must consider the grid over x3, and this is important. As previously mentioned, the DPM regression model will provide a distribution of \(y_i\) given each different \(x_{i3}\). So, first, it would be too much to ask from BNPmix to set grid_x3 = seq(min(x3), max(x3), length.out = 100)! Second, even if we could do that, the results would be too chaotic to summarise! So, what we want to do with this type of regression is to set the grid based on specific values of \(x\) that might be of interest. For instance, we can use the quartiles of \(\mathbf{x}_3\). In R these are given by the command quantile:
quantile(x3)
## 0% 25% 50% 75% 100%
## 0.000 0.115 0.140 0.165 1.130
The above correspond to the min, 1st quartile, median, 3rd quartile and max, respectively. Using the min and max seems a bit awkward, so below we will replace these with the 1% and 99% quantiles, respectively.
grid_x3 = quantile(x3, probs = c(0.01, 0.25, 0.50, 0.75, 0.99))
grid_x3 = round(grid_x3, 3)
grid_x3
## 1% 25% 50% 75% 99%
## 0.045 0.115 0.140 0.165 0.220
We can now have our first try with PYregression; note that with this function it is important to specify out_type = 'FULL' in the output list.
output = list(grid_y = grid_y, grid_x = grid_x3, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
## Completed: 200/2000 - in 2.422 sec
## Completed: 400/2000 - in 4.764 sec
## Completed: 600/2000 - in 7.413 sec
## Completed: 800/2000 - in 10.422 sec
## Completed: 1000/2000 - in 13.05 sec
## Completed: 1200/2000 - in 15.378 sec
## Completed: 1400/2000 - in 17.668 sec
## Completed: 1600/2000 - in 20.931 sec
## Completed: 1800/2000 - in 25.113 sec
## Completed: 2000/2000 - in 30.541 sec
##
## Estimation done in 30.541 seconds
summary(fit.reg)
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 30.54 seconds
## Average number of groups: 6.889
## Min number of groups: 5 ; Max number of groups: 12
We have managed a first attempt at fitting a DPM regression. The results look a bit complex, with minimum and maximum numbers of clusters/groups equal to 5 and 12, respectively, and an average number around 7. Here we could try tweaking several things, like strength, Ek, hyper and the various prior options, but the results would likely not change drastically. A far simpler and more effective approach is to first reconsider the data visualised above, where we plotted Age against the available predictor variables, and then think of a transformation of \(y\). As we see in that plot, normal linear regression would fail in general (for all covariates), not so much due to non-linearity (although some of this is also evident), but mainly due to unequal variances across the \(x\)-axes. So, let's reduce the variance of the response and bring its scale closer to the scales of the predictors.
Tasks:
1. Set y to be the logarithm of Age and reproduce the second plot presented above. If done correctly, the plot should look similar to the one presented below.
We definitely decreased the variance of the response by a bit.
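One possible way to carry out Task 1 (a sketch only, reusing the plotting code from above):
y = log(Age)   # work with the logarithm of the response from now on
par(mfrow = c(3,3))
plot(LongestShell, y, col = Type, bty = 'l')
plot(Diameter, y, col = Type, bty = 'l')
plot(Height, y, col = Type, bty = 'l')
plot(WholeWeight, y, col = Type, bty = 'l')
plot(ShuckedWeight, y, col = Type, bty = 'l')
plot(VisceraWeight, y, col = Type, bty = 'l')
plot(ShellWeight, y, col = Type, bty = 'l')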
2. Refit fit.reg (be careful to specify grid_y and output again, as the scale of the response has now changed). Use summary() to see whether there are any changes.
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 15.21 seconds
## Average number of groups: 4.067
## Min number of groups: 4 ; Max number of groups: 6
Done correctly, the result from summary() should be as above. It looks like we are heading in a better, simpler direction through the transformation of the response.
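One possible sketch for Task 2 (the grid over \(y\) and the output list are rebuilt for the new scale of the response):
grid_y = seq(min(y), max(y), length.out = 100)   # y is now log(Age)
output = list(grid_y = grid_y, grid_x = grid_x3, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg)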
3. Try to lower the resulting average number of clusters a little bit by decreasing the prior expected number of clusters to 2. If the change is implemented correctly, you should see the following output.
## $strength
## [1] 0.1143963
##
## $discount
## [1] 0
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 8.85 seconds
## Average number of groups: 3.074
## Min number of groups: 3 ; Max number of groups: 4
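One possible sketch for Task 3 (only the calibration step and the strength parameter change):
DPprior = PYcalibrate(Ek = 2, n = length(y), discount = 0)   # prior expected number of clusters = 2
DPprior
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg)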
4. Given all the previous specifications, remove the default data-dependent assumptions and set \(\mathbf{m}_0=(0,0)^T\), \(a_0 = 3\) and \(b_0 = 100\). In this way the prior expectation and variance of \(\sigma_k^2\) are \(\mathbb{E}[\sigma_k^2] = b_0/(a_0-1)=50\) and \(\mathbb{V}\mathrm{ar}[\sigma_k^2] = b_0^2/\{(a_0-1)^2(a_0-2)\}=2500\), respectively. Use summary() and also calculate the posterior probabilities for the number of clusters (we have previously seen how to do this). The results should be as below.
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 9.75 seconds
## Average number of groups: 3.147
## Min number of groups: 3 ; Max number of groups: 5
## n.cluster
## 3 4 5
## 0.855 0.143 0.002
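One possible sketch for Task 4; the last two lines assume (as in the previous labs) that the value clust of the fitted object stores the cluster allocations, one row per saved MCMC iteration:
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE,
             m0 = c(0, 0), a0 = 3, b0 = 100)   # removes the data-dependent defaults
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg)
n.cluster = apply(fit.reg$clust, 1, function(x) length(unique(x)))   # clusters per iteration
table(n.cluster)/length(n.cluster)                                   # posterior probabilities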
Not bad so far! We have gradually managed to set up a DPM regression model which is not data-dependent and actually gives high posterior probability to the existence of 3 clusters! Note that no information from Type was used in the actual model (at least not as a direct input).
Plotting posterior conditional densities
Visualising posterior densities is a bit tricky for DPM regression models, as it requires a relatively good understanding of data.frame and ggplot, but the R code below can be used as a template. The vital information is stored in the value density of our object fit.reg.
class(fit.reg$density)
## [1] "array"
dim(fit.reg$density)
## [1] 100 5 1000
As we see, fit.reg$density is a 3-dimensional array containing 1000 (the number of posterior samples) matrices with 100 rows (the grid length of y) and 5 columns (the grid length of x3). So, in the regplot data frame created below, dens contains the mean density evaluations, and qlow and qupp the lower and upper limits of the 95% credible band. The column grid sets the grid for the plots appropriately, while the last column, label, sets the correct names for the plots that we produce next.
regplot = data.frame(
dens = as.vector(apply(fit.reg$density, c(1, 2), mean)),
qlow = as.vector(apply(fit.reg$density, c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg$density, c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x3)),
label = factor(rep(paste('Height = ', grid_x3), each = length(grid_y)),
level = rep(paste('Height = ', grid_x3)))
)
We can now use ggplot syntax to produce a nice plot of the posterior conditional densities of \(y\) (the logarithm of age) for \(x_3 \in\{0.045, 0.115, 0.140, 0.165, 0.220\}\) (height). As we see, the density of the response varies within the range of the predictor variable. We also see that the credible bands are generally very narrow for non-extreme values of Height; this is due to the large sample size (\(n= 4177\)). Of course, in the extreme regions of Height we have far fewer observations and, thus, more uncertainty (reflected by the wider posterior credible bands).
ggplot(regplot) + theme_bw() +
geom_line(data = regplot, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = 'blue', alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = 'log(Age)', y = 'Density')
Posterior summaries and plots
Similarly to what we have seen in previous applications, we can detect objects of interest by examining the output of fit.reg via the command names(). For instance, we see below the value beta, which contains the posterior samples of \(\boldsymbol{\beta}_k=(\beta_{k0},\beta_{k1})^T\), one row per cluster of the model visited at each MCMC iteration (the number of clusters varies across the models visited by BNPmix).
names(fit.reg)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
betas = fit.reg$beta
length(betas) # 1000 posterior samples
## [1] 1000
betas[[1]] # The 1st sample for the 3 clusters (1st column for intercept, 2nd for slope)
## [,1] [,2]
## [1,] 1.680566 4.52727815
## [2,] 1.547222 7.20887658
## [3,] 2.461443 -0.02907869
We can store these in 3 matrices (since we are interested in inferring under the MAP estimate \(\hat{K} = 3\)) of appropriate dimensionality (1000 rows for the MCMC samples and 2 columns, one for the intercept and one for the slope).
post.betas1 = matrix(NA, nrow = 1000, ncol = 2)
post.betas2 = matrix(NA, nrow = 1000, ncol = 2)
post.betas3 = matrix(NA, nrow = 1000, ncol = 2)
for(i in 1:1000){
post.betas1[i,] = betas[[i]][1,]
post.betas2[i,] = betas[[i]][2,]
post.betas3[i,] = betas[[i]][3,]
}
We can then evaluate numerical posterior summaries; e.g., for \(\boldsymbol{\beta}_1\) we have:
summary(post.betas1)
## V1 V2
## Min. :1.671 Min. :3.883
## 1st Qu.:1.714 1st Qu.:4.190
## Median :1.724 Median :4.264
## Mean :1.724 Mean :4.265
## 3rd Qu.:1.735 3rd Qu.:4.338
## Max. :1.775 Max. :4.630
Also, with some effort we can use generic R functions to summarise the posterior densities in one plot (the slightly tricky part is to set the argument line in mtext appropriately), as demonstrated below.
par(mfrow = c(3,2))
plot(density(post.betas1[,1]), main = '', lwd = 2, ylab = '', xlab = expression(beta[0]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.betas1[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Cluster 1', side = 3, line = -2, outer = TRUE, cex = 1.2)
plot(density(post.betas2[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(beta[0]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.betas2[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Cluster 2', side = 3, line = -14, outer = TRUE, cex = 1.2)
plot(density(post.betas3[,1]), main = '', lwd = 2, ylab = '', xlab = expression(beta[0]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.betas3[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Cluster 3', side = 3, line = -26, outer = TRUE, cex = 1.2)
Of course, we can also use the package ggplot2 (as we did in the Applied Lecture) to produce similar plots.
We can further infer about the remaining parameters \((\sigma_1^2,\sigma_2^2,\sigma_3^2)\) and \((\pi_1,\pi_2,\pi_3)\) under \(\hat{K}=3\). Examining with names() again, we see that the values of the output of fit.reg we are interested in are sigma2 and probs, respectively. The following lines of code give us an idea of what we have to do.
names(fit.reg)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
sigmas = fit.reg$sigma2
dim(sigmas)
## [1] 1000 1
sigmas[[1]]
## [,1]
## [1,] 0.01305736
## [2,] 0.04399147
## [3,] 0.01533432
pi = fit.reg$probs
dim(pi)
## [1] 1000 1
pi[[1]]
## [,1]
## [1,] 5.874946e-01
## [2,] 3.252824e-01
## [3,] 8.722309e-02
## [4,] 7.165844e-13
## [5,] 7.165844e-13
Note: The above results look contradictory; in the first MCMC sample we have three variances but five probabilities. The reason is that both \(\pi_4\) and \(\pi_5\) are, in practical terms, equal to 0, so no parameters were estimated for these clusters.
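A quick way to check this (a sketch, assuming each element of sigma2 and probs is a matrix with one row per cluster) is to tabulate, across iterations, how many variances and how many weights were stored:
table(sapply(fit.reg$sigma2, length))   # number of estimated cluster variances per iteration
table(sapply(fit.reg$probs, length))    # number of stored mixture weights per iteration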
Tasks:
5. Create a matrix with NA entries called post.sigmas of dimensionality \(1000\times 3\). Then fill its columns with the posterior samples of \(\sigma_1^2\), \(\sigma_2^2\) and \(\sigma_3^2\) using a for loop. Finally, use summary() to evaluate numerical posterior summaries, and also par(mfrow = c(1,3)) to plot the posterior densities of the three variances in one plot. Your results and plot should look similar to the ones presented below.
## V1 V2 V3
## Min. :0.009881 Min. :0.03100 Min. :0.005328
## 1st Qu.:0.011791 1st Qu.:0.04707 1st Qu.:0.010810
## Median :0.012411 Median :0.04915 Median :0.012980
## Mean :0.012399 Mean :0.04863 Mean :0.014055
## 3rd Qu.:0.013030 3rd Qu.:0.05102 3rd Qu.:0.015936
## Max. :0.014945 Max. :0.05738 Max. :0.036026
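One possible sketch for Task 5 (the object sigmas was created above; the first three rows of each sample correspond to the three main clusters):
post.sigmas = matrix(NA, nrow = 1000, ncol = 3)
for(i in 1:1000){
  post.sigmas[i, ] = sigmas[[i]][1:3]
}
summary(post.sigmas)
par(mfrow = c(1,3))
plot(density(post.sigmas[,1]), main = 'Cluster 1', lwd = 2, xlab = expression(sigma[1]^2), bty = 'l')
plot(density(post.sigmas[,2]), main = 'Cluster 2', lwd = 2, xlab = expression(sigma[2]^2), bty = 'l')
plot(density(post.sigmas[,3]), main = 'Cluster 3', lwd = 2, xlab = expression(sigma[3]^2), bty = 'l')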
6. Work similarly for the cluster probabilities \(\pi_1\), \(\pi_2\) and \(\pi_3\), using a matrix with NA entries called post.pi. The numerical results and plot should look like the ones below.
## V1 V2 V3
## Min. :0.4762 Min. :0.2419 Min. :0.04576
## 1st Qu.:0.5357 1st Qu.:0.3318 1st Qu.:0.07858
## Median :0.5559 Median :0.3571 Median :0.08810
## Mean :0.5539 Mean :0.3539 Mean :0.08846
## 3rd Qu.:0.5746 3rd Qu.:0.3783 3rd Qu.:0.09859
## Max. :0.6280 Max. :0.4374 Max. :0.12870
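One possible sketch for Task 6, working exactly as in Task 5 but with the object pi created above:
post.pi = matrix(NA, nrow = 1000, ncol = 3)
for(i in 1:1000){
  post.pi[i, ] = pi[[i]][1:3]
}
summary(post.pi)
par(mfrow = c(1,3))
plot(density(post.pi[,1]), main = 'Cluster 1', lwd = 2, xlab = expression(pi[1]), bty = 'l')
plot(density(post.pi[,2]), main = 'Cluster 2', lwd = 2, xlab = expression(pi[2]), bty = 'l')
plot(density(post.pi[,3]), main = 'Cluster 3', lwd = 2, xlab = expression(pi[3]), bty = 'l')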
As seen from the numerical summaries above (based on the means), the DPM regression identifies two relatively dominant clusters, with \(\hat{\pi}_1\approx 0.55\) and \(\hat{\pi}_2\approx 0.35\), and one less important cluster with \(\hat{\pi}_3\approx 0.1\). This, of course, does not reflect the actual distinction between female, infant and male abalone shells, since their distribution within the overall sample is quite balanced, as we see below:
table(Type)/n
## Type
## F I M
## 0.3129040 0.3212832 0.3658128
This does not diminish the value of DPM regression; the main aim of this modelling approach is to infer about conditional densities of \(y~|~x\). Furthermore, it should also be acknowledged as an effective tool for initial identification of the number of clusters in complex datasets.
Suppose now that we want to include a second predictor in the analysis, in addition to Height. As we see from the pairs plot above, the relationship of y with the remaining predictors seems to be more or less the same, so let's just pick Diameter (which is x2). We must first form a matrix, let's call it X, keeping Height as the first variable.
X = cbind(x3,x2)
Now, similarly to what we did before, we must define a grid over x2. It is important to understand that the density evaluations of y will be based on a joint grid of all possible combinations of the values of x3 and x2. We will start with a basic grid in which x2 takes only one value; specifically, its median.
grid_x2 = quantile(x2, probs = c(0.50))
grid_x2
## 50%
## 0.425
Next, we use the command expand.grid, which we saw in the Applied Lecture.
grid_X = expand.grid(grid_x3, grid_x2)
We can now run a DPM multiple regression model, storing it in fit.reg1. Note that we must specify the prior list again, since now we want \(\mathbf{m}_0=(0,0,0)^T\), as well as the output list, using grid_x = grid_X within it.
DPprior
## $strength
## [1] 0.1143963
##
## $discount
## [1] 0
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE, m0 = c(0, 0, 0), a0 = 3, b0 = 100)
output <- list(grid_x = grid_X, grid_y = grid_y, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg1 = PYregression(y = y, x = X, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg1)
## PYdensity function call:
## 1000 burn-in iterations
## 2000 iterations
## Global estimation time: 9.29 seconds
## Average number of groups: 3.034
## Min number of groups: 3 ; Max number of groups: 4
In this case we have evaluated the density over only one value of Diameter, so the dimensionality of the density array remains the same and we can essentially use the same code as above to plot the varying densities of y. Note, of course, that these are different densities, since they are conditional on the values of Height and on the value of Diameter (0.425).
dim(fit.reg1$density)
## [1] 100 5 1000
regplot <- data.frame(
dens = as.vector(apply(fit.reg1$density, c(1, 2), mean)),
qlow = as.vector(apply(fit.reg1$density, c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg1$density, c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x3)),
label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
level = rep(paste("Height = ", grid_x3)))
)
ggplot(regplot) + theme_bw() +
geom_line(data = regplot, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = "red", alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = "log(Age)", y = "Density given Diameter = 0.425")
We can imagine things becoming more complicated if we wish to examine what is going on for different values of x2 on the grid. For instance, below we consider the minimum, median and maximum values of x2.
grid_x2 = quantile(x2, probs = c(0, 0.50, 1))
grid_x2
## 0% 50% 100%
## 0.055 0.425 0.650
grid_X = expand.grid(grid_x3, grid_x2)
output <- list(grid_x = grid_X, grid_y = grid_y, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg1 = PYregression(y = y, x = X, prior = prior, mcmc = mcmc, output = output)
dim(fit.reg1$density)
## [1] 100 15 1000
Now, the 3-dimensional array fit.reg1$density has 15 columns. The first 5 are the combinations of the grid values of x3 with x2 equal to 0.055, the next 5 the combinations with x2 equal to 0.425, and the last 5 the combinations with x2 equal to 0.65. So, we now need to consider all of these and treat each case separately, as follows:
regplot1 <- data.frame(
dens = as.vector(apply(fit.reg1$density[, 1:5,], c(1, 2), mean)),
qlow = as.vector(apply(fit.reg1$density[, 1:5,], c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg1$density[, 1:5,], c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x3)),
label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
level = rep(paste("Height = ", grid_x3)))
)
regplot2 <- data.frame(
dens = as.vector(apply(fit.reg1$density[, 6:10,], c(1, 2), mean)),
qlow = as.vector(apply(fit.reg1$density[, 6:10,], c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg1$density[, 6:10,], c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x3)),
label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
level = rep(paste("Height = ", grid_x3)))
)
regplot3 <- data.frame(
dens = as.vector(apply(fit.reg1$density[, 11:15,], c(1, 2), mean)),
qlow = as.vector(apply(fit.reg1$density[, 11:15,], c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg1$density[, 11:15,], c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x3)),
label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
level = rep(paste("Height = ", grid_x3)))
)
plot1 = ggplot(regplot1) + theme_bw() +
geom_line(data = regplot1, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot1, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = "green", alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = "log(Age)", y = "Diameter = 0.055")
plot2 = ggplot(regplot2) + theme_bw() +
geom_line(data = regplot2, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot2, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = "red", alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = "log(Age)", y = "Diameter = 0.425")
plot3 = ggplot(regplot3) + theme_bw() +
geom_line(data = regplot3, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot3, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = "blue", alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = "log(Age)", y = "Diameter = 0.650")
Finally, we can use the grid.arrange() function to combine the three plots. In this particular example, conditional on Height, the density of log(Age) is not affected by Diameter, as we see no differences in the conditional densities.
grid.arrange(plot1, plot2, plot3, layout_matrix = matrix(1:3, 3, 1))
Apart from this added complexity, the rest of the coding procedure is similar to before, with the only difference being that now we have \(\boldsymbol{\beta}_k=(\beta_{k0},\beta_{k1},\beta_{k2})^T\), for \(k=1,\ldots,K\), where \(\beta_{k1}\) and \(\beta_{k2}\) are the regression coefficients of Height and Diameter under group \(k\).
Task:
7. Produce a plot of the posterior distributions of the \(\boldsymbol{\beta}_k\)’s (for \(K=3\)) using code similar to that presented in 3.1. Done correctly, you should get something like the plot below.
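One possible sketch for Task 7 (assuming, as before, that we infer under \(\hat{K}=3\) and take the first three rows of each posterior sample; the object names post.b1, post.b2 and post.b3 are illustrative):
betas1 = fit.reg1$beta
post.b1 = matrix(NA, nrow = 1000, ncol = 3)   # cluster 1: intercept and two slopes
post.b2 = matrix(NA, nrow = 1000, ncol = 3)   # cluster 2
post.b3 = matrix(NA, nrow = 1000, ncol = 3)   # cluster 3
for(i in 1:1000){
  post.b1[i,] = betas1[[i]][1,]
  post.b2[i,] = betas1[[i]][2,]
  post.b3[i,] = betas1[[i]][3,]
}
par(mfrow = c(3,3))   # one row of panels per cluster
for(b in list(post.b1, post.b2, post.b3)){
  plot(density(b[,1]), main = '', lwd = 2, ylab = '', xlab = expression(beta[0]), bty = 'l')
  plot(density(b[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2)
  plot(density(b[,3]), main = '', lwd = 2, ylab = '', xlab = expression(beta[2]), bty = 'l', col = 3)
}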
As we saw, DPM regression modelling with BNPmix delivers estimates of posterior densities conditional on values of the predictor variable(s). As such, when the main aim is density estimation, working simultaneously with many predictors is not particularly meaningful. Applications typically involve examining whether the conditional density of a given response changes over a range of values of one predictor or a few predictors (say, two or three at most). When we have many predictors, we may choose to select a few that seem more interesting and more relevant to the response. Alternatively, we can examine the densities of the response conditional on each individual predictor (i.e., run many regressions using one predictor at a time), as sketched below.
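A hedged sketch of this last idea (illustrative only; prior1, predictors, fits, xj, grid_xj and outj are names introduced here, and the prior re-specifies m0 with length 2 for an intercept plus one slope):
prior1 = list(strength = DPprior$strength, discount = 0, hyper = FALSE,
              m0 = c(0, 0), a0 = 3, b0 = 100)
predictors = list(LongestShell = x1, Diameter = x2, Height = x3, WholeWeight = x4,
                  ShuckedWeight = x5, VisceraWeight = x6, ShellWeight = x7)
fits = list()
for(name in names(predictors)){
  xj = predictors[[name]]
  grid_xj = round(quantile(xj, probs = c(0.01, 0.25, 0.50, 0.75, 0.99)), 3)   # grid per predictor
  outj = list(grid_y = grid_y, grid_x = grid_xj, out_type = 'FULL', out_param = TRUE)
  set.seed(1)
  fits[[name]] = PYregression(y = y, x = xj, prior = prior1, mcmc = mcmc, output = outj)
}
lapply(fits, summary)   # posterior summary for each one-predictor fit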