1 Abalone dataset

Abalone shell

Today we will work with the abalone dataset from the R package AppliedPredictiveModeling. So, first let's install and load the package, and also load the other packages we will require.

install.packages('AppliedPredictiveModeling') # no reason to execute this if already installed
library(AppliedPredictiveModeling)
library(BNPmix)
library(ggplot2)
library(gridExtra)

To see the datasets contained in package AppliedPredictiveModeling we can use:

data(package = 'AppliedPredictiveModeling')

Now, let's import the abalone data.

data(abalone)
class(abalone)
## [1] "data.frame"
dim(abalone)
## [1] 4177    9
head(abalone)
##   Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight
## 1    M        0.455    0.365  0.095      0.5140        0.2245        0.1010
## 2    M        0.350    0.265  0.090      0.2255        0.0995        0.0485
## 3    F        0.530    0.420  0.135      0.6770        0.2565        0.1415
## 4    M        0.440    0.365  0.125      0.5160        0.2155        0.1140
## 5    I        0.330    0.255  0.080      0.2050        0.0895        0.0395
## 6    I        0.425    0.300  0.095      0.3515        0.1410        0.0775
##   ShellWeight Rings
## 1       0.150    15
## 2       0.070     7
## 3       0.210     9
## 4       0.155    10
## 5       0.055     7
## 6       0.120     8
n = nrow(abalone)

Prior to proceeding to any analysis, it is important to check whether a dataset contains missing values (NA entries); we can do this as follows.

sum(is.na(abalone))
## [1] 0

The result is zero, which is good as it means this dataset has no missing values. If the result were not zero, we would remove the rows that contain missing values by running abalone = na.omit(abalone).

Typing ?abalone will open the help window. There we read that the data consist of measurements of the type (male, female and infant), the longest shell measurement, the diameter, height and several weights (whole, shucked, viscera and shell). The outcome is the number of rings. The age of the abalone is the number of rings plus 1.5.

The command attach comes in very handy, as we can use it to extract the columns (with their names) of any data.frame. Let's consider the continuous measurements as potential predictors and call them x1 to x7 for simplicity.

attach(abalone)
x1 = LongestShell
x2 = Diameter 
x3 = Height
x4 = WholeWeight 
x5 = ShuckedWeight 
x6 = VisceraWeight
x7 = ShellWeight

We will consider the age of the abalones as the response variable, which, as stated in the description, is the number of rings plus 1.5, so:

Age = Rings + 1.5
y = Age
hist(y, xlab = 'Abalone age', col = 51, main = '')

Next, we can have a look at plots of the predictors vs. the response. Here, we can take advantage of the additional information we have from the variable Type, which is a factor with 3 levels: F (female), I (infant) and M (male), and colour the data points according to Type.

par(mfrow = c(3,3))
plot(LongestShell, Age, col = Type, bty = 'l')
plot(Diameter,Age, col = Type, bty = 'l')
plot(Height,Age, col = Type, bty = 'l')
plot(WholeWeight,Age, col = Type, bty = 'l')
plot(ShuckedWeight,Age, col = Type, bty = 'l')
plot(VisceraWeight,Age, col = Type, bty = 'l')
plot(ShellWeight,Age, col = Type, bty = 'l')

To find out which categories the colours correspond to, we can type

levels(Type)
## [1] "F" "I" "M"

which means that the first category F is col = 1 (black), I is col = 2 (red) and M is col = 3 (green). Note that if this were a clustering problem without knowing Type, it would be quite complicated; the distinction between I and the rest is relatively obvious, but the distinction between F and M is pretty hard (only Age and Height would probably give some indication). Let's also plot the pairs of predictors, just out of curiosity:

pairs(abalone[ , -c(1, 9)], col = Type)

Difficult clustering problem indeed!

2 DPM regression models in BNPmix

In the regression setting we have the response \(\mathbf{y} = (y_1,\ldots,y_n)^T\) and the \(n\times(p+1)\) matrix of predictors \(\mathbf{X}_0\), where its first column is equal to a vector of 1’s. BNPmix exploits the stick-breaking formulation (and uses Gibbs samplers on top of that)
\[\begin{equation} p(y_i ~|~ \mathbf{x}_{i}) = \sum_{k=1}^\infty \pi_k \mathrm{N}(y_i~|~\mathbf{x}_{0i}^T\boldsymbol{\beta}_k, \sigma^{2}_k), \end{equation}\] with prior distributions for \(\boldsymbol{\beta}_k\) and \(\sigma_k^2\) based on a conditionally conjugate multivariate normal/inverse-gamma centering measure, such that \[\boldsymbol{\beta}_k\sim G_0(\boldsymbol{\beta}_k)\equiv \mathrm{N}_{p+1}(\mathbf{m}_0, \mathbf{S}_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0),\] where \(a_0\) and \(b_0\) are the shape and scale parameters of the inverse-gamma distribution, respectively. The default options in BNPmix are: \[\mathbf{m}_0 = (\bar{y}, 0,\ldots,0)^T, ~\mathbf{S}_0 = \mathrm{diag}(100,\ldots,100),~ a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] As seen, we have a data-dependent prior for the intercept (\(\beta_0\)), since the mean of the normal is \(\bar{y}\), and also for the variance, since \(b_0\) is set equal to the sample variance of the response.
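
As a quick sanity check, these data-dependent defaults can be reproduced directly in R (a minimal sketch; b0_default and m0_default are just illustrative names, and y is the response defined in Section 1):

b0_default = var(y)          # var() computes sum((y - mean(y))^2)/(n - 1), the default b0
m0_default = c(mean(y), 0)   # default m0 for a single-predictor model: (ybar, 0)
b0_default
m0_default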

Exactly as before, we can also extend the model to a hierarchical DPM regression model. In this case BNPmix facilitates the use of the following hyper-priors \[ \mathbf{m}_0 ~|~\mathbf{S}_0\sim \mathrm{N}_{p+1}(\mathbf{m}_1,k_1^{-1}\mathbf{S}_0), ~ \mathbf{S}_0 \sim \mathrm{IW}(\mathbf{S}_1,\nu_1), ~ b_0 \sim \mathrm{Gamma}(\tau_1,\zeta_1).\]

Again, BNPmix does not allow placing a hyper-prior on \(a_0\). The default options for a hierarchical DPM regression model are: \[\mathbf{m}_1 = (\bar{y}, 0,\ldots,0)^T, ~ k_1 =1,~\mathbf{S}_1 = \mathrm{diag}(100,\ldots,100), ~\nu_1 = 1, ~\tau_1 =\zeta_1 = 1.\] Let us examine the prior list object with all of its relevant arguments:
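
The full list of arguments accepted by the prior list (together with their defaults) is documented in the package help, so a quick way to examine them is the following (a minimal sketch; the commented specification only uses argument names that also appear later in this document):

?PYregression   # help page describing all elements accepted by the prior list
# For example, a non-hierarchical specification takes the form
# prior = list(strength = ..., discount = 0, hyper = FALSE, m0 = ..., a0 = ..., b0 = ...),
# while setting hyper = TRUE activates the hyper-priors described above.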

3 Applications of DPM regression models to Abalone dataset

It is important to understand that regression models essentially fit a different distribution to each \(y_i\) conditional on \(\mathbf{x}_{0i}\). In standard normal linear regression the distribution is always normal, symmetrically centered around \(\mathbf{x}_{0i}^T\hat{\boldsymbol{\beta}}\). DPM regressions essentially do the same, but the conditional distribution can be any distribution and can take different forms/shapes across the data points (so they are very flexible regression models).

3.1 Regression with one predictor

Let's start with one predictor variable, choosing the variable Height (x3). The function for DPM regression is PYregression, and its functionality is essentially the same as that of PYdensity, since it depends on PYcalibrate and the lists prior, mcmc and output. Below we set some initial specifications, like Ek = 3 and niter (it is recommended to keep niter small, at least in initial exploratory runs), and so forth.

DPprior = PYcalibrate(Ek = 3, n = length(y), discount = 0)
DPprior
## $strength
## [1] 0.2329552
## 
## $discount
## [1] 0
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
mcmc = list(niter = 2000, nburn = 1000)
grid_y = seq(min(y), max(y), length.out = 100)

Above we set the length of grid_y to be equal to 100; this is okay in this case since \(\mathbf{y}\) is univariate (similarly to univariate DPM density estimation).

Now we must consider the grid over x3, and this is important. As previously mentioned, the DPM regression model will provide a distribution of \(y_i\) given each different \(x_{i3}\). So, first, it would be too much to ask of BNPmix to set grid_x3 = seq(min(x3), max(x3), length.out = 100)! Second, even if we could do that, the results would be too chaotic to summarise! So, what we want to do with this type of regression is to set the grid based on specific values of \(x\) that might be of interest. For instance, we can use the quartiles of \(\mathbf{x}_3\). In R these are given by the command quantile:

quantile(x3)
##    0%   25%   50%   75%  100% 
## 0.000 0.115 0.140 0.165 1.130

The above correspond to the minimum, 1st quartile, median, 3rd quartile and maximum, respectively. Using the minimum and maximum seems a bit awkward, so below we will replace them with the 1% and 99% quantiles, respectively.

grid_x3 = quantile(x3, probs = c(0.01, 0.25, 0.50, 0.75, 0.99))
grid_x3 = round(grid_x3, 3)
grid_x3
##    1%   25%   50%   75%   99% 
## 0.045 0.115 0.140 0.165 0.220

We can now have our first try with PYregression; note that with this function it is important to specify out_type = 'FULL' in the output list.

output = list(grid_y = grid_y, grid_x = grid_x3, out_type = 'FULL',  out_param = TRUE)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
## Completed:   200/2000 - in 2.422 sec
## Completed:   400/2000 - in 4.764 sec
## Completed:   600/2000 - in 7.413 sec
## Completed:   800/2000 - in 10.422 sec
## Completed:   1000/2000 - in 13.05 sec
## Completed:   1200/2000 - in 15.378 sec
## Completed:   1400/2000 - in 17.668 sec
## Completed:   1600/2000 - in 20.931 sec
## Completed:   1800/2000 - in 25.113 sec
## Completed:   2000/2000 - in 30.541 sec
## 
## Estimation done in 30.541 seconds
summary(fit.reg)
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 30.54 seconds
##  Average number of groups:  6.889 
##  Min number of groups:  5 ; Max number of groups:  12

We managed a first attempt at fitting a DPM regression. The results look a bit complex, with minimum and maximum numbers of clusters/groups equal to 5 and 12, respectively, and an average number around 7. Here we could try tweaking several things, like strength, Ek, hyper and the various prior options, but the results would likely not change drastically. A far simpler and more effective approach is to first reconsider the data visualised above, where we plotted Age against the available predictor variables, and then think of a transformation of \(y\). As we see in that plot, normal linear regression would fail in general (for all covariates), not so much due to non-linearity (although some of this is also evident), but mainly due to unequal variances across the \(x\)-axes. So, let's reduce the variance of the response and bring its scale closer to the scales of the predictors.

Tasks:

1. Set y to be the logarithm of Age and reproduce the second plot presented above. If done correctly, the plot should look similar to the one presented below.
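
One possible way to do this (a minimal sketch, reusing the plotting code from Section 1):

y = log(Age)    # log-transformed response
par(mfrow = c(3,3))
plot(LongestShell, y, col = Type, bty = 'l', ylab = 'log(Age)')
plot(Diameter, y, col = Type, bty = 'l', ylab = 'log(Age)')
plot(Height, y, col = Type, bty = 'l', ylab = 'log(Age)')
plot(WholeWeight, y, col = Type, bty = 'l', ylab = 'log(Age)')
plot(ShuckedWeight, y, col = Type, bty = 'l', ylab = 'log(Age)')
plot(VisceraWeight, y, col = Type, bty = 'l', ylab = 'log(Age)')
plot(ShellWeight, y, col = Type, bty = 'l', ylab = 'log(Age)')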

We have indeed decreased the variance of the response somewhat.

2. Refit fit.reg (take care to specify grid_y and output again, as the scale of the response has now changed). Use summary() to see whether there are any changes.
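
A possible sketch of the refit (assuming y = log(Age) from Task 1; prior and mcmc are unchanged):

grid_y = seq(min(y), max(y), length.out = 100)   # new grid on the log scale
output = list(grid_y = grid_y, grid_x = grid_x3, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg)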

## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 15.21 seconds
##  Average number of groups:  4.067 
##  Min number of groups:  4 ; Max number of groups:  6

Done correctly, the result from summary should be as above. It looks like we are heading in a better, simpler direction through the transformation of the response.

3. Try to lower the resulting average number of clusters a little bit by decreasing the prior expected number of clusters to 2. If the change is implemented correctly, you should see the following output.
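
One possible implementation (a minimal sketch):

DPprior = PYcalibrate(Ek = 2, n = length(y), discount = 0)
DPprior
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg)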

## $strength
## [1] 0.1143963
## 
## $discount
## [1] 0
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 8.85 seconds
##  Average number of groups:  3.074 
##  Min number of groups:  3 ; Max number of groups:  4

4. Given all the previous specifications, remove the default data-dependent assumptions and set \(\mathbf{m}_0=(0,0)^T\), \(a_0 = 3\) and \(b_0 = 100\). In this way the prior expectation and variance of \(\sigma_k^2\) are \(\mathbb{E}[\sigma_k^2] = b_0/(a_0-1)=50\) and \(\mathbb{V}\mathrm{ar}[\sigma_k^2] = b_0^2/[(a_0-1)^2(a_0-2)]=2500\), respectively. Use summary() and also calculate the posterior probabilities for the number of clusters (we have previously seen how to do this). The results should be as below.
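
A possible sketch (the last two lines assume, as in the earlier labs, that fit.reg$clust stores the cluster allocations of the observations with one row per saved iteration):

prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE, m0 = c(0, 0), a0 = 3, b0 = 100)
set.seed(1)
fit.reg = PYregression(y = y, x = x3, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg)
n.cluster = apply(fit.reg$clust, 1, function(x) length(unique(x)))  # number of groups per iteration
table(n.cluster)/length(n.cluster)                                  # posterior probabilities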

## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 9.75 seconds
##  Average number of groups:  3.147 
##  Min number of groups:  3 ; Max number of groups:  5
## n.cluster
##     3     4     5 
## 0.855 0.143 0.002

Not bad so far! We have gradually managed to set up a DPM regression model whose prior is not data-dependent and which actually gives high posterior probability to the existence of 3 clusters! Note that no information from Type was used in the actual model (admittedly, at least as a direct input).

Plotting posterior conditional densities

Visualising posterior densities is a bit tricky for DPM regression models, as it requires a relatively good understanding of data.frame and ggplot, but the R code below can be used as a template. The vital information is stored in the value density of our object fit.reg.

class(fit.reg$density)
## [1] "array"
dim(fit.reg$density)
## [1]  100    5 1000

As we see, fit.reg$density is a 3-dimensional array containing 1000 (the number of posterior samples) matrices with 100 rows (the grid length of y) and 5 columns (the grid length of x3). So, in the regplot data frame created below, dens contains the mean density evaluations, and qlow and qupp the lower and upper limits of the 95% credible band. The column grid sets appropriately the grid for the plots, while the last column label sets the correct names for the plots that we produce next.

regplot = data.frame(
  dens = as.vector(apply(fit.reg$density, c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg$density, c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg$density, c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x3)),
  label = factor(rep(paste('Height = ', grid_x3), each = length(grid_y)),
                 level = rep(paste('Height = ', grid_x3)))
)

We can now use ggplot syntax to produce a nice plot of the posterior conditional densities of \(y\) (logarithm of age) for \(x_3 \in\{0.045, 0.115, 0.140, 0.165, 0.220\}\) (height). As we see, the density of the response varies within the range of the predictor variable. We also see that the credible bands are generally very narrow for non-extreme values of Height; this is due to the large sample size (\(n= 4177\)). Of course, in the extreme regions of Height we have far fewer observations and, thus, more uncertainty (this is reflected by the wider posterior credible bands).

ggplot(regplot) + theme_bw() +
  geom_line(data = regplot, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot, map = aes(x = grid, ymin = qlow,
                                        ymax = qupp), fill = 'blue', alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = 'log(Age)', y = 'Density')

Posterior summaries and plots

Similarly to what we have seen in previous applications, we can detect objects of interest by examining the output of fit.reg via the command names(). For instance, we see below the value beta, which contains the \(\boldsymbol{\beta}_k=(\beta_{k0}, \beta_{k1})^T\) posterior samples for \(k=1,\ldots,K\), where \(K\) is the number of clusters of the model visited at each iteration (here \(K\in\{3,4,5\}\), as seen above).

names(fit.reg)
##  [1] "density"    "data"       "grideval"   "grid_x"     "grid_y"    
##  [6] "clust"      "mean"       "beta"       "sigma2"     "probs"     
## [11] "niter"      "nburn"      "tot_time"   "univariate" "regression"
## [16] "dep"        "group_log"  "group"      "wvals"
betas = fit.reg$beta
length(betas)         # 1000 posterior samples
## [1] 1000
betas[[1]]            # The 1st sample for the 3 clusters (1st column for intercept, 2nd for slope)
##          [,1]        [,2]
## [1,] 1.680566  4.52727815
## [2,] 1.547222  7.20887658
## [3,] 2.461443 -0.02907869

We can store these in 3 matrices (since we are interested in inferring under the MAP estimate \(\hat{K} = 3\)) of appropriate dimensionality (1000 rows for the MCMC samples and 2 columns: one for the intercept and one for the slope).

post.betas1 = matrix(NA, nrow = 1000, ncol = 2)
post.betas2 = matrix(NA, nrow = 1000, ncol = 2)
post.betas3 = matrix(NA, nrow = 1000, ncol = 2)

for(i in 1:1000){
  post.betas1[i,] = betas[[i]][1,]
  post.betas2[i,] = betas[[i]][2,]
  post.betas3[i,] = betas[[i]][3,]
}

We can then evaluate numerical posterior summaries; e.g., for \(\boldsymbol{\beta}_1\) we have:

summary(post.betas1)
##        V1              V2       
##  Min.   :1.671   Min.   :3.883  
##  1st Qu.:1.714   1st Qu.:4.190  
##  Median :1.724   Median :4.264  
##  Mean   :1.724   Mean   :4.265  
##  3rd Qu.:1.735   3rd Qu.:4.338  
##  Max.   :1.775   Max.   :4.630

Also, with some effort we can use generic R functions to summarise the posterior densities in one plot (the slightly tricky part is to set appropriately the argument line in mtext) as demonstrated below.

par(mfrow = c(3,2))
plot(density(post.betas1[,1]), main = '', lwd = 2, ylab = '', xlab = expression(beta[0]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.betas1[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Cluster 1', side = 3, line = -2, outer = TRUE, cex = 1.2)
plot(density(post.betas2[,1]), main = '', lwd = 2, ylab = 'Posterior density', xlab = expression(beta[0]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.betas2[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Cluster 2', side = 3, line = -14, outer = TRUE, cex = 1.2)
plot(density(post.betas3[,1]), main = '', lwd = 2, ylab = '', xlab = expression(beta[0]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.betas3[,2]), main = '', lwd = 2, ylab = '', xlab = expression(beta[1]), bty = 'l', col = 2, cex.lab = 1.3)
mtext('Cluster 3', side = 3, line = -26, outer = TRUE, cex = 1.2)

Of course, we can also use package ggplot2 (as we did in the Applied Lecture) to produce similar plots.
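
For instance, a possible ggplot2 sketch for the cluster-1 coefficients (the data frame b1 and its column names are purely illustrative):

b1 = data.frame(value = c(post.betas1[,1], post.betas1[,2]),
                coef = rep(c('beta0', 'beta1'), each = 1000))
ggplot(b1, aes(x = value)) + theme_bw() +
  geom_density(fill = 'blue', alpha = 0.3) +
  facet_wrap(~coef, scales = 'free') +
  labs(x = '', y = 'Posterior density', title = 'Cluster 1')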

We can further infer about the remaining parameters \((\sigma_1^2,\sigma_2^2,\sigma_3^2)\) and \((\pi_1,\pi_2,\pi_3)\) under \(\hat{K}=3\). Examining with names() again, we understand that the values of the output of fit.reg we are interested in are sigma2 and probs, respectively. The following lines of code give us an idea of what we have to do.

names(fit.reg)
##  [1] "density"    "data"       "grideval"   "grid_x"     "grid_y"    
##  [6] "clust"      "mean"       "beta"       "sigma2"     "probs"     
## [11] "niter"      "nburn"      "tot_time"   "univariate" "regression"
## [16] "dep"        "group_log"  "group"      "wvals"
sigmas = fit.reg$sigma2
dim(sigmas)
## [1] 1000    1
sigmas[[1]]
##            [,1]
## [1,] 0.01305736
## [2,] 0.04399147
## [3,] 0.01533432
pi = fit.reg$probs
dim(pi)
## [1] 1000    1
pi[[1]]
##              [,1]
## [1,] 5.874946e-01
## [2,] 3.252824e-01
## [3,] 8.722309e-02
## [4,] 7.165844e-13
## [5,] 7.165844e-13

Note: The above results may seem contradictory; in the first MCMC sample we have three variances but five probabilities. The reason is that both \(\pi_4\) and \(\pi_5\) are, in practical terms, equal to 0, so no parameters were estimated for these clusters.

Tasks:

5. Create a matrix with entries NA called post.sigmas of dimensionality \(1000\times 3\). Then, fill in the posterior samples of \(\sigma_1^2\), \(\sigma_2^2\) and \(\sigma_3^2\) in the columns of the matrix using a for loop. Finally, use summary() to evaluate numerical posterior summaries and par(mfrow = c(1,3)) to plot the posterior densities of the three variances in one plot. Your results and plot should look similar to the ones presented below.
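
A possible sketch of the filling-in step (mirroring the loop used above for the betas; each sigmas[[i]] may contain more than three values, so we keep only the first three):

post.sigmas = matrix(NA, nrow = 1000, ncol = 3)
for(i in 1:1000){
  post.sigmas[i,] = sigmas[[i]][1:3]
}
summary(post.sigmas)
par(mfrow = c(1,3))
plot(density(post.sigmas[,1]), main = 'Cluster 1', xlab = expression(sigma[1]^2), bty = 'l')
plot(density(post.sigmas[,2]), main = 'Cluster 2', xlab = expression(sigma[2]^2), bty = 'l')
plot(density(post.sigmas[,3]), main = 'Cluster 3', xlab = expression(sigma[3]^2), bty = 'l')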

##        V1                 V2                V3          
##  Min.   :0.009881   Min.   :0.03100   Min.   :0.005328  
##  1st Qu.:0.011791   1st Qu.:0.04707   1st Qu.:0.010810  
##  Median :0.012411   Median :0.04915   Median :0.012980  
##  Mean   :0.012399   Mean   :0.04863   Mean   :0.014055  
##  3rd Qu.:0.013030   3rd Qu.:0.05102   3rd Qu.:0.015936  
##  Max.   :0.014945   Max.   :0.05738   Max.   :0.036026

6. Work similarly for the cluster probabilities \(\pi_1\), \(\pi_2\) and \(\pi_3\), using a matrix with entries NA called post.pi. The numerical results and plot should look like the ones below.
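
The analogous sketch for the cluster probabilities (again keeping only the first three entries of each pi[[i]]):

post.pi = matrix(NA, nrow = 1000, ncol = 3)
for(i in 1:1000){
  post.pi[i,] = pi[[i]][1:3]
}
summary(post.pi)
par(mfrow = c(1,3))
plot(density(post.pi[,1]), main = 'Cluster 1', xlab = expression(pi[1]), bty = 'l')
plot(density(post.pi[,2]), main = 'Cluster 2', xlab = expression(pi[2]), bty = 'l')
plot(density(post.pi[,3]), main = 'Cluster 3', xlab = expression(pi[3]), bty = 'l')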

##        V1               V2               V3         
##  Min.   :0.4762   Min.   :0.2419   Min.   :0.04576  
##  1st Qu.:0.5357   1st Qu.:0.3318   1st Qu.:0.07858  
##  Median :0.5559   Median :0.3571   Median :0.08810  
##  Mean   :0.5539   Mean   :0.3539   Mean   :0.08846  
##  3rd Qu.:0.5746   3rd Qu.:0.3783   3rd Qu.:0.09859  
##  Max.   :0.6280   Max.   :0.4374   Max.   :0.12870

As seen from the numerical summaries above (based on the means), the DPM regression identifies two relatively dominant clusters, with \(\hat{\pi}_1\approx 0.55\) and \(\hat{\pi}_2\approx 0.35\), and one less prominent cluster with \(\hat{\pi}_3\approx 0.1\). This, of course, does not reflect the actual distinction across female, infant and male abalone shells, since their distribution within the overall sample is quite balanced, as we see below

table(Type)/n
## Type
##         F         I         M 
## 0.3129040 0.3212832 0.3658128

This does not diminish the value of DPM regression; the main aim of this modelling approach is to infer about conditional densities of \(y~|~x\). Furthermore, it should also be acknowledged as an effective tool for initial identification of the number of clusters in complex datasets.

3.2 Regression with two predictors

Suppose now that we want to include a second predictor in the analysis, in addition to Height. As we see from the pairs plot above, the relationship of y with the remaining predictors seems to be more or less the same, so let's just pick Diameter (which is x2). We must first form a matrix; let's call it X and keep Height as the first variable.

X = cbind(x3,x2)

Now, similarly to what we did before, we must define a grid over x2. It is important to understand that the density evaluations of y will be based on a joint grid of all possible combinations of the values of x3 and x2. We will start with a basic grid in which x2 takes only one value; specifically, its median.

grid_x2 = quantile(x2, probs = c(0.50))
grid_x2
##   50% 
## 0.425

Next, we use the command expand.grid which we saw in the Applied Lecture.

grid_X = expand.grid(grid_x3, grid_x2)

We can now run a DPM multiple regression model, storing it in fit.reg1. Note that we must specify again the prior list, since now we want \(\mathbf{m}_0=(0,0,0)^T\), as well as the output list using grid_x = grid_X within it.

DPprior
## $strength
## [1] 0.1143963
## 
## $discount
## [1] 0
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE, m0 = c(0, 0, 0), a0 = 3, b0 = 100)
output <- list(grid_x = grid_X, grid_y = grid_y, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg1 = PYregression(y = y, x = X, prior = prior, mcmc = mcmc, output = output)
summary(fit.reg1)
## PYdensity function call:
##  1000    burn-in iterations
##  2000    iterations 
##  Global estimation time: 9.29 seconds
##  Average number of groups:  3.034 
##  Min number of groups:  3 ; Max number of groups:  4

In this case we have evaluated the density over only one value of Diameter, so the dimensionality of the density array remains the same and we can essentially use the same code as above to plot the varying densities of y. Note, of course, that these are different densities, since they are conditional on the values of Height and the value of Diameter (0.425).

dim(fit.reg1$density)
## [1]  100    5 1000
regplot <- data.frame(
  dens = as.vector(apply(fit.reg1$density, c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg1$density, c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg1$density, c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x3)),
  label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
                 level = rep(paste("Height = ", grid_x3)))
)

ggplot(regplot) + theme_bw() +
  geom_line(data = regplot, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot, map = aes(x = grid, ymin = qlow,
                                        ymax = qupp), fill = "red", alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = "log(Age)", y = "Density given Diameter = 0.425")

We can imagine things becoming more complicated if we wish to examine what is going on for different values of x2 on the grid. For instance, below we consider the minimum, median and maximum values of x2.

grid_x2 = quantile(x2, probs = c(0, 0.50, 1))
grid_x2
##    0%   50%  100% 
## 0.055 0.425 0.650
grid_X = expand.grid(grid_x3, grid_x2)

output <- list(grid_x = grid_X, grid_y = grid_y, out_type = 'FULL', out_param = TRUE)
set.seed(1)
fit.reg1 = PYregression(y = y, x = X, prior = prior, mcmc = mcmc, output = output)
dim(fit.reg1$density)
## [1]  100   15 1000

Now, the 3-dimensional array fit.reg1$density has 15 columns. The first 5 are the combinations of the grid values of x3 with x2 equal to 0.055, the next 5 the combinations with x2 equal to 0.425, and the last 5 the combinations with x2 equal to 0.65. So, now, we need to consider all of these and treat each case separately as follows:

regplot1 <- data.frame(
  dens = as.vector(apply(fit.reg1$density[, 1:5,], c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg1$density[, 1:5,], c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg1$density[, 1:5,], c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x3)),
  label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
                 level = rep(paste("Height = ", grid_x3)))
)

regplot2 <- data.frame(
  dens = as.vector(apply(fit.reg1$density[, 6:10,], c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg1$density[, 6:10,], c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg1$density[, 6:10,], c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x3)),
  label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
                 level = rep(paste("Height = ", grid_x3)))
)

regplot3 <- data.frame(
  dens = as.vector(apply(fit.reg1$density[, 11:15,], c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg1$density[, 11:15,], c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg1$density[, 11:15,], c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x3)),
  label = factor(rep(paste("Height = ", grid_x3), each = length(grid_y)),
                 level = rep(paste("Height = ", grid_x3)))
)

plot1 = ggplot(regplot1) + theme_bw() +
  geom_line(data = regplot1, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot1, map = aes(x = grid, ymin = qlow,
                                         ymax = qupp), fill = "green", alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = "log(Age)", y = "Diameter = 0.055")

plot2 = ggplot(regplot2) + theme_bw() +
  geom_line(data = regplot2, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot2, map = aes(x = grid, ymin = qlow,
                                         ymax = qupp), fill = "red", alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = "log(Age)", y = "Diameter = 0.425")

plot3 = ggplot(regplot3) + theme_bw() +
  geom_line(data = regplot3, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot3, map = aes(x = grid, ymin = qlow,
                                         ymax = qupp), fill = "blue", alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = "log(Age)", y = "Diameter = 0.650")

Finally, we can use the grid.arrange() function to combine the three plots. In this particular example, conditional on Height, the density of log(Age) is not affected by Diameter, as we see no differences in the conditional densities.

grid.arrange(plot1, plot2, plot3, layout_matrix = matrix(1:3, 3, 1))

Apart from this added complexity, the rest of the coding procedure is similar to before, with the only difference being that now we have \(\boldsymbol{\beta}_k=(\beta_{k0},\beta_{k1},\beta_{k2})^T\), for \(k=1,\ldots,K\), where \(\beta_{k1}\) and \(\beta_{k2}\) are the regression coefficients of Height and Diameter under group \(k\).

Task:

7. Produce a plot of the posterior distributions of the \(\boldsymbol{\beta}_k\)’s (for \(K=3\)) using a similar code to the one presented in 3.1. Done correctly, you should get something like the plot below.
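
A possible sketch, assuming (as with fit.reg) that each element of fit.reg1$beta is a matrix with one row per cluster and one column per coefficient (intercept, Height, Diameter); the object names post1, post2 and post3 are illustrative:

betas1 = fit.reg1$beta
post1 = matrix(NA, nrow = 1000, ncol = 3)
post2 = matrix(NA, nrow = 1000, ncol = 3)
post3 = matrix(NA, nrow = 1000, ncol = 3)
for(i in 1:1000){
  post1[i,] = betas1[[i]][1,]
  post2[i,] = betas1[[i]][2,]
  post3[i,] = betas1[[i]][3,]
}
par(mfrow = c(3,3))
plot(density(post1[,1]), main = 'Cluster 1', lwd = 2, xlab = expression(beta[0]), bty = 'l')
plot(density(post1[,2]), main = '', lwd = 2, xlab = expression(beta[1]), bty = 'l')
plot(density(post1[,3]), main = '', lwd = 2, xlab = expression(beta[2]), bty = 'l')
plot(density(post2[,1]), main = 'Cluster 2', lwd = 2, xlab = expression(beta[0]), bty = 'l')
plot(density(post2[,2]), main = '', lwd = 2, xlab = expression(beta[1]), bty = 'l')
plot(density(post2[,3]), main = '', lwd = 2, xlab = expression(beta[2]), bty = 'l')
plot(density(post3[,1]), main = 'Cluster 3', lwd = 2, xlab = expression(beta[0]), bty = 'l')
plot(density(post3[,2]), main = '', lwd = 2, xlab = expression(beta[1]), bty = 'l')
plot(density(post3[,3]), main = '', lwd = 2, xlab = expression(beta[2]), bty = 'l')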

4 Final comments

As we saw, DPM regression modelling with BNPmix delivers estimates of posterior densities conditional on values of the predictor variable(s). As such, when the main aim is density estimation, working simultaneously with many predictors is not particularly meaningful. Applications typically involve examining whether the conditional density of a given response changes over a range of values of one predictor or a few predictors (say, two or three at most). When we have many predictors, we may choose to select a few which seem more interesting and more relevant to the response. Alternatively, we can examine the densities of the response conditional on each individual predictor (i.e., run many regressions using one predictor at a time).