Description
One of the most important aspects of regression modelling relates to the topic of variable selection (a.k.a. feature selection). For instance, consider the normal linear regression model with response variable \(\mathbf{y}=(y_1,\ldots,y_n)^T\) and \(p\) potential predictor variables \(\mathbf{x}_1=(x_{11},\ldots,x_{n1})^T,\ldots,\mathbf{x}_p=(x_{1p},\ldots,x_{np})^T\), which assumes that \[y_i \sim \mathrm{N}(\mu_i,\sigma^2), ~~ \mu_i=\mathrm{E}[y_i\mid\mathbf{x}_i]=\beta_0+x_{i1}\beta_1+\ldots+x_{ip}\beta_p,\] where in the Bayesian framework the parameters are assigned prior distributions; i.e., \(\beta_0\sim F(\beta_0)\), \(\beta_j\sim F(\beta_j)\) for \(j\in\{1,\ldots,p\}\) and \(\sigma^2\sim F(\sigma^2)\). The problem of variable selection is then about inferring which of the predictors should be included in the above model. In Bayesian statistics this is essentially a model selection problem, where inference is based on evaluating the Bayes factors in favour of the simplest model \(\mathcal{M}_0\) (containing only the intercept term \(\beta_0\)) against all \(K=2^p\) competing models \(\mathcal{M}_k\), given by \[\mathrm{BF}_{0k}=\frac{f(\mathbf{y}\mid\mathcal{M}_0)}{f(\mathbf{y}\mid\mathcal{M}_k)},\] for \(k\in\{1,\ldots,K\}\), where \(f(\mathbf{y}\mid\cdot)\) is the marginal likelihood function (a.k.a. prior predictive distribution).
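As a concrete illustration, the following is a minimal sketch that enumerates all \(2^p\) models for a small simulated dataset and computes each Bayes factor against \(\mathcal{M}_0\) in closed form. The Zellner g-prior and the resulting closed-form expression (Liang et al., 2008) are illustrative assumptions made here so that the marginal likelihoods are available analytically; they are not prescribed by the project description above.

```python
# A minimal sketch of exhaustive Bayesian variable selection under a
# conjugate design. The g-prior and the closed-form Bayes factor
# (Liang et al., 2008) are illustrative assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

def bayes_factor_vs_null(X_k, y, g):
    """BF of model k against the intercept-only model under a g-prior."""
    n = len(y)
    yc = y - y.mean()
    if X_k.shape[1] == 0:                  # the null model itself
        return 1.0
    Xc = X_k - X_k.mean(axis=0)            # centring profiles out the intercept
    beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    rss = np.sum((yc - Xc @ beta_hat) ** 2)
    r2 = 1.0 - rss / np.sum(yc ** 2)       # usual coefficient of determination
    p_k = X_k.shape[1]
    return (1 + g) ** ((n - 1 - p_k) / 2) / (1 + g * (1 - r2)) ** ((n - 1) / 2)

g = n                                      # unit-information choice of g
for subset in itertools.chain.from_iterable(
        itertools.combinations(range(p), r) for r in range(p + 1)):
    bf = bayes_factor_vs_null(X[:, list(subset)], y, g)
    print(f"model {subset}: BF vs M0 = {bf:.3g}")
```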
The evaluation of these Bayes factors becomes immediately intractable as soon as we step outside convenient conjugate designs; for instance, above we might wish to use a Bayesian shrinkage prior for the \(\beta_j\)'s, such as the Bayesian lasso prior based on the Laplace distribution, which can result in better predictive performance in comparison to a standard conjugate normal prior.
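To see where conjugacy breaks down, here is a hedged one-parameter sketch: under a Laplace prior the marginal likelihood integral has no closed form and must be evaluated numerically. The known-variance model \(y_i \sim \mathrm{N}(\beta x_i, \sigma^2)\) is an assumption made purely so that the integral is one-dimensional and quadrature is feasible.

```python
# Toy illustration of non-conjugacy under a Laplace (Bayesian lasso)
# prior: the marginal likelihood is obtained by one-dimensional
# quadrature. The known-variance model is an illustrative assumption.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, laplace

rng = np.random.default_rng(2)
n, sigma, b = 50, 1.0, 1.0
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=sigma, size=n)

bhat = (x @ y) / (x @ x)                   # least-squares estimate
loglik_hat = norm.logpdf(y, loc=bhat * x, scale=sigma).sum()

def scaled_integrand(beta):
    # likelihood x Laplace prior, rescaled by the maximised likelihood
    # so quadrature works on O(1) values rather than ~1e-30 ones
    loglik = norm.logpdf(y, loc=beta * x, scale=sigma).sum()
    return np.exp(loglik - loglik_hat) * laplace.pdf(beta, scale=b)

val, _ = quad(scaled_integrand, bhat - 5, bhat + 5, points=[bhat])
log_marg_lik = np.log(val) + loglik_hat
print(f"log marginal likelihood (quadrature): {log_marg_lik:.3f}")
```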
General strategies for handling non-conjugate models are the following:
The first is direct estimation of the marginal likelihood. Numerical integration methods can be used as an approach to the problem, but such techniques are of limited use when the sample size and/or the dimensionality of the parameter vector is moderate to large. To this end, there is a lot of research on the development of efficient Monte Carlo estimators, which utilise Markov chain Monte Carlo (MCMC) samples from the posterior distributions of the parameters. There are numerous such approaches, based on importance sampling, bridge sampling, thermodynamic integration, Lebesgue integration theory and the Fourier integral theorem, among others.
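To give a flavour of the Monte Carlo route, the sketch below estimates the log marginal likelihood of the same kind of toy model by importance sampling with a Gaussian proposal. The proposal construction and all tuning constants are illustrative assumptions; in practice the proposal would typically be matched to MCMC output from the posterior.

```python
# A minimal sketch of an importance-sampling estimator of the marginal
# likelihood: a Gaussian proposal is centred on the posterior mode and
# the estimator averages (likelihood x prior) / proposal over draws.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm, laplace

rng = np.random.default_rng(3)
n, sigma, b, S = 50, 1.0, 1.0, 20_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=sigma, size=n)

def log_joint(beta):
    """log p(y | beta) + log p(beta), with a Laplace prior on beta."""
    loglik = norm.logpdf(y[:, None], loc=beta[None, :] * x[:, None],
                         scale=sigma).sum(axis=0)
    return loglik + laplace.logpdf(beta, scale=b)

# Gaussian proposal roughly matched to the posterior; in practice the
# location and scale would come from posterior (MCMC) samples.
bhat = (x @ y) / (x @ x)
sd = sigma / np.sqrt(x @ x)
draws = rng.normal(bhat, 2 * sd, size=S)   # slightly over-dispersed proposal

log_w = log_joint(draws) - norm.logpdf(draws, bhat, 2 * sd)
log_marg_lik = logsumexp(log_w) - np.log(S)
print(f"log marginal likelihood (importance sampling): {log_marg_lik:.3f}")
```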
Generally, direct marginal likelihood estimators are effective and efficient when the number of predictors \(p\) is relatively small, but their use quickly becomes prohibitive for even moderately sized datasets; for instance, when \(p=20\), a direct marginal-likelihood estimation method would require obtaining posterior samples from \(2^{20}=1,048,576\) models! An alternative strategy is to bypass estimation of marginal likelihoods altogether by using trans-dimensional MCMC algorithms specifically tailored to regression settings, such as the Stochastic Search Variable Selection (SSVS) and Gibbs Variable Selection algorithms, among others. These algorithms essentially introduce latent binary indicators for the inclusion of each regression coefficient, and deliver posterior model probabilities as well as a posterior inclusion probability for each coefficient.
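The following is a compact sketch of an SSVS-type Gibbs sampler in the spirit of George and McCulloch (1993), using a two-component normal "spike-and-slab" prior on each coefficient; the hyperparameter values (the spike/slab scales and the IG(1, 1) prior on \(\sigma^2\)) are illustrative assumptions, not recommendations.

```python
# A compact sketch of an SSVS-type Gibbs sampler with a two-component
# normal spike-and-slab prior. Hyperparameter values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

tau, c = 0.05, 100.0                       # spike sd and slab inflation factor
n_iter, burn = 5000, 1000
beta = np.zeros(p)
gamma = np.zeros(p, dtype=int)
sigma2 = 1.0
incl = np.zeros(p)                         # running inclusion counts

for it in range(n_iter):
    # 1. beta | gamma, sigma2, y  ~  multivariate normal
    d = np.where(gamma == 1, (c * tau) ** 2, tau ** 2)
    A = X.T @ X / sigma2 + np.diag(1.0 / d)
    cov = np.linalg.inv(A)
    mean = cov @ X.T @ y / sigma2
    beta = rng.multivariate_normal(mean, cov)

    # 2. sigma2 | beta, y  ~  inverse-gamma (IG(1, 1) prior assumed)
    resid = y - X @ beta
    sigma2 = 1.0 / rng.gamma(1.0 + n / 2, 1.0 / (1.0 + resid @ resid / 2))

    # 3. gamma_j | beta_j  ~  Bernoulli, comparing spike vs slab densities
    log_slab = -0.5 * (beta / (c * tau)) ** 2 - np.log(c * tau)
    log_spike = -0.5 * (beta / tau) ** 2 - np.log(tau)
    prob = 1.0 / (1.0 + np.exp(log_spike - log_slab))   # prior odds 1:1
    gamma = (rng.uniform(size=p) < prob).astype(int)

    if it >= burn:
        incl += gamma

print("posterior inclusion probabilities:", incl / (n_iter - burn))
```

Averaging the sampled indicators after burn-in yields the posterior inclusion probability of each predictor, so a single MCMC run explores the model space without ever visiting all \(2^p\) models explicitly.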
Projects on this general topic can take several paths; for instance, investigating in detail one of the aforementioned strategies, or covering both. There are also several related and nested topics which are of interest, including: (i) using shrinkage priors, (ii) considering objective priors with desirable asymptotic selection properties, and (iii) considering generalised linear models and other models for which there are no conjugate designs by default. Regardless of the specific direction, in this project you will have the opportunity to obtain a deeper understanding of the Bayesian variable selection framework.
Prerequisites
Bayesian Computation and Modelling III
In general, a good understanding of Bayesian statistics and good programming skills.
Some resources
Feel free to email me at konstantinos.perrakis@durham.ac.uk if you have any questions.