Best Subset Selection: an illustrative example
Suppose we have three predictors \(X_1\), \(X_2\) and \(X_3\). In this case best subset selection (BSS) works as follows: \[\begin{aligned} M_0: & \mbox{~intercept~only~(null~model)}\\ C_1: & \begin{cases} X_1\\ {\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_1: \mbox{~make~this~} {\color{ggreen} M_1}\\ X_3\\ \end{cases}\\ C_2: & \begin{cases} {\color{blue}X_1, X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_2: \mbox{~make~this~} {\color{blue} M_2}\\ X_1, X_3\\ X_2, X_3 \end{cases}\\ M_3: & ~X_1, X_2, X_3 \mbox{~ (full~model)} \end{aligned}\]
We then choose one model among \(M_0, {\color{ggreen} M_1}, {\color{blue} M_2}, M_3\) either via validation/cross-validation or based on \(C_p\) (AIC), BIC, or adjusted \(R^2\).
Best Subset Selection: the general algorithm
1. Let \(M_0\) denote the null model which contains no predictors. This model simply predicts the sample mean of the response for each observation.
2. For \(k = 1,2,\ldots,p\):
   - Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors.
   - Pick the best among these \(\binom{p}{k}\) models and call it \(M_k\). Here best is defined as having the smallest RSS or largest \(R^2\).
3. Select a single best model from among \(M_0, M_1,\ldots,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
Note: \(\binom{p}{k}=\frac{p!}{k!(p-k)!}\) is the binomial coefficient. It counts the number of ways to choose a subset of \(k\) elements from a fixed set of \(p\) elements. For instance, in the previous example we had \(p=3\) and \(k=2\) for \(C_2\), so the number of models with 2 predictors was \(\binom{3}{2}=\frac{3!}{2!(3-2)!}=\frac{1\cdot2\cdot3}{1\cdot2\cdot1}=3\).
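To make the algorithm concrete, below is a minimal Python sketch of BSS for a linear model. It is illustrative only: it assumes a pandas DataFrame `X` of numeric (e.g. dummy-encoded) predictors, a response vector `y`, and scikit-learn for the least-squares fits; the function names are hypothetical, not taken from any package.

```python
# Illustrative sketch of best subset selection (not an official implementation).
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def rss(model, X, y):
    """Residual sum of squares of a fitted model on (X, y)."""
    residuals = y - model.predict(X)
    return float(np.sum(residuals ** 2))


def best_subset_selection(X, y):
    """Return, for each size k = 0, ..., p, the best model M_k as (predictors, fit)."""
    predictors = list(X.columns)
    best_models = {0: ([], None)}  # M_0: intercept only (predicts the sample mean)

    # Step 2: for each k, fit all C(p, k) models with exactly k predictors
    # and keep the one with the smallest training RSS.
    for k in range(1, len(predictors) + 1):
        best = None
        for subset in combinations(predictors, k):
            cols = list(subset)
            fit = LinearRegression().fit(X[cols], y)
            score = rss(fit, X[cols], y)
            if best is None or score < best[0]:
                best = (score, cols, fit)
        best_models[k] = (best[1], best[2])

    # Step 3 (choosing among M_0, ..., M_p) is done separately, e.g. by
    # cross-validation or an information criterion.
    return best_models
```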
Example: Credit Dataset
Variable | Description
---|---
balance | average credit card debt (response variable)
age | in years
cards | number of credit cards
education | in years
income | in thousands of dollars
limit | credit limit
rating | credit rating
gender | male/female
student | yes/no
status | marital status (yes/no)
ethnicity | Caucasian/African American/Asian
Credit data: a visual inspection
Note: Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Credit data: BSS Step 2 of the algorithm
The grey dots show the RSS (left panel) and the \(R^2\) (right panel) of all possible models, with the number of predictors ranging from 1 to 11. The red dots (connected by a line) mark, for each number of predictors, the model with the lowest RSS (left) and the highest \(R^2\) (right).
Credit data: BSS step 3 with selection criteria
In this example \(C_p\) (AIC) selects 6 predictors, BIC 4 predictors and adjusted \(R^2\) 7 predictors.
As generally expected, BIC selects fewer variables; i.e., it selects a sparser model.
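For reference, for a least-squares model containing \(d\) predictors these criteria take the following standard forms (as in ISL, up to scaling constants that differ slightly between references): \[ C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{n}\left(RSS + \log(n)\,d\,\hat{\sigma}^2\right), \qquad \mbox{adjusted~} R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}, \] where \(\hat{\sigma}^2\) is an estimate of the error variance (typically from the full model) and TSS is the total sum of squares. For least squares, AIC is proportional to \(C_p\), which is why the two are reported together; and since \(\log(n) > 2\) once \(n > 7\), BIC penalises model size more heavily than \(C_p\), which explains its preference for sparser models.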
Credit data: BSS step 3 with validation/cross-validation
In this example results from validation using 75% of the sample for training and from 10-fold cross-validation are in agreement (6 predictors).
Generally, this will not always be the case.
Also, if we re-run either validation or cross-validation the splits will change and we might get a different result. Cross-validation is more stable, but the problem still exists.
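As a concrete illustration, the sketch below estimates the 10-fold cross-validation error of the best model of each size, reusing the hypothetical `best_models` output from the earlier BSS sketch together with scikit-learn; fixing the fold seed makes the splits (and hence the result) reproducible.

```python
# Illustrative sketch of step 3 via 10-fold cross-validation.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def cv_error_by_size(X, y, best_models, seed=1):
    """Estimate the 10-fold CV mean squared error of each M_k."""
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)  # fixed seed => fixed splits
    cv_mse = {}
    for k, (subset, _) in best_models.items():
        if not subset:                 # skip the intercept-only model for brevity
            continue
        scores = cross_val_score(
            LinearRegression(), X[subset], y,
            scoring="neg_mean_squared_error", cv=folds,
        )
        cv_mse[k] = -scores.mean()
    return cv_mse


# The selected size is the one minimising the estimated prediction error:
# k_star = min(cv_mse, key=cv_mse.get)
```

Changing `seed` changes the splits and, occasionally, the selected size, which is exactly the instability mentioned above. Strictly speaking, a fully honest estimate would also repeat the subset search within each training fold.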
Interpreting results and some general comments
Results from \(C_p\) (AIC) are in agreement with validation and cross-validation (6 predictors). Generally, this is to be expected, since \(C_p\) (AIC) is designed to approximate the prediction error, while validation/cross-validation estimate it directly!
The rationale behind the adjusted \(R^2\) is philosophically close to the above methods: it still targets model fit but adjusts for model dimensionality. Here, adjusted \(R^2\) selects 7 predictors.
The theoretical background of BIC is quite different in nature; this criterion is designed to approximately locate the “true” model. The notion of a true model is debatable in statistics. A famous aphorism from the great statistician George Box is the following:
“All models are wrong, but some are useful”
On the other hand, theoreticians argue that we would like methods that are optimal under specific conditions (big discussion!). Here, based on BIC, we would select 4 predictors.
Credit data: which model to choose at the end?
Looking back at the results it seems that the models with 4, 5 and 6 predictors are all acceptable.
Based on the principle of parsimony one would select the simplest model from the set of good models.
In our example this is the model with 4 predictors, as the gains from adding 1 or 2 additional predictors are negligible under all methods.
Personal opinion: choose the model with 4 predictor variables.
But: data science is not an exact science! Choices are subjective. In addition, decisions may change as new data become available.
Also, there is something more we can do: hold out part of the data, and then, among the set of “best models”, select the model which optimises predictive performance on the held-out data!
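A minimal sketch of this idea, reusing the hypothetical `best_subset_selection` helper from earlier; the 25% hold-out fraction and the shortlist of sizes 4–6 simply mirror the discussion above.

```python
# Hold out part of the data; the subset search never sees it.
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Run the subset search on the training portion only.
best_models = best_subset_selection(X_train, y_train)

# Compare the shortlisted "good" models on the held-out data.
for k in (4, 5, 6):
    subset, fit = best_models[k]
    mse = mean_squared_error(y_test, fit.predict(X_test[subset]))
    print(f"{k} predictors: held-out MSE = {mse:.1f}")
```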
The “curse of dimensionality”
BSS offers a principled way to select the best subset of variables under whichever evaluation approach we adopt.
However, BSS is based on combinatorics and suffers from the so-called “curse of dimensionality”.
For \(p=10\) the number of possible subsets is \(2^{10}=1024\) (feasible).
But, for \(p=20\) there exist over 1 million possibilities and for \(p=30\) over 1 billion!!!
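These counts simply sum the binomial coefficients over all subset sizes: \[ \sum_{k=0}^{p}\binom{p}{k} = 2^p, \qquad \mbox{e.g.}\quad 2^{20} = 1{,}048{,}576 \quad\mbox{and}\quad 2^{30} \approx 1.07\times 10^9. \]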
This brings forth the need for approximate algorithms which are more scalable.
Forward Stepwise Selection: 3 predictor example
Again consider three predictors \(X_1\), \(X_2\) and \(X_3\). Forward stepwise selection (FSTS) works as follows: \[\begin{aligned} M_0: & \mbox{~intercept~only~(null~model)}\\ C_1: & \begin{cases} X_1\\ {\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_1: \mbox{~make~this~} {\color{ggreen} M_1}\\ X_3\\ \end{cases}\\ C_2: & \begin{cases} {\color{blue}X_1},{\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_2: \mbox{~make~this~} {\color{blue} M_2}\\ {\color{ggreen}X_2}, X_3 \end{cases}\\ M_3: & ~X_1, X_2, X_3 \mbox{~ (full~model)}\end{aligned}\]
At the end we again choose one model among \(M_0, {\color{ggreen} M_1}, {\color{blue} M_2}, M_3\) based on cross-validation or some criterion.
Forward Stepwise Selection: the general algorithm
1. Let \(M_0\) denote the null model which contains no predictors.
2. For \(k = 0,1,\ldots,p-1\):
   - Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor.
   - Choose the best among these \(p-k\) models and call it \(M_{k+1}\). Here best is defined as having the smallest RSS or largest \(R^2\).
3. Select a single best model from among \(M_0, M_1,\ldots,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
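A minimal Python sketch of FSTS, under the same assumptions as the BSS sketch above (and reusing its hypothetical `rss` helper):

```python
# Illustrative sketch of forward stepwise selection.
from sklearn.linear_model import LinearRegression


def forward_stepwise_selection(X, y):
    """Return, for each size k = 0, ..., p, the chosen model M_k as (predictors, fit)."""
    remaining = list(X.columns)
    current = []                      # predictors in the current model M_k
    models = {0: ([], None)}          # M_0: intercept only

    while remaining:
        best = None
        # Augment M_k with each of the p - k remaining predictors ...
        for candidate in remaining:
            cols = current + [candidate]
            fit = LinearRegression().fit(X[cols], y)
            score = rss(fit, X[cols], y)
            # ... and keep the addition with the smallest training RSS.
            if best is None or score < best[0]:
                best = (score, candidate, fit)
        current.append(best[1])
        remaining.remove(best[1])
        models[len(current)] = (list(current), best[2])

    # As before, the final choice among M_0, ..., M_p uses CV or a criterion.
    return models
```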
BSS vs. FSTS
BSS requires fitting \(2^p\) models, while FSTS requires fitting only \(1+\displaystyle\frac{p(p+1)}{2}\) models.
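This count comes from summing the fits over the steps (one fit for the null model, then \(p-k\) fits to go from \(M_k\) to \(M_{k+1}\)): \[ 1 + \sum_{k=0}^{p-1}(p-k) = 1 + \sum_{j=1}^{p} j = 1 + \frac{p(p+1)}{2}. \]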
So for \(p=20\): BSS \(\rightarrow\) 1,048,576 models, FSTS \(\rightarrow\) 211 models!
There is a cost to pay, though, for the gain in scalability.
FSTS can be sub-optimal in the sense that the models it selects are not guaranteed to be the best subset models.
This is because the guided FSTS search at each step depends on the predictors selected in the previous steps.
However, in practice BSS and FSTS perform quite similarly in terms of predictive accuracy.
Credit data: BSS and FSTS models
# Predictors | BSS | FSTS
---|---|---
One | rating | rating
Two | rating, income | rating, income
Three | rating, income, student | rating, income, student
Four | cards, income, student, limit | rating, income, student, limit
Backward Stepwise Selection: 3 predictor example
Once again: three predictors \(X_1\), \(X_2\) and \(X_3\). Backward stepwise selection (BSTS) works as follows: \[\begin{aligned} M_3: & ~X_1, X_2, X_3 \mbox{~ (full~model)}\\ C_2: & \begin{cases} {\color{blue}X_1,X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_2: \mbox{~make~this~} {\color{blue} M_2}\\ X_1, X_3\\ X_2, X_3 \end{cases}\\ C_1: & \begin{cases} {\color{blue}X_1}\\ {\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_1: \mbox{~make~this~} {\color{ggreen} M_1}\\ \end{cases}\\ M_0: & \mbox{~intercept~only~(null~model)} \end{aligned}\]
At the end we again choose one model among \(M_0, {\color{ggreen} M_1}, {\color{blue} M_2}, M_3\) based on cross-validation or some criterion.
Backward Stepwise Selection: the general algorithm
1. Let \(M_p\) denote the full model which contains all predictors.
2. For \(k = p,p-1,\ldots,1\):
   - Consider all \(k\) models that contain all but one of the predictors in \(M_k\), for a total of \(k-1\) predictors.
   - Choose the best among these \(k\) models and call it \(M_{k-1}\). Here best is defined as having the smallest RSS or largest \(R^2\).
3. Select a single best model from among \(M_0, M_1,\ldots,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
Just like the forward algorithm, backward elimination requires fitting only \(1+\displaystyle\frac{p(p+1)}{2}\) models.
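For completeness, a minimal Python sketch of BSTS under the same assumptions as the earlier sketches (DataFrame `X`, response `y`, and the hypothetical `rss` helper):

```python
# Illustrative sketch of backward stepwise selection.
from sklearn.linear_model import LinearRegression


def backward_stepwise_selection(X, y):
    """Return, for each size k = p, ..., 0, the chosen model M_k as (predictors, fit)."""
    current = list(X.columns)                               # M_p: the full model
    models = {len(current): (list(current),
                             LinearRegression().fit(X[current], y))}

    while current:
        if len(current) == 1:
            models[0] = ([], None)                          # M_0: intercept only
            break
        best = None
        # Drop each predictor of M_k in turn ...
        for candidate in current:
            cols = [c for c in current if c != candidate]
            fit = LinearRegression().fit(X[cols], y)
            score = rss(fit, X[cols], y)
            # ... and keep the deletion with the smallest training RSS.
            if best is None or score < best[0]:
                best = (score, candidate, fit)
        current.remove(best[1])
        models[len(current)] = (list(current), best[2])

    return models
```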
Model search methods: concluding remarks
Best subset, forward stepwise and backward stepwise selection approaches give similar but not identical results.
Hybrid approaches that combine forward and backward steps also exist (but we will not cover these).
Best subset and forward selection can be applied to some extent when \(n<p\), but only up to the model that contains \(n-1\) predictors. Backward selection cannot be applied when \(n<p\).
We will see in workshops how to apply these methods to real data sets.
Next topic
A common characteristic of the model search methods we have just seen is that they are discrete in nature. Next I will begin discussing penalised regression procedures, which are continuous in nature. We will start with ridge regression.