Machine Learning

Konstantinos Perrakis

Model Search Methods: Best Subset and Stepwise Selections

Best Subset Selection: an illustrative example

Suppose we have three predictors \(X_1\), \(X_2\) and \(X_3\). In this case best subset selection (BSS) works as follows: \[\begin{aligned} M_0: & \mbox{~intercept~only~(null~model)}\\ C_1: & \begin{cases} X_1\\ {\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_1: \mbox{~make~this~} {\color{ggreen} M_1}\\ X_3 \end{cases}\\ C_2: & \begin{cases} {\color{blue}X_1, X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_2: \mbox{~make~this~} {\color{blue} M_2}\\ X_1, X_3\\ X_2, X_3 \end{cases}\\ M_3: & ~X_1, X_2, X_3 \mbox{~ (full~model)} \end{aligned}\]

We then choose one model among \(M_0, {\color{ggreen} M_1}, {\color{blue} M_2}, M_3\) either via a validation set / cross-validation, or based on criteria such as \(C_p\), BIC, or adjusted \(R^2\).
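As a reminder, for a model with \(d\) predictors, training residual sum of squares RSS, and an estimate \(\hat{\sigma}^2\) of the error variance, one common set of definitions of these criteria (conventions differ slightly across textbooks, so treat these as indicative rather than as the lecture's exact formulas) is \[C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{n}\left(RSS + \log(n)\,d\,\hat{\sigma}^2\right), \qquad \mbox{adjusted~} R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)},\] where TSS denotes the total sum of squares. Models with small \(C_p\) or BIC, or large adjusted \(R^2\), are preferred.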

Best Subset Selection: the general algorithm

Note: \(\binom{p}{k}=\frac{p!}{k!(p-k)!}\) is the binomial coefficient. It gives all the possible ways to extract a subset of \(k\) elements from a fixed set of \(p\) elements. For instance, in the previous example we had \(p=3\) and \(k=2\) for \(C_2\), so the number of models with 2 predictors was \(\binom{3}{2}=\frac{3!}{2!(3-2)!}=\frac{1\cdot2\cdot3}{1\cdot2\cdot1}=3\).
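As a concrete, brute-force illustration of steps 1 and 2 of the algorithm, here is a minimal Python sketch on simulated data. The data, the rss() helper and the best_per_size dictionary are hypothetical names introduced only for this illustration; a real analysis would use a dedicated routine rather than this naive enumeration.

```python
from itertools import combinations

import numpy as np

# Simulated stand-in for a real data set: n observations, p predictors.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] + 0.5 * X[:, 0] + rng.normal(size=n)


def rss(X_sub, y):
    """Training RSS of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


# Step 1: the null model M_0 (intercept only).
best_per_size = {0: ((), float(np.sum((y - y.mean()) ** 2)))}

# Step 2: for each size k, fit all "p choose k" models and keep the one
# with the lowest training RSS; this is M_k.
for k in range(1, p + 1):
    candidates = [(cols, rss(X[:, list(cols)], y))
                  for cols in combinations(range(p), k)]
    best_per_size[k] = min(candidates, key=lambda c: c[1])

for k, (cols, value) in best_per_size.items():
    print(f"M_{k}: predictors {cols}, training RSS = {value:.1f}")

# Step 3 (not shown here): choose among M_0, ..., M_p using C_p, BIC,
# adjusted R^2, or validation / cross-validation -- not the training RSS.
```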

Example: Credit Dataset

Credit-related data on 400 individuals from the U.S. The variables are:

balance: average credit card debt (response variable)
age: age in years
cards: number of credit cards
education: years of education
income: income in thousands of dollars
limit: credit limit
rating: credit rating
gender: male/female
student: student status (yes/no)
status: marital status (yes/no)
ethnicity: Caucasian/African American/Asian

Credit data: a visual inspection

Note: Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Credit data: BSS step 2 of the algorithm

The grey dots show the RSS (left panel) and the \(R^2\) (right panel) of all possible models, with the number of predictors ranging from 1 to 11. The red dots (connected by a line) mark, for each model size, the model with the lowest RSS (left) and the highest \(R^2\) (right).

Credit data: BSS step 3 with selection criteria

Credit data: BSS step 3 with validation/cross-validation

Interpreting results and some general comments

Credit data: which model to choose at the end?

Personal opinion: choose the model with 4 predictor variables.
But: data science is not an exact science! Choices are subjective. In addition, decisions may change as new data become available.
Also, there is something we can do: hold out part of the data, and then, among the set of “best models”, select the model which optimises predictive performance with respect to the held-out data!
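A minimal sketch of this idea, assuming the per-size “best models” have already been identified (the column indices in candidate_subsets below are made up purely for illustration): refit each candidate on a training portion of the data and keep the one with the smallest error on the held-out portion.

```python
import numpy as np

# Simulated data standing in for a real data set.
rng = np.random.default_rng(1)
n, p = 400, 11
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Hypothetical "best model of each size", e.g. the output of a subset search.
candidate_subsets = [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3)]

# Hold out 25% of the observations; fit only on the remaining 75%.
val_idx = rng.choice(n, size=n // 4, replace=False)
train_idx = np.setdiff1d(np.arange(n), val_idx)


def held_out_mse(cols):
    """Fit on the training part and return the MSE on the held-out part."""
    cols = list(cols)
    design_tr = np.column_stack([np.ones(len(train_idx)), X[np.ix_(train_idx, cols)]])
    design_va = np.column_stack([np.ones(len(val_idx)), X[np.ix_(val_idx, cols)]])
    beta, *_ = np.linalg.lstsq(design_tr, y[train_idx], rcond=None)
    return float(np.mean((y[val_idx] - design_va @ beta) ** 2))


scores = {cols: held_out_mse(cols) for cols in candidate_subsets}
print("held-out MSEs:", scores)
print("chosen model:", min(scores, key=scores.get))
```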

The “curse of dimensionality”
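In the context of best subset selection this phrase usually refers to the combinatorial growth of the search space: with \(p\) predictors, BSS has to fit all \(2^p\) possible models, and \[2^{10} = 1{,}024, \qquad 2^{20} = 1{,}048{,}576, \qquad 2^{30} \approx 1.07 \times 10^{9},\] so the approach quickly becomes computationally infeasible as \(p\) grows. This is the main motivation for the stepwise methods considered next.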

Forward Stepwise Selection: 3 predictor example

Again consider three predictors \(X_1\), \(X_2\) and \(X_3\). Forward stepwise selection (FSTS) works as follows: \[\begin{aligned} M_0: & \mbox{~intercept~only~(null~model)}\\ C_1: & \begin{cases} X_1\\ {\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_1: \mbox{~make~this~} {\color{ggreen} M_1}\\ X_3\\ \end{cases}\\ C_2: & \begin{cases} {\color{blue}X_1},{\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_2: \mbox{~make~this~} {\color{blue} M_2}\\ {\color{ggreen}X_2}, X_3 \end{cases}\\ M_3: & ~X_1, X_2, X_3 \mbox{~ (full~model)}\end{aligned}\]

At the end, we again choose one model among \(M_0, {\color{ggreen} M_1}, {\color{blue} M_2}, M_3\) based on cross-validation or one of the selection criteria.
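A minimal Python sketch of this greedy search, mirroring the best-subset sketch shown earlier; the simulated data and the rss() helper are repeated so the snippet runs on its own, and none of these names come from the lecture material.

```python
import numpy as np

# Same simulated set-up as in the best-subset sketch.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] + 0.5 * X[:, 0] + rng.normal(size=n)


def rss(cols):
    """Training RSS of a least-squares fit with an intercept on columns `cols`."""
    if len(cols) == 0:                      # intercept-only (null) model
        return float(np.sum((y - y.mean()) ** 2))
    design = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


# Greedy forward search: start from the null model M_0 and, at each step,
# add the single predictor that most reduces the training RSS.
selected, remaining = [], list(range(p))
path = [((), rss([]))]                      # (subset, training RSS) for M_0

while remaining:
    j_best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(j_best)
    remaining.remove(j_best)
    path.append((tuple(selected), rss(selected)))

for k, (cols, value) in enumerate(path):
    print(f"M_{k}: predictors {cols}, training RSS = {value:.1f}")
# As before, the final choice among M_0, ..., M_p is made with a criterion
# or with validation / cross-validation, not with the training RSS.
```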

Forward Stepwise Selection: the general algorithm

BSS vs. FSTS

Credit data: BSS and FSTS models

BSS and FSTS selected models for up to four predictors: the models with one, two and three variables are the same, but the models with four variables differ.

# Predictors   BSS                              FSTS
One            rating                           rating
Two            rating, income                   rating, income
Three          rating, income, student          rating, income, student
Four           cards, income, student, limit    rating, income, student, limit

Backward Stepwise Selection: 3 predictor example

Once again: three predictors \(X_1\), \(X_2\) and \(X_3\). Backward stepwise selection (BSTS) works as follows: \[\begin{aligned} M_3: & ~X_1, X_2, X_3 \mbox{~ (full~model)}\\ C_2: & \begin{cases} {\color{blue}X_1,X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_2: \mbox{~make~this~} {\color{blue} M_2}\\ X_1, X_3\\ X_2, X_3 \end{cases}\\ C_1: & \begin{cases} {\color{blue}X_1}\\ {\color{ggreen}X_2} \rightarrow \mbox{~lowest~training~} RSS \mbox{~within~} C_1: \mbox{~make~this~} {\color{ggreen} M_1}\\ \end{cases}\\ M_0: & \mbox{~intercept~only~(null~model)} \end{aligned}\]

At the end, we again choose one model among \(M_0, {\color{ggreen} M_1}, {\color{blue} M_2}, M_3\) based on cross-validation or one of the selection criteria.
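And a sketch of the mirror image of the forward search, with the same simulated, purely illustrative set-up: start from the full model and repeatedly drop the predictor whose removal increases the training RSS the least.

```python
import numpy as np

# Same simulated set-up as in the previous sketches.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] + 0.5 * X[:, 0] + rng.normal(size=n)


def rss(cols):
    """Training RSS of a least-squares fit with an intercept on columns `cols`."""
    if len(cols) == 0:                      # intercept-only (null) model
        return float(np.sum((y - y.mean()) ** 2))
    design = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


# Greedy backward elimination: start from the full model M_p and, at each
# step, remove the predictor whose deletion leaves the lowest training RSS.
selected = list(range(p))
path = [(tuple(selected), rss(selected))]   # (subset, training RSS) for M_p

while selected:
    j_drop = min(selected, key=lambda j: rss([i for i in selected if i != j]))
    selected.remove(j_drop)
    path.append((tuple(selected), rss(selected)))

for cols, value in path:
    print(f"M_{len(cols)}: predictors {cols}, training RSS = {value:.1f}")
# The sequence M_p, ..., M_0 is then compared using a criterion or
# validation / cross-validation, exactly as in the forward case.
```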

Backward Stepwise Selection: the general algorithm

Just like the forward algorithm, backward elimination requires only \(1+\displaystyle\frac{p(p+1)}{2}\) model fits.
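A one-line derivation of this count (the same argument applies to the forward algorithm): the search fits the starting model once and then, over its \(p\) steps, fits \(p\) candidate models at the first step, \(p-1\) at the second, and so on down to \(1\), giving \[1 + p + (p-1) + \dots + 1 = 1 + \frac{p(p+1)}{2}.\] For the Credit data with \(p=11\) predictors this is \(1 + 66 = 67\) model fits, compared with \(2^{11} = 2048\) for best subset selection.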

Model search methods: concluding remarks

We will see in workshops how to apply these methods to real data sets.

Next topic

A common characteristic of the optimisation methods we have just seen is that they are discrete in nature. Next, I will begin discussing penalised regression procedures, which are continuous in nature. We will start with ridge regression.