Penalized Regression

Dr. Konstantinos Perrakis

Description

Modern regression applications in many fields are often characterized by one or more of the following: (1) highly correlated predictors (the multi-collinearity problem), (2) a large number of predictors (the large-\(p\) problem), and (3) a sample size smaller than the number of predictors (the \(n<p\) problem). In these cases, maximum-likelihood (ML) and least-squares (LS) estimation are either inefficient or simply infeasible; for instance, given a predictor matrix \(\mathbf{X}\), the inversion of \(\mathbf{X}^T\mathbf{X}\) (which is required for the ML and LS solutions) cannot be performed when \(n<p\).

In such settings we require regularization of the problem at hand. This is achieved by imposing constraints on the regression coefficient vector \(\boldsymbol{\beta}\), which effectively shrink certain coefficients towards zero. This approach is also known as penalization and essentially entails restricting a certain \(L_q\)-norm (\(\Vert\cdot\Vert_q\)) of the vector \(\boldsymbol{\beta}\) not to exceed a specific threshold. Thus, in linear regression, penalized solutions generally take the following form
\(\begin{equation*} \underset{\boldsymbol{\beta}\in\mathbb{R}^p}{\mathop{\mathrm{arg\,min}}}\Vert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\Vert_2^2 ~~~ \text{subject to}~~~ \Vert\boldsymbol{\beta}\Vert_q \le t ~~(t>0), \end{equation*}\)
or equivalently the Lagrangian form
\(\begin{equation*} \underset{\boldsymbol{\beta}\in\mathbb{R}^p}{\mathop{\mathrm{arg\,min}}}\Vert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\Vert_2^2 + \lambda\Vert\boldsymbol{\beta}\Vert_q ~~(\lambda>0), \end{equation*}\)

where threshold \(t\) and penalty parameter \(\lambda\) have a one-to-one correspondence.
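As a concrete illustration of why penalization helps, take \(q=2\) with the penalty squared: this is ridge regression, whose objective admits the well-known closed-form minimizer

\(\begin{equation*} \hat{\boldsymbol{\beta}}^{\text{ridge}} = \underset{\boldsymbol{\beta}\in\mathbb{R}^p}{\mathop{\mathrm{arg\,min}}}\Vert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\Vert_2^2 + \lambda\Vert\boldsymbol{\beta}\Vert_2^2 = (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}. \end{equation*}\)

Since \(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_p\) is positive definite, and hence invertible, for any \(\lambda>0\), this solution exists even when \(n<p\), in contrast to the LS solution discussed above.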

Two prominent examples of penalized methods are based on the \(L_1\) and \(L_2\) norms. The first, ridge regression, utilizes the \(L_2\)-norm and is amongst the oldest penalized methods. The second, based on the \(L_1\)-norm, is called lasso regression. One important advantage of the lasso is that it can set non-influential coefficients exactly equal to zero, thereby also performing variable selection. Ever since the development of these two methods, numerous extensions and variations have emerged. Nowadays, penalized regression is widely used in Statistics and Machine Learning when the main goal is prediction.
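As a brief illustration, here is a minimal sketch in R using the \(\verb|glmnet|\) package on simulated data (the data-generating setup is purely illustrative); in \(\verb|glmnet|\)'s elastic-net parametrization, \(\verb|alpha = 0|\) corresponds to ridge and \(\verb|alpha = 1|\) to the lasso:

    library(glmnet)

    set.seed(1)
    n <- 50; p <- 100                      # an n < p setting
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(2, 5), rep(0, p - 5))    # only the first 5 predictors are influential
    y <- X %*% beta + rnorm(n)

    ridge <- glmnet(X, y, alpha = 0)       # L2 penalty (ridge)
    lasso <- glmnet(X, y, alpha = 1)       # L1 penalty (lasso)

    # At a given lambda, the lasso sets many coefficients exactly to zero,
    # whereas ridge only shrinks them towards zero.
    sum(coef(lasso, s = 0.5) != 0)
    sum(coef(ridge, s = 0.5) != 0)

Note that \(\verb|glmnet|\) standardizes the predictors by default and fits the models over an entire path of \(\lambda\) values.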

The goal in this project will be initially to learn the basics of these methods, to comprehend why they work well for statistically ill-posed problems, and to understand how the penalty parameter is set in practice (see the sketch below). Individual projects can then follow several directions; for example, comparing the methods in simulation studies, delving deeper into the theory of a specific method, or applying the methods to interesting real-world problems.
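On the last point, the penalty parameter is typically chosen by \(k\)-fold cross-validation. A minimal sketch, continuing the simulated example above (the choice of 10 folds is illustrative):

    # 10-fold cross-validation over the lambda path for the lasso
    cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

    cvfit$lambda.min                  # lambda minimizing the CV error
    cvfit$lambda.1se                  # largest lambda within one SE of the minimum
    coef(cvfit, s = "lambda.min")     # coefficients at the selected lambda
    plot(cvfit)                       # CV error as a function of log(lambda)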

Prerequisites

Statistical Methods III. We will also use the \(\verb|glmnet|\) package, so some familiarity with R, Python, or another appropriate programming language is essential.

Resources