Ridge Regression

Dr. Konstantinos Perrakis

Description

Linear regression is one of the most important inferential tools in Statistics and Machine Learning. The methods of ordinary least-squares (LS) and maximum-likelihood (ML) estimation are the standard approaches in regression analysis, leading to a common solution with good theoretical properties; for instance, the estimator of the coefficient vector \(\boldsymbol{\beta}\) is unbiased. However, in settings where covariates (predictor variables) are highly correlated (the multi-collinearity problem), the LS estimator, although unbiased, has high variance. This leads to unstable estimates and can severely affect model inference and predictive performance. Moreover, the problem is aggravated when the number of covariates (\(p\)) is large.
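
To see the variance inflation concretely, the short sketch below (in Python, using an equicorrelated design; the sample size, dimension, noise level and correlation values are arbitrary illustrative choices) computes the LS coefficient variances \(\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\) for a weakly and a strongly correlated design.

```python
# Minimal illustration (not part of the project brief): how correlation among
# covariates inflates the variance of the LS estimator. All settings below
# (n, p, sigma2, rho) are arbitrary choices for demonstration.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 100, 5, 1.0

def ls_variances(rho):
    # Equicorrelated design: cor(x_j, x_k) = rho for j != k, variance 1 on the diagonal.
    Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Var(beta_hat_LS) = sigma2 * (X'X)^{-1}; return its diagonal entries.
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

print(np.round(ls_variances(rho=0.0), 4))   # small, comparable variances
print(np.round(ls_variances(rho=0.99), 4))  # variances blow up under near-collinearity
```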

As is known, the mean squared error (MSE) of any estimator can be decomposed into two parts: (i) the squared bias of the estimator and (ii) the variance of the estimator. This result, commonly called the bias-variance trade-off, suggests that under multi-collinearity introducing some bias into the LS solution can actually result in a lower MSE because of the smaller variance. This strategy is generally called regularization and essentially entails imposing certain constraints on the parameter of interest; i.e., through the constraints we “regularize” the problem at hand.
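
Written out for a generic estimator \(\widehat{\theta}\) of a scalar parameter \(\theta\) (notation introduced here only for illustration), the decomposition reads
\(\begin{equation*} \text{MSE}(\widehat{\theta}) = \mathbb{E}\big[(\widehat{\theta}-\theta)^2\big] = \big(\mathbb{E}[\widehat{\theta}]-\theta\big)^2 + \text{Var}(\widehat{\theta}), \end{equation*}\)
where the first term is the squared bias and the second is the variance.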

A prominent regularization method is ridge regression. This method introduces a constraint based on the \(L_2\)-norm of the regression coefficients; namely, \(\sum_{j=1}^{p}\beta_j^2 \le t\) for some positive threshold \(t\). With this modification we move from the LS solution
\(\begin{equation*} \boldsymbol{\widehat{\beta}}^{LS}=\underset{\boldsymbol{\beta}\in\mathbb{R}^p}{\mathop{\mathrm{arg\,min}}}\Vert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\Vert_2^2 \end{equation*}\)
to the ridge solution
\(\begin{equation*} \boldsymbol{\widehat{\beta}}^{ridge}=\underset{\boldsymbol{\beta}\in\mathbb{R}^p}{\mathop{\mathrm{arg\,min}}}\Vert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\Vert_2^2 + \lambda\Vert\boldsymbol{\beta}\Vert_2^2 ~~(\lambda>0), \end{equation*}\)

where \(\Vert\cdot\Vert_2\) is the \(L_2\)-norm and \(\lambda\) is a penalty parameter which has a one-to-one correspondence with \(t\). Ridge regression offers significant advantages: (1) it can cope with multi-collinear data, (2) it shrinks non-influential coefficients towards zero (important when \(p\) is large), and (3) it works even in the underdetermined setting where the sample size \(n\) is smaller than \(p\).
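
For a fixed \(\lambda\), the ridge objective above admits the closed-form solution \(\boldsymbol{\widehat{\beta}}^{ridge}=(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\), which makes a direct comparison with LS straightforward to code. The sketch below (Python; the design, true coefficients and value of \(\lambda\) are arbitrary illustrative choices) fits both estimators on a highly correlated design.

```python
# Minimal sketch: closed-form ridge estimate (X'X + lambda*I)^{-1} X'y versus LS.
# The data-generating settings and the chosen lambda are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 10, 5.0
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]            # a few influential coefficients

# Highly correlated design (pairwise correlation 0.95)
Sigma = 0.95 * np.ones((p, p)) + 0.05 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta_true + rng.normal(size=n)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                       # LS solution
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge solution

print(np.round(beta_ls, 2))     # unstable under multi-collinearity
print(np.round(beta_ridge, 2))  # shrunk towards zero, more stable
```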

The goal of this project will be to understand the fundamentals of linear and ridge regression, and to learn about diagnostic checks for multi-collinearity and different ways of setting the ridge penalty parameter in practice. Individual projects can then follow several directions; for example, comparing LS and ridge in simulation studies, investigating predictive performance under different criteria for choosing the ridge penalty, examining ways of “sparsifying” the ridge solution above, or using ridge regression in interesting real-world applications.
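
As one concrete example of setting the penalty parameter in practice, \(\lambda\) is often chosen by cross-validation over a grid of candidate values. The sketch below uses scikit-learn's RidgeCV for this purpose; the choice of library, the grid and the 10-fold scheme are illustrative assumptions rather than requirements of the project (note that scikit-learn calls the penalty alpha).

```python
# Illustrative sketch only: choosing the ridge penalty by cross-validation with
# scikit-learn's RidgeCV. The simulated data, grid and CV scheme are arbitrary choices.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only two influential covariates

alphas = np.logspace(-3, 3, 50)                  # candidate penalty values (lambda)
model = RidgeCV(alphas=alphas, cv=10).fit(X, y)

print(model.alpha_)                              # penalty selected by 10-fold CV
print(np.round(model.coef_[:5], 2))              # first few fitted coefficients
```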

Prerequisites

Statistical Concepts II. Some familiarity with R, Python or another appropriate programming language is also essential.

Resources