Real data applications of statistics often involve large data sets with many potentially interacting variables, and it is the statistician's job to analyse and investigate this wealth of data to discover and/or model the interesting and relevant features and relationships. For example, a single camera phone can take photos containing millions of pixels, each of which could be considered a random variable, so every photo could be treated as a point in ℝ^p, where the dimension p is the number of pixels. As p increases, it becomes more challenging to perform standard statistical tasks such as fitting linear models, estimating parameters, calculating summary statistics, plotting the data, or even storing the data itself. This increasing difficulty in organising, analysing and understanding such data is often loosely referred to as the 'curse of dimensionality'.
Often, much of the meaningful and useful variation in the data is restricted to a smaller subspace of interest. Dimension reduction and variable selection are two general strategies that address the problems of big data by reducing the size of the problem to something more manageable. Dimension reduction (or feature extraction) constructs a smaller collection of m new derived variables from the original data set that are in some sense representative of the whole. Variable (or feature) selection instead identifies a subset of m < p key variables from the data set and discards the remainder. The hope behind both approaches is that the reduced data captures the majority of the 'signal' in the data, and that whatever is left over is ignorable, non-informative 'noise'.
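As a rough sketch of the variable-selection idea, here is a minimal R example; the simulated data, the variance-based filter and the choice m = 5 are all illustrative assumptions rather than a recommended method:

    # Simulate n = 100 observations of p = 50 variables, of which
    # only the first 5 carry appreciable variation ('signal')
    set.seed(1)
    n <- 100; p <- 50
    X <- matrix(rnorm(n * p, sd = 0.1), n, p)
    X[, 1:5] <- X[, 1:5] + matrix(rnorm(n * 5), n, 5)

    # A simple filter: keep the m columns with the largest sample
    # variance and discard the remaining p - m columns
    m <- 5
    vars <- apply(X, 2, var)
    keep <- sort(order(vars, decreasing = TRUE)[1:m])
    X_reduced <- X[, keep]
    dim(X_reduced)   # 100 x 5

Filters of this kind are only the simplest form of variable selection; more sophisticated methods judge variables by their usefulness in a model rather than by variance alone.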
For example, the data set above is two-dimensional, but clearly the majority of the variation in the data occurs in the direction of the red arrow, with the remaining variation contributing less to differentiating one data point from another. If we transform the data (right), we might choose to work only with the projections of the data onto the red line and ignore the variation in the green direction, thus reducing the dimension of the data from two to one. This is a simple example of Principal Component Analysis, which is covered in Statistical Methods III.
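This can be carried out in R with the built-in prcomp function. The following is a minimal sketch on simulated data; the data themselves and the decision to keep only the first component are illustrative assumptions:

    # Simulate 200 correlated two-dimensional observations
    set.seed(1)
    x1 <- rnorm(200)
    x2 <- 0.5 * x1 + rnorm(200, sd = 0.3)
    X  <- cbind(x1, x2)

    # prcomp centres the data and finds orthogonal directions of
    # decreasing variance; the first column of 'rotation' is the
    # direction of greatest variation (the 'red arrow' above)
    pca <- prcomp(X)
    pca$rotation[, 1]   # first principal component direction
    summary(pca)        # proportion of variance explained by each component

    # Dimension reduction: keep only the projections onto the first
    # component (positions along the red line) and discard the rest
    scores <- pca$x[, 1]

Here summary(pca) reports how much of the total variance each component explains, which guides how many components are worth keeping.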
There are many possible routes this project could take, depending on the student's interests:
This project has a focus on statistical methodology and data analysis. Familiarity with the statistical package R, with general statistical concepts, and with data analysis is essential.
Statistical Concepts II, Statistical Methods III