Real data applications of statistics often involve large data sets with many potentially interacting variables, and it is the statistician's job to analyse and investigate this wealth of data to discover and/or model the interesting and relevant features and relationships. For example, a single camera phone can take photos containing millions of pixels, each of which could be considered a random variable, so every photo could be treated as a point in ℝ^p, where the dimension p is the number of pixels. As p increases, it becomes more challenging to perform standard statistical tasks such as fitting linear models, estimating parameters, calculating summary statistics, plotting the data, or even storing the data itself. This increasing difficulty in organising, analysing and understanding such data is often loosely referred to as the 'curse of dimensionality'.
Often, much of the meaningful and useful variation in the data is restricted to a smaller subspace of interest. Dimension reduction and variable selection are two general strategies that address the problems of big data by reducing the size of the problem to something more manageable. Dimension reduction (or feature extraction) constructs a smaller collection of m new derived variables from the original data set that are in some sense representative of the whole. Variable (or feature) selection instead identifies a subset of m < p key variables from the data set and discards the remainder. The hope behind both approaches is that the reduced data captures the majority of the 'signal' in the data, and that whatever is left over is ignorable, non-informative 'noise'.
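As a rough sketch of the variable-selection idea, here is a minimal R example; the simulated data, the variance-based filter and the choice m = 5 are all illustrative assumptions rather than a recommended method:

    # Simulate n = 100 observations of p = 50 variables, of which
    # only the first 5 carry appreciable variation ('signal')
    set.seed(1)
    n <- 100; p <- 50
    X <- matrix(rnorm(n * p, sd = 0.1), n, p)
    X[, 1:5] <- X[, 1:5] + matrix(rnorm(n * 5), n, 5)

    # A simple filter: keep the m columns with the largest sample
    # variance and discard the remaining p - m columns
    m <- 5
    vars <- apply(X, 2, var)
    keep <- sort(order(vars, decreasing = TRUE)[1:m])
    X_reduced <- X[, keep]
    dim(X_reduced)   # 100 x 5

Filters of this kind are only the simplest form of variable selection; more sophisticated methods judge variables by their usefulness in a model rather than by variance alone.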
For example, the data set above is two-dimensional, but clearly the majority of the variation in the data occurs in the direction of the red arrow, with the remaining variation contributing less to differentiating one data point from another. If we transform the data (right), we might choose to work only with the projections of the data onto the red line and ignore the variation in the green direction, thus reducing the dimension of the data from two to one. This is a simple example of Principal Component Analysis, which is covered in Statistical Methods III.
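This can be carried out in R with the built-in prcomp function. The following is a minimal sketch on simulated data; the data themselves and the decision to keep only the first component are illustrative assumptions:

    # Simulate 200 correlated two-dimensional observations
    set.seed(1)
    x1 <- rnorm(200)
    x2 <- 0.5 * x1 + rnorm(200, sd = 0.3)
    X  <- cbind(x1, x2)

    # prcomp centres the data and finds orthogonal directions of
    # decreasing variance; the first column of 'rotation' is the
    # direction of greatest variation (the 'red arrow' above)
    pca <- prcomp(X)
    pca$rotation[, 1]   # first principal component direction
    summary(pca)        # proportion of variance explained by each component

    # Dimension reduction: keep only the projections onto the first
    # component (positions along the red line) and discard the rest
    scores <- pca$x[, 1]

Here summary(pca) reports how much of the total variance each component explains, which guides how many components are worth keeping.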
There are many possible routes this project could take, depending on the student's interests:
This project has a focus on statistical methodology and data analysis. Familiarity with the statistical package R, with general statistical concepts, and with data analysis is essential.
Statistical Concepts II, Statistical Methods III