Project IV (MATH4072) 2020-21


Principal components for image analysis

Dr J. Einbeck

Description

Principal component analysis (PCA) is a traditional methodology in Statistics (established by Karl Pearson in 1901), but, despite its age, it is still an important and frequently used building block of many machine learning routines. Unlike regression methodology, principal component analysis does not distinguish between predictors and responses, and so takes a `symmetric' view of the variables. One can think of the first principal component as the `best line through the middle of the data', where `best' can be statistically quantified as `maximizing the variance of the data projected onto the line', or, put more simply, explaining most of the variation in the data. The second principal component line then maximizes this criterion among all lines which are orthogonal to the first one, and so on. The resulting sequence of `best linear approximations' to a multivariate data set can in principle be used to reconstruct the original data set fully, but of course not a lot would be gained by this! The power of principal components stems from the property that the `later' principal components capture less and less of the variation, and hence typically less useful information, and so discarding them opens the possibility of `denoising' the original data set, or producing a `compressed' version of the original data which requires less storage space and is easier to handle due to its reduced size. Furthermore, specific principal components may contain useful information which can be exploited for further analysis.
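As a rough illustration of the ideas above, the following Python sketch computes principal components of a small simulated data set through the singular value decomposition of the centred data, and then reconstructs the data from the first k components. The use of Python and NumPy, the function names `pca` and `reconstruct`, and the simulated data are all assumptions made for this illustration; the project does not prescribe them.

```python
import numpy as np


def pca(X):
    """Return the column means, the principal directions (as rows),
    and the variance of the data projected onto each direction."""
    mean = X.mean(axis=0)
    Xc = X - mean  # centre each variable
    # Rows of Vt are orthonormal directions, ordered by decreasing variance
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)
    return mean, Vt, variances


def reconstruct(X, mean, components, k):
    """Approximate X using only the first k principal components."""
    scores = (X - mean) @ components[:k].T  # coordinates along the k lines
    return mean + scores @ components[:k]


# Simulated data (hypothetical): 200 points close to a line in 3 dimensions
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, -1.0]]) + 0.1 * rng.normal(size=(200, 3))

mean, components, variances = pca(X)
explained = variances / variances.sum()  # proportion of variance per component

X1 = reconstruct(X, mean, components, k=1)  # compressed / denoised version
X3 = reconstruct(X, mean, components, k=3)  # all components: full reconstruction
```

In this simulated example almost all of the variation lies along the first component, so `X1` stays close to `X` while storing only one score per observation; using all three components recovers `X` up to rounding error.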

An interesting application of such methods is image analysis. Images can be converted into `data' by arraying the grey scale or RGB values of each pixel into suitable vectors or matrices, which are then analyzed through Principal Component Analysis. In the case of a single image, the task of interest is usually denoising of the image. However, PCA can also be applied to many images at once, and the resulting decomposition can be used to extract features of interest. For instance, the sequence to the right shows (i) at the top, four handwritten digits out of a much larger data set containing 12000 such digits, all between 0 and 3; (ii) in the middle, the images corresponding to the first four principal components; (iii) at the bottom, a plot of the first against the second principal component scores of all 12000 images, which could then be used in subsequent clustering or classification steps for automated digit recognition. In this project, you will study PCA and related methods, based on your knowledge from Statistical Methods III, and learn how to `open' and prepare images for statistical analysis. You will then apply PCA to image analysis, in this context also considering other relevant statistical methods, such as classification routines, as required to complete your analysis.
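The step from images to `data' can be sketched as follows. Since the digit data set itself is not reproduced here, the sketch uses small simulated grey-scale images (noisy vertical and horizontal bars) as a stand-in; these images, the variable names, and the choice of Python with NumPy are assumptions for illustration only. Each image is arrayed into one row of a data matrix, PCA is applied to all images at once, and the leading components are reshaped back into images.

```python
import numpy as np

rng = np.random.default_rng(1)
h, w, n_per_class = 8, 8, 50

# Two hypothetical grey-scale templates: a vertical and a horizontal bar
vertical = np.zeros((h, w))
vertical[:, 3:5] = 1.0
horizontal = np.zeros((h, w))
horizontal[3:5, :] = 1.0

# Simulated image collection: noisy copies of each template
images = np.concatenate([
    vertical + 0.2 * rng.normal(size=(n_per_class, h, w)),
    horizontal + 0.2 * rng.normal(size=(n_per_class, h, w)),
])
labels = np.repeat([0, 1], n_per_class)

# Array each image into a vector: one row per image, one column per pixel
X = images.reshape(len(images), h * w)

# PCA via the singular value decomposition of the centred data matrix
mean_image = X.mean(axis=0)
Xc = X - mean_image
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# First two principal component scores of every image
scores = Xc @ Vt[:2].T

# First four principal components, viewed as images again
component_images = Vt[:4].reshape(4, h, w)
```

Plotting the two columns of `scores` against each other, coloured by `labels`, would give a display analogous to the scores plot described above, and the scores could serve as input to a clustering or classification routine.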

Prerequisites

  • Statistical Methods III

Resources

  • James G., Witten D., Hastie T., and Tibshirani R. (2013) An Introduction to Statistical Learning. Springer, New York. PDF, Section 6.3.
  • Hastie T., Tibshirani R., and Friedman J. (2001) The Elements of Statistical Learning. Springer, New York. PDF, Section 14.5.1 (including an example involving handwritten digits).
  • Interesting discussion on Stackexchange.


email: jochen.einbeck "at" durham.ac.uk