Project IV 12-13

Project IV (MATH4072) 2012-13

Feature extraction from gene expression microarray data

Dr J. Einbeck

co-supervised by Dr Adetayo Kasim, Wolfson Research Institute

Description
DNA microarray data are a rich source of information for molecular biologists, and a fascinating source of data for statisticians. Slightly simplified, microarrays are solid surfaces made of glass or silicon which carry an arrayed series of thousands of microscopic DNA spots. DNA microarrays are used to measure changes in gene expression levels, where "expression" is the translation of information encoded in a gene into proteins. The analysis of such data has become very important for understanding various biological processes, including cancer. Typically, one simultaneously screens the expression of all genes in a cell exposed to some specific conditions, yielding a table of "samples versus genes" of dimension, say, n x p. One of the basic features of microarray data is that p >> n: the number of observations, n, is usually in the tens, while the number of genes, p, is in the thousands. This makes any attempt at direct statistical analysis (for instance, classification of genes) almost impossible, and one needs efficient preprocessing steps which extract low-dimensional "features" from the data matrix.
This project will look at ways of extracting information or "features" from high-dimensional data, initially using familiar statistical techniques such as principal component analysis, and proceeding to more advanced methods later on. The project will be run in collaboration with Dr Adetayo Kasim from the Wolfson Research Institute, who has arranged for microarray data to be provided by Janssen Pharmaceutical, Beerse, Belgium. There are two directions of research in which to turn in the course of the project:

Informative or non-informative calls for gene expression. Microarray and other OMICS technologies generate thousands of measurements across the whole genome of an organism for molecular profiling. The high dimensionality of microarray data is both a strength and a weakness of the technology, the latter because of the huge number of false positives. This project will investigate statistical methods for filtering irrelevant genes in a microarray experiment.
Identification of gene signatures for irritable bowel syndrome. One of the challenges in the diagnosis of irritable bowel syndrome is its lack of specificity and the difficulty of distinguishing it from other bowel disorders. This project will investigate statistical methods for molecular profiling of irritable bowel syndrome using microarrays.
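
As a rough illustration of the feature extraction described above, the following sketch applies principal component analysis to an n x p matrix with p >> n and reduces each sample to two scores. It is only a sketch under stated assumptions: the data are simulated (not the project's microarray data), the group effect is invented for illustration, and the function name pca_scores is hypothetical. The SVD of the centred matrix is used so that the p x p covariance matrix never has to be formed.

```python
import numpy as np

def pca_scores(X, k):
    """Return the first k principal component scores of an n x p matrix X
    and the proportion of variance explained by each of them."""
    Xc = X - X.mean(axis=0)                      # centre each gene (column)
    # Thin SVD: Xc = U diag(s) Vt, with at most min(n, p) components
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # n x k matrix of scores
    explained = s**2 / np.sum(s**2)              # variance proportions
    return scores, explained[:k]

# Simulated data (assumption): 20 samples, 5000 genes, pure noise ...
rng = np.random.default_rng(0)
n, p = 20, 5000
X = rng.normal(size=(n, p))
# ... plus a hypothetical shift in 500 genes for the first 10 samples
X[:10, :500] += 3.0

scores, explained = pca_scores(X, k=2)
print(scores.shape)   # each sample is now described by 2 features
print(explained)      # share of total variance carried by each component
```

Each row of scores is a low-dimensional summary of one sample, which could then be passed to a standard method such as a classifier; how many components to keep is a modelling choice this sketch does not address.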

Prerequisites
Statistical Methods III

Resources and Examples

Clark, D. and Russell, L. (2000): Molecular Biology Made Simple and Fun, Cache River Press.
Draghici, S. (2011): Statistics and Data Analysis for Microarrays Using R and Bioconductor, Chapman and Hall/CRC, ISBN 978-1-4398-0975-4.
Reilly, C. (2009): Statistics in Human Genetics and Molecular Biology, Chapman and Hall/CRC, ISBN 978-1-4200-7263-1.
A nice online tutorial of microarray technology (your work would start after the end of the last slide only!)

email: jochen.einbeck "at" durham.ac.uk
