Communicating Mathematics III (MATH3131) 2011-12


Mixture models

Dr J. Einbeck

Description

A frequently arising situation is that data are collected from several subpopulations, but the information about which subpopulation a given object belongs to is unknown or not recorded. For instance, when collecting spectra emitted by astronomical objects, the object may be a star, quasar, galaxy, ..., but this information is not available a priori; it is "latent". Data arising from several subpopulations often have a clustered appearance, where the clusters are sometimes clearly distinct but are often poorly separated or even overlapping. Data of this type pose considerable challenges to the data analyst. A possible way forward is to describe the data by a mixture model. For instance, if one believes that two latent subpopulations are present, one may model the data as a mixture of two normal distributions, which in the univariate case takes the form

Y ~ p N(m_1, σ_1^2) + (1-p) N(m_2, σ_2^2)

where p is the probability of being generated by the first normal distribution. Usually, all of the parameters p, m_1, σ_1, m_2, σ_2 have to be estimated from the data. This is done through the EM (Expectation-Maximization) algorithm, which over the past decades has become an extraordinarily important statistical device (with significance far beyond the mixture modelling problem). Mixture models are not a clustering technique per se, but, conveniently, the EM algorithm delivers probabilities of component membership as a by-product. Mixture models are of major relevance in a wide range of sciences, including the environmental, social, and medical sciences, as well as in the finance sector. In this project, you will gain some insight into the methodology used for mixture modelling and apply this technique to real data sets, which can be chosen from a field of application that suits your interests.
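To make the estimation step concrete, here is a minimal sketch of how EM could be applied to the two-component normal mixture above. It is an illustration under stated assumptions, not the method prescribed for the project: the function names (normal_pdf, em_two_normals), the crude starting values, the fixed number of iterations, and the simulated data are all hypothetical choices made for this example.

```python
import math
import random


def normal_pdf(y, m, s):
    """Density of N(m, s^2) evaluated at y."""
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def em_two_normals(data, n_iter=200):
    """Fit Y ~ p N(m1, s1^2) + (1-p) N(m2, s2^2) by EM (illustrative sketch)."""
    ys = sorted(data)
    n = len(ys)
    # Starting values (an assumption): split the sorted data into two halves.
    lower, upper = ys[: n // 2], ys[n // 2:]
    m1, m2 = sum(lower) / len(lower), sum(upper) / len(upper)
    s1 = s2 = (ys[-1] - ys[0]) / 4.0 or 1.0
    p = 0.5
    w = []
    for _ in range(n_iter):
        # E-step: probability that each observation belongs to component 1.
        w = []
        for y in data:
            a = p * normal_pdf(y, m1, s1)
            b = (1.0 - p) * normal_pdf(y, m2, s2)
            w.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: weighted maximum-likelihood updates of all parameters.
        n1 = sum(w)
        n2 = n - n1
        p = n1 / n
        m1 = sum(wi * y for wi, y in zip(w, data)) / n1
        m2 = sum((1 - wi) * y for wi, y in zip(w, data)) / n2
        s1 = math.sqrt(sum(wi * (y - m1) ** 2 for wi, y in zip(w, data)) / n1)
        s2 = math.sqrt(sum((1 - wi) * (y - m2) ** 2 for wi, y in zip(w, data)) / n2)
    # w holds the membership probabilities from the final E-step.
    return p, m1, s1, m2, s2, w


# Simulated data (hypothetical): 60% from N(0, 1), 40% from N(5, 1).
random.seed(1)
sample = [
    random.gauss(0.0, 1.0) if random.random() < 0.6 else random.gauss(5.0, 1.0)
    for _ in range(500)
]
p, m1, s1, m2, s2, w = em_two_normals(sample)
print("p =", round(p, 3), "m1 =", round(m1, 3), "s1 =", round(s1, 3),
      "m2 =", round(m2, 3), "s2 =", round(s2, 3))
```

The list w returned by the sketch contains, for each observation, the estimated probability of having been generated by the first component; these are the membership probabilities that EM delivers as a by-product. A practical implementation would usually stop on a convergence criterion instead of a fixed iteration count and would guard against a component variance collapsing towards zero.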

Prerequisites

  • Statistical Concepts II

Resources

email: Jochen Einbeck

