Cluster analysis, or clustering, is an exploratory statistical technique used to identify natural groupings within a large number of observations. Discovering that there are sub-groups in the data is often valuable information, particularly if the different groups of observations have different properties or behaviours - if we ignore such structure, we risk falling into the trap of Simpson's paradox. Clustering as a methodology also comes under the heading of unsupervised learning, as the goal is to try to identify the structure of the data set without any guidance as to how many clusters there are, what shape they have, or where they may be located.
Figure: a simple example of a data set with three distinct clusters.
Simple methods for clustering seek to identify sub-groups of the observations which are "similar" to other observations in the same group, but "different" from data in other groups. These groups then constitute the clusters within the data. Clearly, the definition of "similarity" between our data points is a crucial one, and our results will vary depending on our choices. A standard route is to define similarity in terms of the Euclidean distance between the data points, which leads to simple methods such as k-means (animation of k-means in action). A more rigorous approach than k-means is to assume that each cluster in the data is represented by a multivariate probability distribution, which turns the clustering problem into one of fitting a finite mixture model.
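To make this concrete, the short R sketch below (illustrative only, and not part of the project materials) simulates a toy two-dimensional data set with three groups and applies the built-in kmeans function, which measures similarity by Euclidean distance; the number of clusters, k = 3, is assumed known here, which is rarely the case with real data.

```r
## Simulate a toy data set: three well-separated groups in two dimensions
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean =  0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean =  3, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = -3, sd = 0.5), ncol = 2)
)

## Euclidean distances between the first few observations
round(dist(x[1:4, ]), 2)

## k-means with k = 3, restarted from 20 random initialisations
fit <- kmeans(x, centers = 3, nstart = 20)

## Inspect the cluster sizes and visualise the assignment
fit$size
plot(x, col = fit$cluster, pch = 19)
points(fit$centers, pch = 4, cex = 2, lwd = 2)
```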
A completely different approach to clustering uses decision trees, which progressively divide the data into the "most dissimilar" groups via simple tests on the individual variables. This process then repeats, dividing and sub-dividing the groups until we are left with single data points at the leaves of this hierarchical tree structure. Conversely, we could apply the same idea in reverse, progressively merging the "most similar" points into clusters and then merging the clusters themselves; a sketch of this bottom-up approach is given below. Beyond this, there is a wide variety of alternative ways to define "similarity", or even a "cluster", and this leads to an equally broad spectrum of methods which seek to discover these features.
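As a rough illustration of the bottom-up (agglomerative) version of this idea, the following R sketch builds a hierarchy by repeatedly merging the most similar groups, again using Euclidean distance on a small simulated data set; the choice of linkage ("complete" here) and of where to cut the tree are assumptions that would need justifying in a real analysis.

```r
## Simulate a small two-dimensional data set with three groups
set.seed(2)
y <- rbind(
  matrix(rnorm(60, mean =  0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  3, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = -3, sd = 0.5), ncol = 2)
)

## Agglomerative clustering: start from single points and progressively
## merge the most similar groups (complete linkage on Euclidean distance)
hc <- hclust(dist(y), method = "complete")

## The full hierarchy can be drawn as a dendrogram ...
plot(hc, labels = FALSE)

## ... or cut at a chosen level to recover a fixed number of clusters
groups <- cutree(hc, k = 3)
table(groups)
```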
Clustering is widely applied in the sciences and in the wider world, making it a useful method in any statistician's toolbox. For example: in biology, analysing genetic information to identify groups of genes related to a particular genetic disease; in marketing, analysing similar purchases during online shopping to make recommendations; in computer science, analysing the characteristics of internet connections to distinguish valid activity from hacking attempts; in digital commerce, identifying groups of customers with similar behaviours and interests; in medicine and health, finding cohorts of patients with similar treatment needs or similar outcomes after treatment.
In this project, we will begin with the key concepts behind cluster analysis - the ideas of distance and similarity. We will investigate k-means, and apply this knowledge to the analysis of real data sets. From there, you will study and implement some standard clustering methods, and evaluate their strengths and weaknesses through application to data. Further topics for study could include:
This project has a focus on data analysis and statistical computation. Familiarity with the statistical package R, general statistical concepts, and data analysis are essential.
Statistical Inference II
Data Science & Statistical Computing II
Many books contain suitable introductory material: