Maths projects

Statistical Classification - Rachel Oughton & Jonathan Cumming

Description

Classification methods are used when a data observation needs to be categorised as belonging to one of a predefined set of classes, based on its characteristics. A training dataset, where the class of each observation is already known, is used to 'train’ the model, and then as new data arrive they can be classified using the model. For example, each of a collection of news articles is determined to be either ‘real’ or ‘fake’. Various properties are extracted from each article, such as length, types of word used, source, etc., and these are used to train a classification model. When a new article emerges, the same properties are extracted, and the model can determine with some level of confidence whether the article is real or fake.

Classification is a useful statistical skill to have, with example applications including medical (does a patient have a particular disease?), email (is an email spam or not?), e-commerce (based on their browsing, is a prospective customer likely to make a purchase?) environmental (classifying land cover from aerial photographs) and many others.

Approaches to classification vary. The nearest neighbour classifier simply assigns the new data point with the same class as its nearest neighbour in the training data. In Linear discriminant analysis, we try to find a function of the variables that partitions the space into the different classes. Logistic regression treats the problem as a regression in which the response variable is categorical. One can also build classification trees, where the characteristics are used sequentially to categorise a data point. Each of these approaches has different strengths and weaknesses in terms of accuracy, user interpretability, level of statistical rigour and so on.


Classification tree model for a dataset on Irises. At each node, an observation goes to the left child node if and only if the condition is true.
The pair of numbers below each terminal node gives the number misclassified and the node sample size.

In this project, you will study a variety of classification methods and apply them to real data. Two datasets we suggest are:

It is not hard to find other datasets suitable for classification, for example at the Office for National Statistics or Kaggle.

There are a number of directions you could take with your project, such as

In this project there will be a lot of data analysis and statistical computation. It is therefore essential to be familiar with the statistical package R, as well as general statistical and data analysis concepts.

Prerequisites/Corequisites

Statistical Concepts II, Statistical Methods III

Web Resources

Books

Many books contain suitable introductory material:

Email

Rachel Oughton

Back