Maths projects
Statistical Classification - Rachel Oughton & Jonathan Cumming
Description
Classification methods are used when a data observation needs to
be categorised as belonging to one of a predefined set of classes,
based on its characteristics. A training dataset, where the class
of each observation is already known, is used to 'train’ the model,
and then as new data arrive they can be classified using the
model. For example, each of a collection of news articles is determined
to be either ‘real’ or ‘fake’. Various properties are extracted from
each article, such as length, types of word used, source, etc., and
these are used to train a classification model. When a new article
emerges, the same properties are extracted, and the model can
determine with some level of confidence whether the article is real
or fake.
Classification is a useful statistical skill to have, with example
applications including medical (does a patient have a particular disease?),
email (is an email spam or not?), e-commerce (based on their browsing,
is a prospective customer likely to make a purchase?) environmental
(classifying land cover from aerial photographs) and many others.
Approaches to classification vary. The
nearest
neighbour classifier simply assigns the new data point with the same
class as its nearest neighbour in the training data. In
Linear
discriminant analysis, we try to find a function of the variables
that partitions the space into the different classes.
Logistic regression
treats the problem as a regression in which the response variable is
categorical. One can also build classification trees, where the characteristics
are used sequentially to categorise a data point. Each of these approaches has different strengths
and weaknesses in terms of accuracy, user interpretability, level of statistical rigour and so on.

Classification tree model for a dataset on Irises. At each node, an observation
goes to the left child node if and only if the condition is true.
The pair of numbers
below each terminal node gives the number misclassified and the node sample size.
In this project, you will study a variety of classification methods and apply them to real data.
Two datasets we suggest are:
- Mushrooms – classifying mushrooms as edible or poisonous based on their physical characteristics
- Twitter gender – classifying twitter account holders as male, female or brand (non-human)
based on their profile, behaviour and one randomly selected tweet
It is not hard to find other datasets suitable for classification, for example at the
Office for National Statistics or
Kaggle.
There are a number of directions you could take with your project, such as
- Investigating modelling problems such as model parsimony compared to accuracy
- Looking into Bayesian methods for classification
- Investigating measures of confidence in classification
- Considering clustering, an unsupervised learning problem in which the groups/classes are unknown
- Comparing several different classification methods for their accuracy and efficiency
- Working on developing the best possible classifier for a particular dataset
In this project there will be a lot of data analysis and statistical computation. It is therefore
essential to be familiar with the statistical package R, as well as general statistical and data analysis
concepts.
Prerequisites/Corequisites
Statistical Concepts II, Statistical Methods III
Web Resources
- There is a large amount of material available on the web; start with the
wikipedia
page for example, and the topics linked above.
- The R statistical package can be downloaded from here. See the glm
function for logistic regression, the lda function for
linear discriminants, and the class package for
nearest-neighbour and other methods.
- R-bloggers has many useful R
articles, for example
classifying text messages and
importing data into R.
Books
Many books contain suitable introductory material:
Email
Rachel Oughton