Project 3 -- Statistics, Machine Learning, and Data Science

Project III: Statistics, Machine Learning, and Data Science
Jonathan Cumming & Sebastian Schmon

Description

Digital information has become so entrenched in all aspects of our lives and society, that the recent growth in information production appears unstoppable. Each day on Earth we generate 500 million tweets, 294 billion emails, 4 million gigabytes of Facebook data, 65 billion WhatsApp messages and 720,000 hours of new content added daily on YouTube. In 2020, the total amount of data created, captured, copied and consumed in the world was estimated at 59 zettabytes (one zettabyte is 10²¹ bytes, or a trillion gigabytes).

Clearly, no human being could ever analyse or understand such vast quantities of information! Instead, during the past 20 years has seen the rapid development of new tools in the field of statistics, giving rise to new fields such as data science, machine learning, and artificial intelligence. Most of the methods used in these areas are soundly based in statistics and mathematics, however they've arguably achieved better PR than the more traditional disciplines!

In this project, we will focus on methods of statistical and machine learning, which are key tools in the toolbox of a data scientist. At the core of machine learning are a variety of methods which sit at the intersection of statistics and computer science. These methods are developed with the emphasis primarily on making highly accurate predictions of some outcome of interest. In abstraction, we can treat these simply as a statistical or mathematical model for the quantity we want to predict. They often produce highly complex and hard-to-interpret models, but can nonetheless achieve impressive levels of accuracy.

Two models for classifying points as black or red. The blue exhibits a low bias but high variance, and the grey is the opposite with low variance but high bias. Which is best?

A visualization of a convolutional neural network.

The initial goals of the project will be to explore some key statistical learning methods. As the field is so broad, there is no shortage of options! However, some obvious directions for methodology would include

Classification methods such as k-nearest neighbour, Naive Bayes, and logistic regression
Clustering methods to identify groupings within data, such as k-means and finite mixture models
Tree-based methods and models, such as CART (classification and regression trees), random forests, and related topics such as boosting and bagging
Neural network approaches, such as multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neuarl networks (RNNs), neural ODEs etc.
Probabilistic approaches, such as variational methods and deep generative models combine neural networks with concepts from probability and statistics

With this understanding you may then take the project in whatever direction you find most interesting, which might include going deeper into the methodology of a technique which interests you, or finding a data set to use for a real-world application.

This project has a focus on statistical methodology and data analysis. Familiarity with the statistical package R (or equivalent languages such as Python), general statistical concepts, and data analysis are essential.

Prerequisites

Statistical Concepts II - for familiarity with standard statistical ideas and experience with R.

Co-requisites

Statistical Methods III - for multivariate statistical concepts, and connections to regression.

Books

Many books contain suitable introductory material. All books below are available freely and in full online.

An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani provides a more gentle introduction to the subject and contains many examples in R.
The Elements of Statistical Learning by Hastie, Tibshirani and Friedman covers similar topics, but goes more into detail.
Probabilistic Machine Learning: An Introduction by Kevin Murphy. A very comprehensive overview and probably the most up-to-date introduction to machine learning.

Email