Digital information has become so entrenched in all aspects of our lives and society, that the recent growth in information production appears unstoppable. Each day on Earth we generate 500 million tweets, 294 billion emails, 4 million gigabytes of Facebook data, 65 billion WhatsApp messages and 720,000 hours of new content added daily on YouTube. In 2020, the total amount of data created, captured, copied and consumed in the world was estimated at 59 zettabytes (one zettabyte is 1021 bytes, or a trillion gigabytes).
Clearly, no human being could ever analyse or understand such vast quantities of information! Instead, during the past 20 years has seen the rapid development of new tools in the field of statistics, giving rise to new fields such as data science, machine learning, and artificial intelligence. Most of the methods used in these areas are soundly based in statistics and mathematics, however they've arguably achieved better PR than the more traditional disciplines!
In this project, we will focus on methods of statistical and machine learning, which are key tools in the toolbox of a data scientist. At the core of machine learning are a variety of methods which sit at the intersection of statistics and computer science. These methods are developed with the emphasis primarily on making highly accurate predictions of some outcome of interest. In abstraction, we can treat these simply as a statistical or mathematical model for the quantity we want to predict. They often produce highly complex and hard-to-interpret models, but can nonetheless achieve impressive levels of accuracy.
The initial goals of the project will be to explore some key statistical learning methods. As the field is so broad, there is no shortage of options! However, some obvious directions for methodology would include
With this understanding you may then take the project in whatever direction you find most interesting, which might include going deeper into the methodology of a technique which interests you, or finding a data set to use for a real-world application.
This project has a focus on statistical methodology and data analysis. Familiarity with the statistical package R (or equivalent languages such as Python), general statistical concepts, and data analysis are essential.
Statistical Concepts II - for familiarity with standard statistical ideas and experience with R.
Statistical Methods III - for multivariate statistical concepts, and connections to regression.
Many books contain suitable introductory material. All books below are available freely and in full online.