It is estimated that there are about 1.8 zettabytes (1.8 trillion GB) of data in the world today, and that 70-80% of that data is unstructured (text documents, notes, comments, surveys with free-text fields, medical charts). The real problem, however, is not the amount of information but our limited ability to process it. Text documents are particularly challenging for mathematical approaches, since language is ambiguous, context-dependent, and riddled with inconsistencies such as synonyms and homographs.
The first challenge in any modelling of text data is to transform the textual information into a numerical format to which statistical techniques can be applied. The simplest approach is to turn each document into a vector of word frequencies, so that a collection of documents becomes a collection of observations of word counts over a fixed vocabulary (in practice some preprocessing, such as normalising for document length, is also needed). For example, below we have taken the text of the Project 3 descriptions from 2015 and reduced each topic to a vector of word frequencies. After some simple processing and normalising, we can represent the most common word frequencies graphically (left), or as a projection into a 2-D numerical space (right). Even with such a simple and aggressive dimension reduction, we begin to see separation between topics on different subjects based on word usage alone. A more sophisticated version of this representation is found in methods such as word2vec, which attempt not only to record which words appear in a document, but also to position words with similar meanings close together in the vector space.
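To make the word-frequency representation concrete, the following sketch shows one way to build and project such vectors in base R. The documents, and all object names such as docs and dtm, are invented for illustration and are not the project data; a real analysis would use dedicated text-mining tools and far larger collections.

```r
## A minimal sketch: documents -> word-frequency vectors -> 2-D projection.
## The example documents below are invented purely for illustration.
docs <- c("statistical models of text treat documents as word counts",
          "word counts give a simple numerical summary of a document",
          "news articles about politics use a rather different vocabulary")

# Tokenise: lower-case each document and split on anything that is not a letter
tokens <- lapply(docs, function(d) {
  w <- unlist(strsplit(tolower(d), "[^a-z]+"))
  w[nchar(w) > 0]
})

# Document-term matrix: one row per document, one column per vocabulary word
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

# Normalise rows to relative frequencies so documents of different lengths
# are comparable
dtm <- dtm / rowSums(dtm)

# Project the high-dimensional frequency vectors into 2-D via principal
# components, mimicking the right-hand plot described above
pc <- prcomp(dtm)
plot(pc$x[, 1:2], pch = 19, xlab = "PC1", ylab = "PC2")
text(pc$x[, 1], pc$x[, 2], labels = seq_along(docs), pos = 3)
```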
We will begin by studying how textual information can be treated as data that we can understand and model with appropriate statistical techniques. We will apply these techniques to collections of text data (such as webpages, online news articles, or tweets); once the text is reduced to numerical form, we can apply standard statistical methods to explore the data, build models, find structure, label documents, identify similar documents, predict the type or topic of new documents, and so on. An interesting development from this basis would be to identify and explore 'trending' topics (such as Brexit, Covid, or Black Lives Matter) among news articles.
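As one illustration of working with documents in this numerical form, the sketch below computes pairwise cosine similarities between the rows of a document-term matrix, a standard way of identifying similar documents; it reuses the dtm object from the sketch above, which is an assumption of this example rather than part of the project.

```r
## Cosine similarity between documents represented as frequency vectors
## (rows of a document-term matrix such as `dtm` from the sketch above).
cosine_sim <- function(m) {
  # Scale each row to unit length; similarities are then inner products
  u <- m / sqrt(rowSums(m^2))
  u %*% t(u)
}

round(cosine_sim(dtm), 2)  # entries near 1 indicate similar word usage
```

Documents whose frequency vectors point in similar directions share much of their word usage, which is the basis for clustering documents or labelling new ones by their nearest neighbours.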
General statistical topics we could consider include:
For this project, we will have access to real data on webpage content and activity provided by Carbon. This project focuses on data analysis and statistical computation, so familiarity with the statistical package R, general statistical concepts, programming skills, and practical data analysis is essential.
Statistical Concepts II, Statistical Methods III.