It is estimated that there are about 1.8 zettabytes (1.8 trillion GB) of data today, and that 70-80% of that data is unstructured (text documents, notes, comments, surveys with free-text fields, medical charts). The real problem, however, is not the amount of information but our inability to process it. Unstructured text is particularly challenging for mathematical approaches, since language is ambiguous, context-dependent, and riddled with inconsistencies such as synonyms and homographs.
The first challenge in any modelling is to transform text into an appropriate numerical data format that allows us to apply statistical techniques. The simplest approach is to turn a document into a vector of word frequencies, so that a collection of documents becomes a collection of observations of word counts over a particular vocabulary (in practice there are refinements, such as removing very common words and weighting terms, but this is the core idea). For example, below we have taken the text of the Project 3 descriptions from 2015 and reduced each one to a vector of word frequencies. After some simple processing and normalising, we can represent the most common word frequencies graphically (left), or project the documents into a 2-D numerical space (right). Even with such a simple and aggressive dimension reduction, we begin to see separation between topics on different subjects based on word usage. A more sophisticated version of this representation is illustrated by methods such as word2vec, which not only records which words a document uses but also positions words with similar meanings close together in the vector space.
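As a rough illustration of this bag-of-words representation, here is a minimal R sketch. The four example sentences are made up for the illustration (they are not the 2015 project descriptions), and the processing is deliberately simple: lower-case the text, strip punctuation, count words, normalise for document length, and project the documents into two dimensions with principal components.

    # Hypothetical mini-corpus; the real data would be the project descriptions
    docs <- c("text mining turns documents into vectors of word counts",
              "word counts let us compare documents numerically",
              "statistical models and clustering need numerical data",
              "clustering groups documents with similar word usage")

    # Simple processing: lower-case, strip punctuation, split on whitespace
    tokens <- lapply(docs, function(d) {
      d <- tolower(gsub("[[:punct:]]", " ", d))
      strsplit(trimws(d), "\\s+")[[1]]
    })

    # Vocabulary and document-term matrix of raw word frequencies
    vocab <- sort(unique(unlist(tokens)))
    dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
    colnames(dtm) <- vocab

    # The most common words across the collection (cf. the left-hand plot)
    print(head(sort(colSums(dtm), decreasing = TRUE)))

    # Normalise for document length, then project into a 2-D space (right)
    dtm_norm <- dtm / rowSums(dtm)
    proj <- prcomp(dtm_norm)$x[, 1:2]
    print(round(proj, 3))

Even on this toy scale, the idea is that documents sharing vocabulary should land close together in the projected space.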
Once the text is reduced to numerical form, we can apply standard statistical methods to build models, find clusters, or predict the type or topic of new documents (a minimal clustering sketch is given below). Our goal in this project will be to apply these techniques to webpages in order to classify the pages according to their textual content. We will begin by studying how we can treat textual information as data, which we can understand and model with appropriate statistical techniques. From there, we will consider:
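To make the "find clusters" step concrete, here is an equally minimal sketch, again on made-up data rather than the Clicksco webpages: once documents are rows of a normalised document-term matrix, standard tools such as k-means or hierarchical clustering apply directly.

    set.seed(1)

    # Toy document-term matrix: six documents over a five-word vocabulary,
    # with two groups of documents that use different parts of the vocabulary
    dtm <- rbind(
      c(5, 4, 0, 0, 1),
      c(4, 5, 1, 0, 0),
      c(6, 3, 0, 1, 0),
      c(0, 1, 5, 4, 6),
      c(1, 0, 4, 6, 5),
      c(0, 0, 6, 5, 4)
    )
    colnames(dtm) <- c("goal", "match", "stock", "market", "price")

    # Normalise to relative frequencies so document length does not dominate
    dtm_norm <- dtm / rowSums(dtm)

    # k-means with two centres aims to recover the two groups of documents
    fit <- kmeans(dtm_norm, centers = 2, nstart = 10)
    print(fit$cluster)

    # Hierarchical clustering on document distances gives a similar picture
    hc <- hclust(dist(dtm_norm))
    print(cutree(hc, k = 2))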
For this project, we will be looking at real data supplied by Clicksco. The project focuses on data analysis and statistical computation, so familiarity with the statistical package R, general statistical concepts, programming skills, and practical data analysis are essential.
Statistical Concepts II, Statistical Methods III.