Welcome to the MLNN3 practical sessions. The point of these practicals is three-fold:
In this practical, we will be reminding ourselves of some of the basics of R and investigating the use of different performance metrics and distance measures in regression and classification problems.
There are three principal ways in which to complete the practicals:
The advantage of the third option (the GitHub Codespace) is that R has been set up with all the required packages pre-installed and tested, but you will need a GitHub account to use it.
We will also be attempting to use R Markdown to facilitate submission of a couple of formative assessments and the practical exam. To get R Markdown working in RStudio so that you can export your code and answers as a PDF, we need to install LaTeX within RStudio. (Again, this is already installed and set up in the Codespace version.)
install.packages('rmarkdown')
install.packages('tinytex') # This might not be necessary if it is part of the rmarkdown install.
tinytex::install_tinytex()
Open a new R Markdown document by clicking on `File -> New File -> R Markdown...`. Give the document any title you wish, put yourself as author and select PDF as the output type.
Delete everything after line 10.
Add a heading.
# My main section
Add some maths.
Here's some maths:
$$
\log(x^3) \neq \frac{\exp(3x)}{x}.
$$
Add some maths with better alignment in its own subsection.
## Align subsection
Adding maths in an "align" environment makes it easier to line things up.
\begin{align*}
\mu &\sim \text{N}(0,1),\\
X_i|\mu &\sim \text{N}(\mu,1), ~~~i=1,\dots,n.
\end{align*}
Add some R code.
# R section
Here's some very simple R code.
```{r}
x <- rnorm(100)
y <- runif(100) + x
```
Add a simple scatter plot of `y` against `x`.
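If you want a starting point, a minimal chunk along these lines should do (the axis labels are just suggestions):

```{r}
plot(x, y,
     xlab = "x", ylab = "y")
```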
Press the Knit button in the editor bar, and let’s see if things have been set up correctly.
Let’s start by getting some interesting data into your R environment: the `Glass` dataset taken from the `mlbench` package.
# Load in the data
Glass <- read.csv("https://www.maths.dur.ac.uk/users/john.p.gosling/MATH3431_practicals/Glass.csv")
# Look at the first few rows
head(Glass)
# Look at a pairs plot
pairs(Glass)
We can find more information about the dataset here: <https://rdrr.io/cran/mlbench/man/Glass.html>.
Here, we will build a model of the refractive index `RI` based on the other variables in the dataset, ignoring the `Type` variable.
# Fit a linear model
lm1 <- lm(RI ~ . - Type,
data = Glass)
# Summarise the model
????
# Make predictions
preds <- ????
# Plot the predictions against the actual values
????
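If you get stuck, here is one possible way to fill in the gaps (a sketch; it assumes `lm1` has been fitted as above):

```r
# Summarise the model
summary(lm1)

# Make predictions (fitted values for the training data)
preds <- predict(lm1)

# Plot the predictions against the actual values
plot(Glass$RI, preds,
     xlab = "Observed RI", ylab = "Predicted RI")
abline(0, 1)  # points on this line are predicted exactly
```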
What else can we say about model performance? The model summary gives us an adjusted R-squared value of 0.89, but we can also calculate the mean squared error (MSE) and the mean absolute error (MAE) to get a better idea of how well the model is performing.
# Calculate the MSE
????
# Calculate the MAE
????
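A sketch, assuming `preds` holds the predictions from `lm1`:

```r
# Mean squared error
mean((Glass$RI - preds)^2)

# Mean absolute error
mean(abs(Glass$RI - preds))
```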
Now, does the model perform better if we remove some of the variables? Perhaps just concentrate on the variables that have significant coefficients in the model.
# Fit a linear model
lm2 <- lm(RI ~ Na + Mg + K + Ca + Ba,
data = ????)
# Summarise the model
????
# Make predictions
preds2 <- ???
# Calculate the MSE
????
# Calculate the MAE
????
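One possible completion, mirroring the steps used for `lm1` (it assumes `lm2` is fitted with `data = Glass`):

```r
# Summarise the model
summary(lm2)

# Make predictions
preds2 <- predict(lm2)

# MSE and MAE
mean((Glass$RI - preds2)^2)
mean(abs(Glass$RI - preds2))
```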
Is this any better? Consider a model that is built with just the variables with insignificant coefficients.
# Fit a linear model
lm3 <- lm(RI ~ Si + Al + Fe,
data = Glass)
# Summarise the model
????
# Make predictions
????
# Calculate the MSE
????
# Calculate the MAE
????
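Again, a sketch of the missing steps:

```r
# Summarise the model
summary(lm3)

# Make predictions
preds3 <- predict(lm3)

# MSE and MAE
mean((Glass$RI - preds3)^2)
mean(abs(Glass$RI - preds3))
```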
We can extend the linear model to include interactions between the variables. This can be done by including the `:` operator in the formula (if we are still interested in the direct effects, it is useful to use the `*` operator, which expands to the main effects plus the interaction). Let’s concentrate on a model that just utilises the variables `Na` and `Ba`.
# Fit a linear model without interactions
lm4 <- ????
# Summarise the model
????
# Fit a model with the interaction
lm5 <- lm(RI ~ Na*Ba,
data = Glass)
# Summarise the model
????
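A sketch of one way to fill in the gaps; here the model without the interaction is taken to be the one with just the two main effects:

```r
# Fit a linear model without interactions
lm4 <- lm(RI ~ Na + Ba, data = Glass)
summary(lm4)

# Summarise the interaction model
summary(lm5)
```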
What impact is the interaction having on the model? How would we know when to include interaction terms?
Now, let’s have a data-centric look at the relationships.
# Create a proxy for the interaction term
Glass$Na_Ba <- Glass$Na * Glass$Ba
# Look at the relationships between the variables
pairs(????)
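For example, restricting the pairs plot to the variables of interest (this particular choice of columns is just one option):

```r
pairs(Glass[, c("RI", "Na", "Ba", "Na_Ba")])
```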
Is the relationship between `Na` and `RI` linear? If not, how could we model it?
To illustrate the use of linear classification, we will now use the `LetterRecognition` dataset from the `mlbench` package. This dataset contains 20,000 observations of 17 variables: each observation corresponds to a letter of the alphabet, with the first variable being the letter itself and the remaining 16 being numerical measures of that letter. To keep things simple, we will just look at the classification of the letters `A`, `I` and `W`.
# Load in the data
LetterRecognition <- read.csv("https://www.maths.dur.ac.uk/users/john.p.gosling/MATH3431_practicals/LetterRecognition.csv")
# Look at the first few rows
????
# Look at the structure of the data
str(LetterRecognition)
# read.csv() reads lettr in as a character vector (in R 4.0 and later),
# so turn it into a factor before looking at its levels
LetterRecognition$lettr <- factor(LetterRecognition$lettr)
# Look at the levels of the letter variable
levels(LetterRecognition$lettr)
# Create a subset of the data
ltrs <- subset(LetterRecognition,
lettr %in% c("I", "A", "W"))
# Reset the levels of the letter variable
ltrs$lettr <- factor(ltrs$lettr)
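It is worth checking that the subsetting has worked as intended; a quick sketch:

```r
# First few rows of the reduced dataset
head(ltrs)

# Counts of the three letters
table(ltrs$lettr)
```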
Let’s try to visualise the data.
# A jittered scatter plot of the width vs the height with the letters coloured
plot(ltrs$width+rnorm(nrow(ltrs), 0, 0.1),
ltrs$high+rnorm(nrow(ltrs), 0, 0.1),
col = as.numeric(ltrs$lettr),
pch = 19, cex = 0.2,
xlab = "Width", ylab = "Height")
# Add a legend
legend("bottomright", legend = levels(ltrs$lettr), col = 1:3, pch = 19)
Let’s use a linear discriminant analysis (LDA) to classify the letters.
# Load the MASS package
library(MASS)
# Fit the LDA model
lda1 <- lda(lettr ~ .,
data = ltrs)
# Summarise the model
lda1
# Make predictions
preds <- ????
# Calculate the confusion matrix using the table function
table(ltrs$lettr, preds$class)
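The missing prediction step is just a call to `predict()` on the fitted LDA object (with no new data supplied, it returns predictions for the training data):

```r
preds <- predict(lda1)
```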
LDA does an excellent job of classifying the letters. But haven’t we just used the same data to train and test the model? How can we be sure that the model will generalise to new data? We will answer these questions in later sessions.
Now, let’s just reduce the number of variables to see if we can still classify the letters.
# Fit the LDA model
lda2 <- lda(lettr ~ x.box + y.box + width + high,
data = ltrs)
# Summarise the model
????
# Make predictions
preds2 <- ????
# Calculate the confusion matrix
????
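One possible completion, following the same pattern as before:

```r
# Summarise the model
lda2

# Make predictions
preds2 <- predict(lda2)

# Calculate the confusion matrix
table(ltrs$lettr, preds2$class)
```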
Now, we have good separation between `I` and `W`, but `A` is being misclassified fairly regularly. Let’s have a look at some performance metrics.
# Calculate the overall accuracy
# (hint consider the elements of the confusion matrix)
????
# Calculate the precision for `A`
????
# Calculate the recall for `A`
????
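A sketch, assuming the confusion matrix has the true letters in the rows and the predicted letters in the columns (as produced by `table(ltrs$lettr, preds2$class)`):

```r
cm <- table(ltrs$lettr, preds2$class)

# Overall accuracy: correct classifications / all classifications
sum(diag(cm)) / sum(cm)

# Precision for A: correct A predictions / all A predictions
cm["A", "A"] / sum(cm[, "A"])

# Recall for A: correct A predictions / all true As
cm["A", "A"] / sum(cm["A", ])
```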
As in linear regression, we are free to transform the variables in any manner we wish. Let’s try to classify the letters using a linear model that utilises the `width` and `high` variables, but with a transformation of the `width` variable.
# Fit the LDA model
lda3 <- lda(lettr ~ I(width^0.5) + high,
data = ltrs)
# Summarise the model
????
# Make predictions
preds3 <- ????
# Calculate the confusion matrix
????
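Again, the gaps can be filled in the same way:

```r
# Summarise the model
lda3

# Make predictions
preds3 <- predict(lda3)

# Calculate the confusion matrix
table(ltrs$lettr, preds3$class)
```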
How does this model compare to the previous one?
# Calculate the overall accuracy
????
# Calculate the precision for `A`
????
# Calculate the recall for `A`
????