Welcome to the MLNN3 practical sessions. The point of these practicals is three-fold:
In this practical, we will be reminding ourselves of some of the basics of R and investigating the use of different performance metrics and distance measures in regression and classification problems.
There are three principal ways in which to complete the practicals:
The advantage of the third option (the GitHub Codespace) is that R has been set up with all the required packages pre-installed and tested, but you will need a GitHub account to use it.
We will also be attempting to use R Markdown to facilitate submission of a couple of formative assessments and the practical exam. To get R Markdown working in RStudio so that you can export your code and answers as a PDF, we need to install LaTeX within RStudio. (Again, this is already installed and set up in the Codespace version.)
install.packages('rmarkdown')
install.packages('tinytex') # This might not be necessary if it is part of the rmarkdown install.
tinytex::install_tinytex()
Open a new R Markdown document by clicking on `File -> New File -> R Markdown...`. Give the document any title you wish, put yourself as author and select PDF as the output type.
Delete everything after line 10.
Add a heading.
# My main section
Add some maths.
Here's some maths:
$$
\log(x^3) \neq \frac{\exp(3x)}{x}.
$$
Add some maths with better alignment in its own subsection.
## Align subsection
Adding maths in an "align" environment makes it easier to line things up.
\begin{align*}
\mu &\sim \text{N}(0,1),\\
X_i|\mu &\sim \text{N}(\mu,1), ~~~i=1,\dots,n.
\end{align*}
Add some R code.
# R section
Here's some very simple R code.
```{r}
x <- rnorm(100)
y <- runif(100) + x
```
Add a simple scatter plot of `y` against `x`.
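If you want a starting point, a minimal chunk along these lines should do (the axis labels are just suggestions):

```{r}
plot(x, y,
     xlab = "x", ylab = "y")
```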
Press the Knit button in the editor bar, and let’s see if things have been set up correctly.
Let’s start by getting some interesting data into your R environment: the `Glass` dataset taken from the `mlbench` package.
# Load in the data
Glass <- read.csv("https://www.maths.dur.ac.uk/users/john.p.gosling/MATH3431_practicals/Glass.csv")
# Look at the first few rows
head(Glass)
# Look at a pairs plot
pairs(Glass)
We can find more information about the dataset here: <https://rdrr.io/cran/mlbench/man/Glass.html>.
Here, we will build a model of the refractive index `RI` based on the other variables in the dataset, ignoring the `Type` variable.
# Fit a linear model
lm1 <- lm(RI ~ . - Type,
data = Glass)
# Summarise the model
????
# Make predictions
preds <- ????
# Plot the predictions against the actual values
????
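If you get stuck, here is one possible way to fill in the gaps (a sketch; it assumes `lm1` has been fitted as above):

```r
# Summarise the model
summary(lm1)

# Make predictions (fitted values for the training data)
preds <- predict(lm1)

# Plot the predictions against the actual values
plot(Glass$RI, preds,
     xlab = "Observed RI", ylab = "Predicted RI")
abline(0, 1)  # points on this line are predicted exactly
```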
What else can we say about model performance? The model summary gives us an adjusted R-squared value of 0.89, but we can also calculate the mean squared error (MSE) and the mean absolute error (MAE) to get a better idea of how well the model is performing.
# Calculate the MSE
????
# Calculate the MAE
????
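A sketch, assuming `preds` holds the predictions from `lm1`:

```r
# Mean squared error
mean((Glass$RI - preds)^2)

# Mean absolute error
mean(abs(Glass$RI - preds))
```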
Now, does the model perform better if we remove some of the variables? Perhaps just concentrate on the variables that have significant coefficients in the model.
# Fit a linear model
lm2 <- lm(RI ~ Na + Mg + K + Ca + Ba,
data = ????)
# Summarise the model
????
# Make predictions
preds2 <- ???
# Calculate the MSE
????
# Calculate the MAE
????
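One possible completion, mirroring the steps used for `lm1` (it assumes `lm2` is fitted with `data = Glass`):

```r
# Summarise the model
summary(lm2)

# Make predictions
preds2 <- predict(lm2)

# MSE and MAE
mean((Glass$RI - preds2)^2)
mean(abs(Glass$RI - preds2))
```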
Is this any better? Consider a model that is built with just the variables with insignificant coefficients.
# Fit a linear model
lm3 <- lm(RI ~ Si + Al + Fe,
data = Glass)
# Summarise the model
????
# Make predictions
????
# Calculate the MSE
????
# Calculate the MAE
????
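Again, a sketch of the missing steps:

```r
# Summarise the model
summary(lm3)

# Make predictions
preds3 <- predict(lm3)

# MSE and MAE
mean((Glass$RI - preds3)^2)
mean(abs(Glass$RI - preds3))
```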
We can extend the linear model to include interactions between the variables. This can be done by including the `:` operator in the formula (if we are still interested in the direct effects, it is useful to use the `*` operator, which expands to the main effects plus the interaction). Let’s concentrate on a model that just utilises the variables `Na` and `Ba`.
# Fit a linear model without interactions
lm4 <- ????
# Summarise the model
????
# Fit a model with the interaction
lm5 <- lm(RI ~ Na*Ba,
data = Glass)
# Summarise the model
????
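A sketch of one way to fill in the gaps; here the model without the interaction is taken to be the one with just the two main effects:

```r
# Fit a linear model without interactions
lm4 <- lm(RI ~ Na + Ba, data = Glass)
summary(lm4)

# Summarise the interaction model
summary(lm5)
```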
What impact is the interaction having on the model? How would we know when to include interaction terms?
Now, let’s have a data-centric look at the relationships.
# Create a proxy for the interaction term
Glass$Na_Ba <- Glass$Na * Glass$Ba
# Look at the relationships between the variables
pairs(????)
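For example, restricting the pairs plot to the variables of interest (this particular choice of columns is just one option):

```r
pairs(Glass[, c("RI", "Na", "Ba", "Na_Ba")])
```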
Is the relationship between `Na` and `RI` linear? If not, how could we model it?
To illustrate the use of linear classification, we will now use the `LetterRecognition` dataset from the `mlbench` package. This dataset contains 20,000 observations of 17 variables: each observation corresponds to a letter of the alphabet, with the first variable being the letter itself and the remaining 16 being numerical measures of that letter. To keep things simple, we will just look at the classification of the letters `A`, `I` and `W`.
# Load in the data
LetterRecognition <- read.csv("https://www.maths.dur.ac.uk/users/john.p.gosling/MATH3431_practicals/LetterRecognition.csv")
# Look at the first few rows
????
# Look at the structure of the data
str(LetterRecognition)
# read.csv() reads lettr in as a character vector (in R 4.0 and later),
# so turn it into a factor before looking at its levels
LetterRecognition$lettr <- factor(LetterRecognition$lettr)
# Look at the levels of the letter variable
levels(LetterRecognition$lettr)
# Create a subset of the data
ltrs <- subset(LetterRecognition,
lettr %in% c("I", "A", "W"))
# Reset the levels of the letter variable
ltrs$lettr <- factor(ltrs$lettr)
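It is worth checking that the subsetting has worked as intended; a quick sketch:

```r
# First few rows of the reduced dataset
head(ltrs)

# Counts of the three letters
table(ltrs$lettr)
```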
Let’s try to visualise the data.
# A jittered scatter plot of the width vs the height with the letters coloured
plot(ltrs$width+rnorm(nrow(ltrs), 0, 0.1),
ltrs$high+rnorm(nrow(ltrs), 0, 0.1),
col = as.numeric(ltrs$lettr),
pch = 19, cex = 0.2,
xlab = "Width", ylab = "Height")
# Add a legend
legend("bottomright", legend = levels(ltrs$lettr), col = 1:3, pch = 19)
Let’s use a linear discriminant analysis (LDA) to classify the letters.
# Load the MASS package
library(MASS)
# Fit the LDA model
lda1 <- lda(lettr ~ .,
data = ltrs)
# Summarise the model
lda1
# Make predictions
preds <- ????
# Calculate the confusion matrix using the table function
table(ltrs$lettr, preds$class)
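The missing prediction step is just a call to `predict()` on the fitted LDA object (with no new data supplied, it returns predictions for the training data):

```r
preds <- predict(lda1)
```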
LDA does an excellent job of classifying the letters. But haven’t we just used the same data to train and test the model? How can we be sure that the model will generalise to new data? We will answer these questions in later sessions.
Now, let’s just reduce the number of variables to see if we can still classify the letters.
# Fit the LDA model
lda2 <- lda(lettr ~ x.box + y.box + width + high,
data = ltrs)
# Summarise the model
????
# Make predictions
preds2 <- ????
# Calculate the confusion matrix
????
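One possible completion, following the same pattern as before:

```r
# Summarise the model
lda2

# Make predictions
preds2 <- predict(lda2)

# Calculate the confusion matrix
table(ltrs$lettr, preds2$class)
```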
Now, we have good separation between `I` and `W`, but `A` is being misclassified fairly regularly. Let’s have a look at some performance metrics.
# Calculate the overall accuracy
# (hint consider the elements of the confusion matrix)
????
# Calculate the precision for `A`
????
# Calculate the recall for `A`
????
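A sketch, assuming the confusion matrix has the true letters in the rows and the predicted letters in the columns (as produced by `table(ltrs$lettr, preds2$class)`):

```r
cm <- table(ltrs$lettr, preds2$class)

# Overall accuracy: correct classifications / all classifications
sum(diag(cm)) / sum(cm)

# Precision for A: correct A predictions / all A predictions
cm["A", "A"] / sum(cm[, "A"])

# Recall for A: correct A predictions / all true As
cm["A", "A"] / sum(cm["A", ])
```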
As in linear regression, we are free to transform the variables in any manner we wish. Let’s try to classify the letters using a linear model that utilises the `width` and `high` variables, but with a transformation of the `width` variable.
# Fit the LDA model
lda3 <- lda(lettr ~ I(width^0.5) + high,
data = ltrs)
# Summarise the model
????
# Make predictions
preds3 <- ????
# Calculate the confusion matrix
????
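Again, the gaps can be filled in the same way:

```r
# Summarise the model
lda3

# Make predictions
preds3 <- predict(lda3)

# Calculate the confusion matrix
table(ltrs$lettr, preds3$class)
```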
How does this model compare to the previous one?
# Calculate the overall accuracy
????
# Calculate the precision for `A`
????
# Calculate the recall for `A`
????