In this practical, we will be utilising the caret package in R to perform cross-validation and classification tasks. The caret package is a powerful tool for machine learning in R and provides a consistent interface for many different models.

Cross-validation

Recall that cross-validation is a technique used to assess the performance of a model. The idea is to split the data into a training set and a test set, train the model on the training set and evaluate it on the test set. This process is repeated multiple times with different splits of the data, and the results are averaged to get an estimate of the model’s performance. In k-fold cross-validation, for example, the data are divided into k folds and each fold in turn is held out as the test set while the model is trained on the remaining folds.
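
As a concrete illustration, here is a minimal base-R sketch of k-fold cross-validation done by hand. It assumes a hypothetical data frame df with a numeric response y and uses a plain linear model; the caret package will automate all of this for us later.

# Illustrative k-fold cross-validation by hand (assumes a hypothetical
# data frame `df` with a numeric response `y`)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels, one per row
fold_mse <- numeric(k)
for (i in 1:k) {
  train_fold <- df[folds != i, ]
  test_fold  <- df[folds == i, ]
  fit <- lm(y ~ ., data = train_fold)
  preds <- predict(fit, newdata = test_fold)
  fold_mse[i] <- mean((test_fold$y - preds)^2)
}
mean(fold_mse)  # averaged estimate of the model's MSE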

Task 1 - splitting the data

We will start by creating some synthetic data to work with. We will generate a data set with 100 observations that have three features and one continuous response variable.

# Generate synthetic data
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100))
data$y <- data$x1 + 0.5*data$x2^2 +
  (data$x3 - 0.1*data$x2)^3 + rnorm(100, 0, 0.1)

# Pairs plot
????

Now that we have our data, we can split it into a training set and a test set. We will use 80% of the data for training and 20% for testing.

# Split the data
train_indices <- ????
data_train <- ????
data_test <- ????

# Pairs plot colouring by training/test
pairs(data,
      col = ifelse(1:nrow(data) %in% train_indices,
                   "blue", "red"))

Task 2 - fitting a polynomial

Fit a polynomial model of degree two to the training data and evaluate its performance on the test data (specifically, compute the MSE and the adjusted R-squared, and produce an actual-versus-predicted plot).

# Fit a polynomial model
model <- lm(y ~ poly(x1, 2) + poly(x2, 2) + poly(x3, 2),
            data = data_train)

# Predict on the test data
predictions <- ????

# Calculate MSE
????
  
# Calculate adjusted R-squared
# (Hint: it has been calculated in the model summary)
????

# Plot actual vs predictions
????
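
A possible completion, computing the MSE on the held-out test set and reading the adjusted R-squared from the training-fit summary:

# Predict on the test data
predictions <- predict(model, newdata = data_test)

# Mean squared error on the test set
mean((data_test$y - predictions)^2)

# Adjusted R-squared from the fitted model's summary
summary(model)$adj.r.squared

# Actual vs predicted values with a reference line
plot(data_test$y, predictions,
     xlab = "Actual y", ylab = "Predicted y")
abline(0, 1, col = "red")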

Task 3 - fitting using cross validation

Now, we will use cross-validation to fit a polynomial model of degree two to the data. We will use 5-fold cross-validation.

# Load the caret package
library(caret)

# Set up the cross-validation
control <- trainControl(method = "cv",
                        number = 5)

# Fit the model using cross-validation
model_cv <- train(y ~ poly(x1, 2) + poly(x2, 2) + poly(x3, 2),
                  data = data,
                  method = "lm",
                  trControl = control)

Now, let’s try to understand what is contained in the model_cv object.

# List the elements
names(model_cv)

Typing ?train in the console will give you more information about the train function and these outputs.

We are interested in the performance of the model. Let’s extract the results from the cross-validation.

# Extract the results
model_cv$results

# Hence, calculate the MSE
????

# Make predictions on the "test" data
predictions_cv <- ????

# Plot actual vs predictions
????

# Overlay points from initial model
????
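
A possible completion. Note that the cross-validated MSE can be recovered by squaring the RMSE reported by caret, and that predict() on a train object uses the final model, which caret refits on the full data set.

# Square the cross-validated RMSE to obtain an MSE estimate
model_cv$results$RMSE^2

# Predictions from the final model refitted on all of the data
predictions_cv <- predict(model_cv, newdata = data)

# Actual vs predicted values
plot(data$y, predictions_cv,
     xlab = "Actual y", ylab = "Predicted y")
abline(0, 1, col = "red")

# Overlay the test-set predictions from the initial model
points(data_test$y, predictions, col = "blue", pch = 19)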

Task 4 - repeat with leave-one-out cross-validation

Now, we will repeat the cross-validation process, but this time we will use leave-one-out cross-validation. This is a special case of cross-validation where each observation is used as the test set once, and the rest of the data is used as the training set.

# Set up the cross-validation
control <- trainControl(method = "LOOCV")

# Fit the model using leave-one-out cross-validation
model_loocv <- train(????,
                     data = ????,
                     method = ????,
                     trControl = control)

# Extract the results
????

# Calculate the MSE 
????

# Make predictions on the "test" data
predictions_loocv <- ????

# Plot actual vs predictions
????

# Overlay points from initial model
????

# Overlay points from 5-fold cross-validation
????
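
A possible completion, reusing the same model formula as before:

# Fit the model using leave-one-out cross-validation
model_loocv <- train(y ~ poly(x1, 2) + poly(x2, 2) + poly(x3, 2),
                     data = data,
                     method = "lm",
                     trControl = control)

# Extract the results and the corresponding MSE
model_loocv$results
model_loocv$results$RMSE^2

# Predictions from the final model
predictions_loocv <- predict(model_loocv, newdata = data)

# Actual vs predicted values, overlaying the earlier fits
plot(data$y, predictions_loocv,
     xlab = "Actual y", ylab = "Predicted y")
abline(0, 1, col = "red")
points(data_test$y, predictions, col = "blue", pch = 19)    # initial model
points(data$y, predictions_cv, col = "darkgreen", pch = 2)  # 5-fold CV

The 5-fold and leave-one-out points should lie on top of each other, because in both cases caret refits the same linear model on the full data set; the two approaches differ in their performance estimates, not in the final fit.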

Classification

Task 5 - k-nearest neighbours

We will utilise part of a weather classification data set to demonstrate the k-nearest neighbours algorithm. The full data set contains 13,200 observations of 11 features. We will attempt to predict the season from four continuous features.

# Load the data
weather_full <- read.csv("https://www.maths.dur.ac.uk/users/john.p.gosling/MATH3431_practicals/weather_classification_data.csv")

# Display the first few rows
????

# Select the features of interest
weather <- weather_full[,c(8,1,3,4,6)]

# Convert the season to a factor
weather$Season <- as.factor(weather$Season)

# Summarise the data
????
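
Possible completions of the two inspection gaps above, using head() and summary():

# Display the first few rows of the full data set
head(weather_full)

# Summarise the selected features and the season factor
summary(weather)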

Now, we will split the data into a training set and a test set (80/20 split).

# Split the data
train_indices <- ????
weather_train <- ????
weather_test <- ????
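
One possible split, mirroring the approach used for the synthetic data (again, set.seed() is optional but makes the split reproducible):

# 80/20 split of the weather data
set.seed(1)
train_indices <- sample(1:nrow(weather), size = round(0.8 * nrow(weather)))
weather_train <- weather[train_indices, ]
weather_test  <- weather[-train_indices, ]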

Utilise the caret package to fit a k-nearest neighbours model to the training data. We will use 10-fold cross-validation to select the optimal value of k.

# Fit a k-nearest neighbours model
model_knn <- train(Season ~ .,
                   data = ????,
                   method = ????,
                   trControl = trainControl(method = "cv",
                                            number = 10))

# Extract the results
????

# Make predictions on the test data
predictions_knn <- ????

# Calculate the confusion matrix
????
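
A possible completion using caret's "knn" method, which tunes k over a default grid of candidate values:

# Fit k-nearest neighbours, letting caret tune k via 10-fold CV
model_knn <- train(Season ~ .,
                   data = weather_train,
                   method = "knn",
                   trControl = trainControl(method = "cv",
                                            number = 10))

# Cross-validated accuracy for each candidate value of k
model_knn$results

# Predictions on the held-out test set
predictions_knn <- predict(model_knn, newdata = weather_test)

# Confusion matrix of predicted against observed seasons
confusionMatrix(predictions_knn, weather_test$Season)

Because k-nearest neighbours is distance-based, you may also wish to add preProcess = c("center", "scale") to the train() call so that all features are on a comparable scale.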

What are your thoughts on the model performance?

Task 6 - naive Bayes

We will now use the naive Bayes algorithm to classify the weather data set. We will use the same features as before.

# Fit a naive Bayes model
model_nb <- train(????,
                  data = ????,
                  method = "naive_bayes",
                  trControl = ????)

# Extract the results
????

# Make predictions on the test data
predictions_nb <- ????

# Calculate the confusion matrix
????
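
A possible completion, assuming the naivebayes package (which backs caret's "naive_bayes" method) is installed:

# Fit a naive Bayes classifier with the same 10-fold CV set-up
model_nb <- train(Season ~ .,
                  data = weather_train,
                  method = "naive_bayes",
                  trControl = trainControl(method = "cv",
                                           number = 10))

# Cross-validated accuracy over the tuning grid
model_nb$results

# Predictions and confusion matrix on the test set
predictions_nb <- predict(model_nb, newdata = weather_test)
confusionMatrix(predictions_nb, weather_test$Season)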

Better or worse than the k-nearest neighbours model? Try looking at a pairs plot of the data to see if you can understand why.

# Pairs plot of a random subsample, coloured by season
idx <- sample(1:nrow(weather), 1000)
pairs(weather[idx, 2:5],
      col = weather$Season[idx])

# Also consider the names of the full set of variables
names(weather_full)

Task 7 - decision trees

We now move on to a classification task with categorical variables. For this, we will be using a data set that considers whether a mushroom is edible or poisonous based on various features.

# Load the data
mushroom <- read.csv("https://www.maths.dur.ac.uk/users/john.p.gosling/MATH3431_practicals/mushrooms.csv")

# Display the first few rows
head(mushroom)

# Convert all variables to factors and put back into the data frame
mushroom <- lapply(mushroom, as.factor)
mushroom <- as.data.frame(mushroom)

# Remove all but the first four predictors
# to aid interpretation of the tree
mushroom <- mushroom[,1:5]

# Summarise the data
????

The first column of the data set contains the class label (edible or poisonous), and the remaining columns contain the features. We will use a decision tree to classify the mushrooms.

# Load the `rpart` package
library(rpart)

# Fit a decision tree using 5-fold cross-validation
model_tree <- train(????,
                    data = ????,
                    method = "rpart",
                    trControl = trainControl(method = "cv",
                                             number = 5))
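
One possible way to fill in the gaps, assuming the class label sits in the first column (as stated above). Building the formula from names(mushroom)[1] avoids hard-coding the column name:

# Fit the tree, building the formula from the first column's name
class_formula <- as.formula(paste(names(mushroom)[1], "~ ."))
model_tree <- train(class_formula,
                    data = mushroom,
                    method = "rpart",
                    trControl = trainControl(method = "cv",
                                             number = 5))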

It can be useful to visualise the fitted tree.

# Load the `rpart.plot` package
library(rpart.plot)

# Plot the tree
rpart.plot(model_tree$finalModel)

What do the numbers relate to on the tree? Could you use the tree to classify a new mushroom?