MLNN fundamentals

JP Gosling

2024-10-15

Learning


Source: xkcd.com/1838

Statistical learning


“Statistical learning, with its roots firmly in probability and statistics, seeks to unveil the hidden structure within data. By building interpretable models, it allows us to not only make predictions but also understand the relationships between variables and even draw causal inferences. Think of it this way: statistical learning, like a skilled detective, meticulously examines the data to uncover the underlying story.”

Machine learning


“Machine learning (ML), however, takes a more pragmatic approach. Borrowing heavily from computer science, it prioritises the creation of algorithms that excel at making accurate predictions on entirely new data. While interpretability might be sacrificed in the process, the raw predictive power gained can be immense. Complex algorithms like neural networks exemplify this philosophy – they may be opaque in their inner workings, but their ability to learn intricate patterns from data allows them to make impressive predictions.”

SL vs ML (1)

  • I asked ChatGPT, “which is best: statistical learning or machine learning?”.
  • “Statistical learning focuses on understanding data properties and making inferences, while machine learning prioritises predictive accuracy and scalability. Use statistical learning for hypothesis testing and uncovering relationships, and machine learning for accurate predictions.”
  • I then asked two follow-up questions: “Don’t statistical methods produce accurate predictions?” and “Can’t ML methods be used to interpret complex data?”.
  • “Certainly! I apologise if my previous response caused any confusion…”

SL vs ML (2)


Source: Google NGram Viewer

SL vs ML (3)


Source: Google NGram Viewer

SL vs ML (4)


Source: Created using the Image Creator in Bing

Confession

Here’s a figure from my most recent paper…

End of section

Model Performance


Source: Created using the Image Creator in Bing

Dividing the data (1)

We have a finite set of data points: how do we best use it to train the model, pick the correct model settings and check its generalisation performance?
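
One common approach is a three-way split. Here is a minimal sketch in R; the 60/20/20 proportions and the seed are arbitrary choices for illustration.

set.seed(123)
data(iris)

# Shuffle the row indices once
shuffled <- sample(1:nrow(iris), nrow(iris))

# 60% training, 20% validation, 20% test (arbitrary proportions)
n_train <- round(0.6 * nrow(iris))
n_valid <- round(0.2 * nrow(iris))

train_set <- iris[shuffled[1:n_train], ]
valid_set <- iris[shuffled[(n_train + 1):(n_train + n_valid)], ]
test_set  <- iris[shuffled[-(1:(n_train + n_valid))], ]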

Dividing the data (2)

Let’s look at an example of LDA applied again to the iris dataset. Because LDA makes so many assumptions (and there are no tuning parameters to select), we don’t really need to worry about having a separate validation set.

data(iris)

# Split the data
train_index <- sample(1:nrow(iris),
                      0.5 * nrow(iris))
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

Dividing the data (3)

Dividing the data (4)

Dividing the data (5)

How about the 75 test observations?

# Fit the LDA model (lda() lives in the MASS package)
library(MASS)
lda_model <- lda(Species ~ ., data = train_data)

# Predict the test data
lda_pred <- predict(lda_model, test_data)

# Confusion matrix
table(lda_pred$class, test_data$Species)
##             
##              setosa versicolor virginica
##   setosa         29          0         0
##   versicolor      0         24         0
##   virginica       0          0        22

Dividing the data incorrectly (1)

What if we split the data differently and hoped for the best?

data(iris)

# Split the data
train_index <- 1:75
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

Dividing the data incorrectly (2)

Dividing the data incorrectly (3)

Dividing the data incorrectly (4)

How about the 75 test observations?

# Fit the LDA model
lda_model <- lda(Species ~ ., data = train_data)
## Warning in lda.default(x, grouping, ...): group virginica is empty
# Predict the test data
lda_pred <- predict(lda_model, test_data)

# Confusion matrix
table(lda_pred$class, test_data$Species)
##             
##              setosa versicolor virginica
##   setosa          0          0         0
##   versicolor      0         25        50
##   virginica       0          0         0

Comparators - classification

Apart from LDA, we consider two other baseline classifiers:


  • random guessing: a classifier that randomly assigns a class to each observation based on observed class proportions;


  • majority voting: a classifier that assigns the most frequent class to each observation.
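
A minimal sketch of these two baselines in R, reusing the random train/test split of the iris data from the LDA example (the seed is an arbitrary choice):

set.seed(123)

# Random guessing: sample classes using the training-set proportions
class_props <- table(train_data$Species) / nrow(train_data)
random_guess <- sample(levels(train_data$Species),
                       nrow(test_data),
                       replace = TRUE,
                       prob = as.numeric(class_props))

# Majority voting: always predict the most frequent training class
majority_class <- names(which.max(table(train_data$Species)))

# Test-set accuracies of the two baselines
mean(random_guess == test_data$Species)
mean(majority_class == test_data$Species)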

Comparators - regression

Apart from linear regression, we consider two other baseline regression models:


  • random guessing: a regression model that randomly assigns a value to each observation based on observed values in the training set;


  • mean prediction: a regression model that assigns the mean value of the training set to each observation.
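
A minimal sketch of these two baselines in R; the response vectors here are simulated purely for illustration:

set.seed(123)

# Illustrative training and test responses
y_train <- rnorm(100, mean = 10, sd = 2)
y_test  <- rnorm(50, mean = 10, sd = 2)

# Random guessing: resample the observed training values
random_guess <- sample(y_train, length(y_test), replace = TRUE)

# Mean prediction: always predict the training mean
mean_pred <- rep(mean(y_train), length(y_test))

# Compare the baselines via RMSE
sqrt(mean((y_test - random_guess)^2))
sqrt(mean((y_test - mean_pred)^2))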

Confusion matrices (1)

Imagine we have a binary classification problem with two classes: positive and negative. The confusion matrix is a \(2 \times 2\) matrix that summarises the performance of a classifier.


                  Predicted Positive    Predicted Negative
Actual Positive   True Positive (TP)    False Negative (FN)
Actual Negative   False Positive (FP)   True Negative (TN)

Confusion matrices (2)

The confusion matrix can be extended to multi-class classification problems. For a \(k\)-class problem, the confusion matrix is a \(k \times k\) matrix. The diagonal elements represent the number of correct predictions for each class, but this can get a bit unwieldy for large \(k\).

The statistics contained in the confusion matrix can be used to calculate a number of performance metrics.


See notes
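
As a sketch of the most common metrics, here is a direct calculation in R from the four cell counts of a binary confusion matrix; the counts themselves are invented for illustration.

# Illustrative cell counts for a binary confusion matrix
TP <- 40; FN <- 10; FP <- 5; TN <- 45

accuracy  <- (TP + TN) / (TP + FP + FN + TN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # also called sensitivity or TPR
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)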

The perils of class imbalance

Consider a binary classification problem with 95 data points belonging to the negative class and 5 belonging to the positive class. A classifier that always predicts negative will achieve an accuracy of 95%.


                  Predicted Positive   Predicted Negative
Actual Positive                    0                    5
Actual Negative                    0                   95


The precision and recall of the classifier with respect to the negative class are 95% (95/100) and 100% (95/95) respectively but, for the positive class, they are undefined (0/0) and 0% (0/5) respectively.

Precision and recall curves (1)

If we can vary the threshold for classifying an observation as positive, we can plot the precision and recall of the classifier as the threshold changes. This is known as the precision-recall curve.


  • Let’s consider a binary classification problem with 1,000 observations, 100 of which are positive.
  • As we increase the sensitivity of the classifier, we will classify more observations as positive regardless of their true class.
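
A minimal sketch of tracing out such a curve in R. It assumes the classifier produces a score for each observation; the class sizes match the example above, but the score-generating mechanism is invented for illustration.

set.seed(123)

# 1,000 observations, 100 of them positive, with illustrative scores
truth <- factor(c(rep("positive", 100), rep("negative", 900)))
score <- c(rnorm(100, mean = 1.5), rnorm(900, mean = 0))

# Sweep the classification threshold and record precision and recall
thresholds <- seq(min(score), max(score), length.out = 200)
precision <- recall <- numeric(length(thresholds))
for (i in seq_along(thresholds)){
  pred_pos <- score >= thresholds[i]
  precision[i] <- sum(pred_pos & truth == "positive") / sum(pred_pos)
  recall[i] <- sum(pred_pos & truth == "positive") / sum(truth == "positive")
}

plot(recall, precision, type = "l", xlab = "Recall", ylab = "Precision")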

Precision and recall curves (2)

Here, we have turned down the sensitivity of the classifier, so it is less likely to classify an observation as positive.


                  Predicted Positive   Predicted Negative
Actual Positive                   50                   50
Actual Negative                    0                  900

Precision and recall curves (3)

Now, we ramp the sensitivity up, so the classifier is more likely to classify an observation as positive.


                  Predicted Positive   Predicted Negative
Actual Positive                  100                    0
Actual Negative                  200                  700

Precision and recall curves (4)

ROC curves (1)

The receiver operating characteristic (ROC) curve is another way of visualising the performance of a classifier. It plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold for classifying an observation as positive is varied.


See notes
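
A minimal sketch in the same spirit as the precision-recall example, again with invented scores; only the quantities recorded at each threshold change.

set.seed(123)

# Invented scores: 100 positives and 900 negatives
truth <- factor(c(rep("positive", 100), rep("negative", 900)))
score <- c(rnorm(100, mean = 1.5), rnorm(900, mean = 0))

# Sweep the threshold and record the TPR and FPR
thresholds <- seq(min(score), max(score), length.out = 200)
tpr <- fpr <- numeric(length(thresholds))
for (i in seq_along(thresholds)){
  pred_pos <- score >= thresholds[i]
  tpr[i] <- sum(pred_pos & truth == "positive") / sum(truth == "positive")
  fpr[i] <- sum(pred_pos & truth == "negative") / sum(truth == "negative")
}

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # random guessing sits on the diagonal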

ROC curves (2)

Probabilistic predictions

  • So far, we have assumed that the classifier outputs a deterministic prediction. How useful is this in practice?


  • A probabilistic prediction offers far more insight into uncertainty.


  • There are two main metrics that consider the probabilistic output of a classifier: the Brier score and cross-entropy loss.
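
A minimal sketch of both metrics in R, with invented binary outcomes and predicted probabilities; lower values are better for both, and cross-entropy punishes confident mistakes particularly harshly.

set.seed(123)

# Invented outcomes (0/1) and predicted probabilities of class 1
y <- rbinom(100, 1, 0.3)
p <- pmin(pmax(y * 0.7 + runif(100, -0.3, 0.3), 0.01), 0.99)

# Brier score: mean squared difference between probability and outcome
mean((p - y)^2)

# Cross-entropy (log) loss
-mean(y * log(p) + (1 - y) * log(1 - p))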

Metrics for continuous predictions (1)

We have zero chance of getting a perfect prediction of a continuous variable. How do we measure the performance of a regression model? Here are some similar questions:


  • How do we measure the distance between two points?
  • How do we measure the spread of a set of points?
  • How long is a piece of string?


  • See notes
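
For reference, here are minimal R implementations of the metrics used in the tables that follow; these are the standard definitions, with MAPE expressed as a percentage.

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
mae  <- function(y, y_hat) mean(abs(y - y_hat))
mape <- function(y, y_hat) 100 * mean(abs((y - y_hat) / y))   # breaks down when y is near zero
r_squared <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)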

Metrics for continuous predictions (2)

Now that we have a battery of metrics, let’s see what they come up with for a simple regression model.


We’ll generate some data from

\[ y = 2x + \sin(x) + \epsilon, \]

where \(\epsilon \sim N(0, 1)\) with the \(x\) values ranging from 0 to 10.


We will fit three linear models: constant, linear and quintic.
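
A minimal sketch of this set-up in R; the seed, sample size and design of the \(x\) values are assumptions, so the exact numbers will differ from the tables that follow.

set.seed(123)

# Generate training data from y = 2x + sin(x) + epsilon
n <- 100
x <- runif(n, 0, 10)
y <- 2 * x + sin(x) + rnorm(n)

# Constant, linear and quintic fits
fit_constant <- lm(y ~ 1)
fit_linear   <- lm(y ~ x)
fit_quintic  <- lm(y ~ poly(x, 5))

# In-sample RMSE for each model
sapply(list(fit_constant, fit_linear, fit_quintic),
       function(m) sqrt(mean(residuals(m)^2)))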

Metrics for continuous predictions (3)

Metrics for continuous predictions (4)



Model      RMSE   MAE    MAPE     R-squared
Constant   6.16   5.52   127.65   0
Linear     1.25   1.02   18.03    0.96
Quintic    0.96   0.76   15.53    0.98

Metrics for continuous predictions (4)



Model      RMSE   MAE    MAPE     R-squared
Constant   6.16   5.52   127.65   0
Linear     1.25   1.02   18.03    0.96
Quintic    0.96   0.76   15.53    0.98
Duodenic   0.88   0.66   11.32    0.98

Metrics for continuous predictions (5)

Have we forgotten about test data?

Let’s generate 1,000 test data points and see how the models perform.
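
Continuing the earlier sketch, an out-of-sample evaluation might look like the following; the exact numbers will differ from the table below because the data-generation choices above were assumptions.

# 1,000 fresh test points from the same data-generating process
x_new <- runif(1000, 0, 10)
y_new <- 2 * x_new + sin(x_new) + rnorm(1000)

# Out-of-sample RMSE for each model
sapply(list(fit_constant, fit_linear, fit_quintic),
       function(m) sqrt(mean((y_new - predict(m, data.frame(x = x_new)))^2)))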


Model      RMSE   MAE    MAPE
Constant   5.81   5.09   467.92
Linear     1.19   0.96   38.96
Quintic    1.07   0.84   60.47
Duodenic   1.08   0.86   52.40

End of section

Overfitting


Source: xkcd.com/2048

Perfect prediction (1)


  • Underfitting: the model is too simple to capture the underlying structure of the data.
  • Overfitting: the model is too complex and captures the noise in the data.
  • Just right: the model captures the underlying structure of the data.


  • Do I believe in my model?

Perfect prediction (2)

Perfect prediction (3)

Perfect prediction (3)

Perfect prediction (4)

Perfect prediction (5)

Perfect prediction (6)

Avoiding overfitting

Three main strategies:


  • Regularisation: add a penalty term to the loss function that penalises complex models.


  • Early stopping: stop training the model when the validation error starts to increase.


  • Cross-validation: repeatedly use different subsets of the data to train and validate/test the model.
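
As a small illustration of the first strategy in the list above, here is a hedged sketch of ridge (L2-penalised) regression using lm.ridge from MASS on a deliberately over-flexible polynomial fit; the data and the lambda values are arbitrary choices.

library(MASS)   # lm.ridge() provides ridge (L2-penalised) regression

set.seed(123)
x <- runif(100, 0, 10)
y <- 2 * x + sin(x) + rnorm(100)

# A deliberately over-flexible degree-10 polynomial fit
unpenalised <- lm(y ~ poly(x, 10))

# The same model with increasingly heavy penalties on the coefficients
penalised <- lm.ridge(y ~ poly(x, 10), lambda = c(0, 10, 100))

# Larger values of lambda shrink the coefficients towards zero
coef(unpenalised)
coef(penalised)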

Adjusted \(R^2\)

This is a basic form of regularisation.


See notes
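
For reference, for a model with \(p\) predictors fitted to \(n\) observations,

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}, \]

so an extra predictor only increases the adjusted value if it improves the fit by more than we would expect by chance.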

Model selection criteria

You may have already heard of the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These are model selection criteria that directly penalise the complexity of the model.


See notes
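
For reference, with \(k\) estimated parameters, \(n\) observations and maximised likelihood \(\hat{L}\),

\[ \text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln(n) - 2\ln\hat{L}. \]

Lower values are better in both cases, and BIC penalises extra parameters more harshly than AIC whenever \(\ln(n) > 2\), that is, for \(n \geq 8\).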

Early stopping (1)

This strategy is most useful when we are training our model in a sequential fashion.


The idea is to keep track of errors in a test set whilst we add data to a model.

Early stopping (2)

Early stopping (3)

Let’s train in batches of 10 observations and see how the error changes.

set.seed(124)

# Randomly shuffle the training data
train_index <- sample(1:nrow(train_data),
                      nrow(train_data))

# Fit the LDA model
lda_model_1 <- lda(Species ~ .,
                   data = train_data[train_index[1:10], ])

# Predict the test data
lda_pred_1 <- predict(lda_model_1, test_data)

# Evaluate accuracy
con_mat <- table(lda_pred_1$class, test_data$Species)
sum(diag(con_mat)) / sum(con_mat)
## [1] 0.5866667

Early stopping (4)

End of section

Cross validation


Source: Created using the Image Creator in Bing

Train-test-validate revisited


\(k\)-fold cross validation

Have you done any bootstrapping before? This is an (almost) deterministic version of that.



Source: Cross-validation_(statistics) on Wikipedia (Creative Commons BY-SA 4.0)

Leave-one-out cross validation

Have you heard of the jackknife estimator? This is a model-fitting version of that.



Source: Cross-validation_(statistics) on Wikipedia (Creative Commons BY-SA 4.0)

Implementation (1)

Perhaps you can imagine slicing up the data and looping through model fits on each slice in R.

# Select 10 training sets (each a 90/10 split)
train <- NULL
random_index <- sample(1:nrow(iris), nrow(iris))
for (i in 1:10){
  # Drop the i-th block of 15 observations to form the training set
  train[[i]] <- iris[random_index[-(((i - 1) * 15 + 1):(i * 15))], ]
}

# Fit 10 LDA models
lda_models <- NULL
for (i in 1:10){
  lda_models[[i]] <- lda(Species ~ .,
                         data = train[[i]])
}

Implementation (2)

Perhaps you can imagine slicing up the data and looping through model fits on each slice in R.

# Select 10 training sets (each a 90/10 split)
train <- NULL
random_index <- sample(1:nrow(iris), nrow(iris))
for (i in 1:10){
  # Drop the i-th block of 15 observations to form the training set
  train[[i]] <- iris[random_index[-(((i - 1) * 15 + 1):(i * 15))], ]
}

# Fit 10 LDA models
lda_models <- lapply(train,
                     function(x) lda(Species ~ .,
                                     data = x))

Implementation (3)

You will be pleased to know that there is a function in R that does this for you.

library(caret)

# Fit 10 LDA models (savePredictions lets us extract the
# fold-level predictions later)
lda_cv <- train(Species ~ .,
                data = iris,
                method = "lda",
                trControl = trainControl(method = "cv",
                                         number = 10,
                                         savePredictions = TRUE))

And it can currently work with 239 different model types.
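
If you would rather use leave-one-out cross validation, it is a one-line change to the resampling scheme (the “LOOCV” method name comes from the caret documentation):

# Leave-one-out CV: each observation is held out in turn
lda_loocv <- train(Species ~ .,
                   data = iris,
                   method = "lda",
                   trControl = trainControl(method = "LOOCV"))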

Using the CV results (1)


# Extract the predictions for each of the 10 fits
all_predictions <- lda_cv$pred

# Convert the Resample column to an integer that is the final two 
# digits of the entry
all_predictions$Resample <- as.integer(substr(all_predictions$Resample,
                                              5,6))

# Calculate the accuracy of each of the 10 fits
accuracy_ <- NULL
for (i in 1:10){
  con_mat <- table(all_predictions$pred[all_predictions$Resample == i],
                   all_predictions$obs[all_predictions$Resample == i])
  accuracy_[i] <- sum(diag(con_mat)) / sum(con_mat)
}

Using the CV results (2)

Using the CV results (3)

Using the CV results (4)

Using the CV results (5)

Using the CV results (6)

End of chapter