“Statistical learning, with its roots firmly in probability and statistics, seeks to unveil the hidden structure within data. By building interpretable models, it allows us to not only make predictions but also understand the relationships between variables and even draw causal inferences. Think of it this way: statistical learning, like a skilled detective, meticulously examines the data to uncover the underlying story.”
“Machine learning (ML), however, takes a more pragmatic approach. Borrowing heavily from computer science, it prioritises the creation of algorithms that excel at making accurate predictions on entirely new data. While interpretability might be sacrificed in the process, the raw predictive power gained can be immense. Complex algorithms like neural networks exemplify this philosophy – they may be opaque in their inner workings, but their ability to learn intricate patterns from data allows them to make impressive predictions.”
Here’s a figure from my most recent paper…
We have a finite set of data points: how do we best use it to train the model, pick the correct model settings and check its generalisation performance?
Let’s look at an example of LDA applied again to the iris dataset. Because LDA’s strong assumptions leave us with no tuning parameters to choose, we don’t really need to worry about having a separate validation set; a simple train/test split will do.
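The split itself isn’t shown on this slide, so here is a minimal sketch of what the code below assumes: a random 50/50 split of the 150 iris rows, with MASS loaded for lda(). The seed is my own choice, so the exact class counts may differ slightly from those shown.

library(MASS)   # provides lda()

set.seed(123)   # assumed seed; any value will do
# Randomly assign 75 observations to training and the remaining 75 to test
train_index <- sample(1:nrow(iris), 75)
train_data  <- iris[train_index, ]
test_data   <- iris[-train_index, ]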
How about the 75 test observations?
# Fit the LDA model
lda_model <- lda(Species ~ ., data = train_data)
# Predict the test data
lda_pred <- predict(lda_model, test_data)
# Confusion matrix
table(lda_pred$class, test_data$Species)
##
##              setosa versicolor virginica
##   setosa         29          0         0
##   versicolor      0         24         0
##   virginica       0          0        22
What if we split the data differently and hoped for the best?
How about the 75 test observations?
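The code that triggers the warning below isn’t shown; one split that reproduces it (an assumption on my part, but consistent with the warning and with the test counts that follow) is a non-random split that takes the first 75 rows for training. Since iris is ordered by species, virginica then never appears in the training set.

# Non-random split: first 75 rows for training, last 75 for test
train_data <- iris[1:75, ]     # 50 setosa + 25 versicolor, no virginica
test_data  <- iris[76:150, ]   # 25 versicolor + 50 virginica
# Fit the LDA model (this is the step that emits the warning)
lda_model <- lda(Species ~ ., data = train_data)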
## Warning in lda.default(x, grouping, ...): group virginica is empty
# Predict the test data
lda_pred <- predict(lda_model, test_data)
# Confusion matrix
table(lda_pred$class, test_data$Species)
##
##              setosa versicolor virginica
##   setosa          0          0         0
##   versicolor      0         25        50
##   virginica       0          0         0
Apart from LDA, we consider two other baseline classifiers:
Apart from linear regression, we consider two other baseline regression models:
Imagine we have a binary classification problem with two classes: positive and negative. The confusion matrix is a 2x2 matrix that summarises the performance of a classifier.
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
The confusion matrix can be extended to multi-class classification problems. For a \(k\)-class problem, the confusion matrix is a \(k\)x\(k\) matrix. The diagonal elements represent the number of correct predictions for each class, but this can get a bit unwieldy for large \(k\).
The statistics contained in the confusion matrix can be used to calculate a number of performance metrics.
See notes
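As a rough sketch of how these metrics fall out of the four cells (the counts and object names here are my own, purely for illustration):

# Illustrative counts (assumed values)
TP <- 29; FN <- 1; FP <- 2; TN <- 43

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)   # of the predicted positives, how many are real
recall      <- TP / (TP + FN)   # of the actual positives, how many we caught
specificity <- TN / (TN + FP)   # of the actual negatives, how many we caught
f1          <- 2 * precision * recall / (precision + recall)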
Consider a binary classification problem with 95 data points belonging to the negative class and 5 belonging to the positive class. A classifier that always predicts negative will achieve an accuracy of 95%.
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 0                  | 5                  |
| Actual Negative | 0                  | 95                 |
The precision and recall of the classifier with respect to the negative class are 95% and 100% respectively; for the positive class, precision is undefined (no positive predictions were made) and recall is 0%.
If we can vary the threshold for classifying an observation as positive, we can plot the precision and recall of the classifier as the threshold changes. This is known as the precision-recall curve.
Here, we have turned down the sensitivity of the classifier, so it is less likely to classify an observation as positive.
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 50                 | 50                 |
| Actual Negative | 0                  | 900                |
Now, we ramp the sensitivity up, so the classifier is more likely to classify an observation as positive.
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 100                | 0                  |
| Actual Negative | 200                | 700                |
The receiver operating characteristic (ROC) curve is another way of visualising the performance of a classifier. It plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold for classifying an observation as positive is varied.
See notes
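A minimal base-R sketch of how a ROC curve can be traced out by sweeping the threshold; the labels and scores are simulated, purely for illustration.

set.seed(1)
# Simulated class labels and classifier scores (positives score higher on average)
labels <- rbinom(200, 1, 0.3)
scores <- rnorm(200, mean = labels)

# Sweep the threshold from high to low and record TPR and FPR at each value
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve")
abline(0, 1, lty = 2)   # chance line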
We have zero chance of getting a perfect prediction of a continuous variable. How do we measure the performance of a regression model? Here are some similar questions:
Now that we have a battery of metrics, let’s see what they come up with for a simple regression model.
We’ll generate some data from
\[ y = 2x + \sin(x) + \epsilon, \]
where \(\epsilon \sim N(0, 1)\) with the \(x\) values ranging from 0 to 10.
We will fit three linear models: constant, linear and quintic.
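A sketch of how this could be set up; the sample size, seed and helper function are my own choices, so the numbers will not match the tables exactly.

set.seed(42)
n <- 100
x <- seq(0, 10, length.out = n)
y <- 2 * x + sin(x) + rnorm(n)

# Constant, linear and quintic fits (all linear in the parameters)
fit_const   <- lm(y ~ 1)
fit_linear  <- lm(y ~ x)
fit_quintic <- lm(y ~ poly(x, 5))

# Training-set metrics for a fitted model
metrics <- function(obs, pred) {
  c(RMSE = sqrt(mean((obs - pred)^2)),
    MAE  = mean(abs(obs - pred)),
    MAPE = 100 * mean(abs((obs - pred) / obs)))
}
metrics(y, fitted(fit_quintic))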
| Model    | RMSE | MAE  | MAPE   | R-squared |
|----------|------|------|--------|-----------|
| Constant | 6.16 | 5.52 | 127.65 | 0         |
| Linear   | 1.25 | 1.02 | 18.03  | 0.96      |
| Quintic  | 0.96 | 0.76 | 15.53  | 0.98      |
| Model    | RMSE | MAE  | MAPE   | R-squared |
|----------|------|------|--------|-----------|
| Constant | 6.16 | 5.52 | 127.65 | 0         |
| Linear   | 1.25 | 1.02 | 18.03  | 0.96      |
| Quintic  | 0.96 | 0.76 | 15.53  | 0.98      |
| Duodenic | 0.88 | 0.66 | 11.32  | 0.98      |
Have we forgotten about test data?
Let’s generate 1,000 test data points and see how the models perform.
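Continuing the sketch above (again with assumed seeds and the same helper function), the fitted models can be scored on data they have never seen:

# 1,000 fresh points from the same generative model
x_new  <- runif(1000, 0, 10)
y_new  <- 2 * x_new + sin(x_new) + rnorm(1000)
newdat <- data.frame(x = x_new)

metrics(y_new, predict(fit_linear,  newdata = newdat))
metrics(y_new, predict(fit_quintic, newdata = newdat))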
| Model    | RMSE | MAE  | MAPE   |
|----------|------|------|--------|
| Constant | 5.81 | 5.09 | 467.92 |
| Linear   | 1.19 | 0.96 | 38.96  |
| Quintic  | 1.07 | 0.84 | 60.47  |
| Duodenic | 1.08 | 0.86 | 52.4   |
Three main strategies:
This is a basic form of regularisation.
See notes
You may have already heard of the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These are model selection criteria that directly penalise the complexity of the model.
See notes
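For a model with \(k\) estimated parameters, maximised likelihood \(\hat{L}\) and \(n\) observations, \(\mathrm{AIC} = 2k - 2\log\hat{L}\) and \(\mathrm{BIC} = k\log n - 2\log\hat{L}\); lower is better for both, and BIC penalises extra parameters more harshly. In R both are available directly for fitted models; a quick sketch using the polynomial fits assumed earlier:

# Compare the information criteria of the three fits
AIC(fit_const, fit_linear, fit_quintic)
BIC(fit_const, fit_linear, fit_quintic)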
This strategy is most useful when we are training our model in a sequential fashion.
The idea is to keep track of errors in a test set whilst we add data to a model.
Let’s train in batches of 10 observations and see how the error changes.
set.seed(124)
# Randomly shuffle the row indices of the training data
train_index <- sample(1:nrow(train_data), nrow(train_data))
# Fit the LDA model on the first batch of 10 observations
lda_model_1 <- lda(Species ~ ., data = train_data[train_index[1:10], ])
# Predict the test data
lda_pred_1 <- predict(lda_model_1, test_data)
# Evaluate accuracy
con_mat <- table(lda_pred_1$class, test_data$Species)
sum(diag(con_mat)) / sum(con_mat)
## [1] 0.5866667
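Extending this to the full sequence of batches gives a simple learning curve; a sketch under the split assumed earlier (the loop structure and object names are mine):

# Test-set accuracy as successive batches of 10 training observations are added
batch_ends <- seq(10, nrow(train_data), by = 10)
accuracy_seq <- sapply(batch_ends, function(m) {
  fit  <- lda(Species ~ ., data = train_data[train_index[1:m], ])
  pred <- predict(fit, test_data)
  mean(pred$class == test_data$Species)
})
plot(batch_ends, accuracy_seq, type = "b",
     xlab = "Training observations used", ylab = "Test accuracy")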
Have you done any bootstrapping before? This is an (almost) deterministic version of that.
Have you heard of the jackknife estimator? This is a model-fitting version of that.
Perhaps you can imagine slicing up the data and looping through model fits on each slice in R.
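A hand-rolled 10-fold version might look like the following (the fold assignment and object names are illustrative):

set.seed(321)
# Assign each row of iris to one of 10 folds at random
folds <- sample(rep(1:10, length.out = nrow(iris)))

# For each fold, train on the other nine and test on the held-out fold
cv_accuracy <- sapply(1:10, function(k) {
  fit  <- lda(Species ~ ., data = iris[folds != k, ])
  pred <- predict(fit, iris[folds == k, ])
  mean(pred$class == iris$Species[folds == k])
})
mean(cv_accuracy)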
You will be pleased to know that there is a function in R that does this for you.
library(caret)
# Fit 10 LDA models, one per fold, keeping the hold-out predictions
# (savePredictions is needed so that lda_cv$pred is populated below)
lda_cv <- train(Species ~ .,
                data = iris,
                method = "lda",
                trControl = trainControl(method = "cv",
                                         number = 10,
                                         savePredictions = TRUE))
And it currently can work with 239 different model types.
# Extract the hold-out predictions for each of the 10 fits
all_predictions <- lda_cv$pred
# Convert the Resample column ("Fold01", ..., "Fold10") to an integer
# taken from the final two digits of the entry
all_predictions$Resample <- as.integer(substr(all_predictions$Resample, 5, 6))
# Calculate the accuracy of each of the 10 fits
accuracy_ <- numeric(10)
for (i in 1:10) {
  con_mat <- table(all_predictions$pred[all_predictions$Resample == i],
                   all_predictions$obs[all_predictions$Resample == i])
  accuracy_[i] <- sum(diag(con_mat)) / sum(con_mat)
}
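The ten fold-level accuracies can then be summarised in the usual way, for example:

mean(accuracy_)   # cross-validated estimate of accuracy
sd(accuracy_)     # fold-to-fold variability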