MLNN classification

JP Gosling

2024-10-29

Running examples


Source: Created using the Image Creator in Bing

Handwriting classification (1)-(5)


[Figure slides: example images of handwritten digits from the data set]

Handwriting classification (6)


We have a training set of 2,000 digits and a test set of 1,000 digits.


The digits are 28x28 pixels in size giving us 784 variables.


There is massive correlation in the variables, but the algorithms presented here will tend to ignore this.
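As a quick illustration (a sketch only; the pixel column indices below are arbitrary and assume the first column of MNIST_train holds the label):

# Correlations between a few neighbouring pixel variables near the image centre
pixels <- as.matrix(MNIST_train[, -1])
round(cor(pixels[, 400:405]), 2)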

Terrible fake data


We also have a simulated data set with two explanatory variables and four classes.

End of section

\(k\)-nearest neighbours


Source: Created using the Image Creator in Bing

The algorithm

The \(k\)-nearest neighbours algorithm is a lazy algorithm. This means that it does not learn a model from the training data. Instead, it classifies new data points directly from the training data at prediction time (a from-scratch sketch is given after the steps below).


  1. Compute the distance between the new input and each training input.
  2. Sort the distances in ascending order.
  3. Select the \(k\) nearest neighbours of the new input.
  4. Classify the new input based on the majority class of the neighbours.
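
As a minimal from-scratch sketch of these four steps (this is not the class::knn() implementation used later; knn_classify, the Euclidean distance and the first-match tie-breaking are illustrative assumptions):

# Classify a single new input by following the four steps above
knn_classify <- function(train_x, train_y, new_x, k = 3) {
  # 1. Euclidean distance from the new input to every training input
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, unlist(new_x))^2))
  # 2. and 3. Sort the distances and keep the k nearest neighbours
  neighbours <- train_y[order(dists)[1:k]]
  # 4. Majority vote over the neighbours (ties go to the first class encountered)
  names(which.max(table(neighbours)))
}

# For example: knn_classify(MNIST_train[, -1], MNIST_train$label, MNIST_test[1, -1], k = 3)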

Choices

The \(k\)-nearest neighbours algorithm has two main choices:


  • The distance metric to use (see the small illustration after this list).


  • The number of neighbours to consider.


  • (Plus a technicality about ties for the majority.)
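
As a small illustration of the first choice, the same pair of points is a different distance apart under different metrics (a sketch using base R's dist(); the points are arbitrary):

# Two points compared under two distance metrics
pts <- rbind(c(0, 0), c(3, 4))
dist(pts, method = "euclidean")  # 5
dist(pts, method = "manhattan")  # 7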

Handwriting classification (1)

library(class)

# Classify the test digits (k-NN has no separate training step)
knn_model <- knn(train = MNIST_train[, -1], 
                 test = MNIST_test[, -1], 
                 cl = MNIST_train$label, 
                 k = 3)

# Compute the accuracy
sum(knn_model == MNIST_test$label) / length(knn_model)
## [1] 0.868
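
A confusion matrix for these k-NN predictions can be produced in the same way as for the classifiers later in the chapter (output not shown):

# Cross-tabulate the true test labels against the k-NN predictions
table(MNIST_test$label, knn_model)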

Handwriting classification (2)

Handwriting classification (3)

Handwriting classification (4)

library(caret)

# 20-fold cross-validation (each fold holds out 100 of the 2,000 training digits)
knn_cv <- train(x = MNIST_train[, -1], 
                y = as.factor(MNIST_train$label), 
                method = "knn", 
                tuneGrid = expand.grid(k = 1:10), 
                trControl = trainControl(method = "cv",
                                         number = 20))

# Plot the results
plot(knn_cv)

Handwriting classification (5)


[Plot of cross-validated accuracy against \(k\)]

Handwriting classification (6)

knn_cv
## k-Nearest Neighbors 
## 
## 2000 samples
##  784 predictor
##   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 1900, 1900, 1902, 1898, 1900, 1901, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9124143  0.9025703
##    2  0.8979975  0.8865268
##    3  0.9075030  0.8971012
##    4  0.9020126  0.8909903
##    5  0.9009925  0.8898518
##    6  0.8959968  0.8842878
##    7  0.8930119  0.8809638
##    8  0.8939970  0.8820677
##    9  0.8914812  0.8792735
##   10  0.8940216  0.8820880
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
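
Having selected \(k\) by cross-validation, the caret object can be used directly to classify the test digits; this is a sketch using caret's bestTune and predict() (the resulting accuracy is not shown here):

# The value of k chosen by cross-validation
knn_cv$bestTune

# Predict the test digits with the selected k and compute the accuracy
knn_cv_predict <- predict(knn_cv, MNIST_test[, -1])
sum(knn_cv_predict == MNIST_test$label) / length(knn_cv_predict)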

Terrible fake data (1)

library(caret)

# 20-fold cross-validation
knn_cv <- train(x = X[, -3], 
                y = X$Class, 
                method = "knn", 
                tuneGrid = expand.grid(k = 1:10), 
                trControl = trainControl(method = "cv",
                                         number = 20))

# Plot the results
plot(knn_cv)

Terrible fake data (2)


[Plot of cross-validated accuracy against \(k\)]

Terrible fake data (3)

knn_cv
## k-Nearest Neighbors 
## 
## 202 samples
##   2 predictor
##   4 classes: '1', '2', '3', '4' 
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 191, 191, 193, 193, 191, 193, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.6426894  0.5198390
##    2  0.6090530  0.4715523
##    3  0.6311237  0.5020650
##    4  0.5773232  0.4303253
##    5  0.5945076  0.4547095
##    6  0.5940530  0.4522104
##    7  0.5987753  0.4593627
##    8  0.6131944  0.4783842
##    9  0.6088258  0.4702154
##   10  0.6002399  0.4614984
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.

Terrible fake data (4)

End of section

Naive Bayes classifier


Source: xkcd.com/2059

The basic idea


In the absence of any information, what proportion of the data belongs to each class?


If it did belong to class A, how probable is it that the explanatory variables would look like this?


See notes

The algorithm


The Naive Bayes classifier is an eager algorithm. This means that it learns a model from the training data before making any predictions. It is also probabilistic in nature, but most implementations will return only a class label. A formula combining the steps is given after the list below.


  1. Estimate the class probabilities.
  2. Estimate the conditional probabilities of the explanatory variables given the class.
  3. Use Bayes’ theorem to estimate the probability of the class given the explanatory variables for new data.
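
Writing step 3 out, the "naive" part of the name is the assumption that the explanatory variables are conditionally independent given the class, so that

\[ P(y = c \mid x_1, \ldots, x_p) \propto P(y = c) \prod_{j=1}^{p} P(x_j \mid y = c). \]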

Choices


The Naive Bayes classifier has one main choice:


  • The distributions of the explanatory variables given the class.


  • (Plus altering the initial class probabilities.)

Simple example (1)


\(x_1\) \(x_2\) \(y\)
1.5 A Positive
1.7 B Positive
1.3 A Positive
1.9 B Positive
2.1 A Positive
2.3 B Negative
2.5 A Negative
2.7 B Negative
1.9 A Negative

Simple example (2)


We have estimated class prior probabilities of \(P(\text{Positive}) = 5/9\) and \(P(\text{Negative}) = 4/9\).



We have also estimated the likelihoods of the variables as follows: \[ \begin{aligned} x_1 | \text{Positive}~ &\sim N(1.7, 0.1),\\ x_1 | \text{Negative} &\sim N(2.35, 0.1),\\ x_2 | \text{Positive}~ &\sim \text{Multinomial}(P_A = 0.6, P_B = 0.4),\\ x_2 | \text{Negative} &\sim \text{Multinomial}(P_A = 0.5, P_B = 0.5).\\ \end{aligned} \]

Simple example (3)


We can now estimate the probability of each new data point falling into the positive class (the first value is checked in the sketch after the table).


\(x_1\) \(x_2\) \(P(\text{Positive}|x)\)
1.8 A 0.87
2.2 B 0.24
2.0 B 0.54
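
As a check on the first row of this table, the calculation can be done directly in R; the sketch below assumes the second parameter of each normal distribution above is a variance, which reproduces the 0.87:

# Unnormalised posterior weights for the new point x = (1.8, "A")
pos <- (5 / 9) * dnorm(1.8, mean = 1.7, sd = sqrt(0.1)) * 0.6
neg <- (4 / 9) * dnorm(1.8, mean = 2.35, sd = sqrt(0.1)) * 0.5

# Normalise to get P(Positive | x)
pos / (pos + neg)  # approximately 0.87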

Handwriting classification (1)

library(e1071)

# Train the model
nb_model <- naiveBayes(x = MNIST_train[, -1], 
                       y = as.factor(MNIST_train$label))
                       
# Summarise
nb_model$apriori
## as.factor(MNIST_train$label)
##   0   1   2   3   4   5   6   7   8   9 
## 191 220 198 191 214 180 200 224 172 210
# Predict for test data
nb_predict <- predict(nb_model, MNIST_test[, -1])
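
Although predict() returns class labels by default, the fitted model does provide posterior class probabilities, which can be requested with e1071's type = "raw" option (output not shown):

# Posterior class probabilities for the test digits
nb_probs <- predict(nb_model, MNIST_test[, -1], type = "raw")
head(round(nb_probs, 2))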

Handwriting classification (2)

# Calculate accuracy
sum(nb_predict == MNIST_test$label) / length(nb_predict)
## [1] 0.498
# Confusion matrix
table(MNIST_test$label, nb_predict)
##    nb_predict
##       0   1   2   3   4   5   6   7   8   9
##   0  74   1   0   0   0   1   2   0   7   0
##   1   0 125   0   0   0   0   1   0   0   0
##   2   6  22  20   5   0   4  18   0  41   0
##   3   9  20   1  26   0   5   7   2  32   5
##   4   5  10   0   0  18   0  12   1  26  38
##   5  10  13   1   0   2   1   4   0  54   2
##   6   4   5   1   0   0   0  72   0   5   0
##   7   2  24   0   1   2   1   1  29  13  26
##   8   1  19   1   1   2   2   0   0  60   3
##   9   1  13   0   0   1   0   1   0   5  73
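
The diagonal of this table contains the correctly classified digits, so per-class accuracies can be read off; tab is an assumed name for the stored table:

# Per-class accuracy from the confusion matrix
tab <- table(MNIST_test$label, nb_predict)
round(diag(tab) / rowSums(tab), 2)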

Terrible fake data (1)

# Train the model
nb_model <- naiveBayes(x = X[, -3], 
                       y = X$Class)

# Summarise
nb_model$apriori
## X$Class
##  1  2  3  4 
## 50 50 50 52
# Predict for test data
nb_predict <- predict(nb_model, X_test[, -3])
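
The accuracy on the fake test data can then be computed as before; this sketch assumes X_test carries the same Class column as X (value not shown):

# Calculate accuracy on the fake test data
sum(nb_predict == X_test$Class) / length(nb_predict)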

Terrible fake data (2)

Terrible fake data (3)


Terrible fake data (4)

End of section

Decision trees


Source: xkcd.com/518

End of section

Perceptron


Source: xkcd.com/1838

End of section

Balancing classes


Source: Created using the Image Creator in Bing

End of chapter