MLNN preliminaries

JP Gosling

2024-10-07

Front matter


Prof. John Paul Gosling

Email: john-paul.gosling@durham.ac.uk

Office: MSC3068

Office hours: Tuesdays 1300-1500

A disclaimer

This part of the course is brand new. There will be errors as we go along.

I have put the course notes on Ultra: they will be updated as we go, but I will lock them down after we have completed a chapter and I have made the necessary edits.

All slides will also be appearing as we go.

Aims of the course

By the end of this term, I want you to be able to:

  • Understand some common terms in machine learning.
  • Use R to apply some of the most common machine learning techniques.
  • Evaluate the performance of a machine learning model.
  • Appreciate that many difficult questions arise in practice and that these are lurking behind the scenes in the R package implementations.

Practicals, assignments and exams

  • Practical sessions will be held in weeks 3, 5, 7 and 9.

  • You will need access to a laptop with R and RStudio installed.


  • There will be four formative assignments.


  • There will be a short practical exam in January.


  • There will be a final exam in May/June.

Course structure

The general structure for this half of the course will be as follows:

  1. key concepts in statistics for ML,
  2. fundamental ML concepts,
  3. classification techniques,
  4. regression techniques,
  5. making use of multiple models,
  6. making ML models more interpretable,
  7. unsupervised learning.

End of section

Some notation


Source: xkcd.com/2343

Data matrices

Throughout this course, we will be considering data in the form of a matrix:

\[ X = (x_{ij}) = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \]


where \(x_{ij}\) is the value of the \(j\)th variable for the \(i\)th observation. Here, we have \(n\) observations and \(p\) variables.
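In R, such a matrix is often obtained from the numeric columns of a data frame. The sketch below uses the iris data, which reappears later in this chapter, dropping the class labels.

# The four numeric columns of iris as an n-by-p data matrix
X <- as.matrix(iris[, 1:4])

dim(X)   # n = 150 observations, p = 4 variables
X[1, 2]  # x_12: the value of the 2nd variable for the 1st observation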

Sets of data

We will typically be considering elements of the following sets:

\[ \mathcal{Z} = \mathcal{X} \times \mathcal{Y}, \]

where \(\mathcal{Z}\) is the set of all data points, \(\mathcal{X}\) is the set of all input data points (with individual elements usually encoded in a matrix), and \(\mathcal{Y}\) is the set of all output data points (with individual elements encoded in a vector).

Matrix forms

We have a number of matrix-based formulae and concepts to keep in mind for later manipulations and derivations.


See notes

Univariate measures of variability (1)

Consider our covariance matrix \(S\).

  • Its trace is a proxy for the total variability in the data.

  • The square root of its determinant is a proxy for the volume of the data cloud.

Both are easily computed in R.

# Compute the sample covariance matrix (assuming X is a numeric data matrix)
S <- cov(X)

# The trace of matrix S
sum(diag(S))

# The square root of the determinant of S
sqrt(det(S))

Univariate measures of variability (2)

Consider two data sets that are generated from nearly the same distribution. For the first data set, we have:

\[\begin{align*} X_1 &\sim N(0,1),\\ U &\sim N(0,1) \text{ (independently)},\\ X_2 &= 0.5X_1 + U. \end{align*}\]

For the second data set, we will simply multiply both variables by 0.5.
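A minimal simulation along these lines is sketched below; the sample size and seed are arbitrary choices, so the resulting figures will only roughly match the table that follows.

# First data set: X1 standard normal, X2 = 0.5 * X1 + U
set.seed(1)
n <- 100
x1 <- rnorm(n)
u <- rnorm(n)
x2 <- 0.5 * x1 + u
data1 <- cbind(x1, x2)

# Second data set: both variables multiplied by 0.5
data2 <- 0.5 * data1

# Total variance (trace) and volume (square root of the determinant)
S1 <- cov(data1)
S2 <- cov(data2)
c(sum(diag(S1)), sqrt(det(S1)))
c(sum(diag(S2)), sqrt(det(S2)))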

Univariate measures of variability (3)

Univariate measures of variability (4)

Univariate measures of variability (5)



Data set   Total variance   Volume of data cloud
1          1.93             0.88
2          0.48             0.22

Univariate measures of variability (6)

The trace of the covariance matrix is clearly the sum of the marginal variances.

The determinant is more subtle. But recall that, for a 2-by-2 covariance matrix, we have

\[ \det(S) = \sigma_1^2\sigma_2^2(1-\rho^2), \]

where \(\sigma_1^2\) and \(\sigma_2^2\) are the marginal variances and \(\rho\) is the correlation coefficient. If we take the square root of this, we get

\[ \sqrt{\det(S)} = \sigma_1\sigma_2\sqrt{1-\rho^2}, \] which is directly related to the area of an ellipse.
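A quick numerical check of this identity, using arbitrary illustrative values for the standard deviations and the correlation:

# Build a 2-by-2 covariance matrix from marginal standard deviations and a
# correlation coefficient
sigma1 <- 1.2
sigma2 <- 0.7
rho <- 0.5
S <- matrix(c(sigma1^2,              rho * sigma1 * sigma2,
              rho * sigma1 * sigma2, sigma2^2), nrow = 2)

# These two quantities agree
sqrt(det(S))
sigma1 * sigma2 * sqrt(1 - rho^2)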

Linear transformations

There are a number of manipulations that help with predictions and understanding the data.

See notes
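As one example of the sort of manipulation covered there, a linear transformation acts on the covariance matrix in a predictable way: if each observation \(x\) is mapped to \(Ax\), the covariance matrix \(S\) becomes \(ASA^\top\). A small check in R, with an arbitrary choice of transformation:

# A small simulated data matrix and its covariance
set.seed(2)
X <- matrix(rnorm(100 * 2), ncol = 2)
S <- cov(X)

# Apply an arbitrary linear transformation to every row
A <- matrix(c(1, 0.5,
              0, 2), nrow = 2, byrow = TRUE)
Y <- X %*% t(A)

# The covariance of the transformed data equals A %*% S %*% t(A)
cov(Y)
A %*% S %*% t(A)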

End of section

Distances


Source: Created using the Image Creator in Bing

Distances

When using machine learning models, we will need to consider

  • the distance between two observations in space,
  • the distance between an observation and a prediction.

There are infinitely many choices for distance measures, and our choice will depend on the context.

Common distance measures


See notes

Euclidean distance

Pearson distance

Manhattan distance
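To make these concrete, here is a small sketch computing each of the three distances between two observations. The data are simulated purely for illustration, and the Pearson distance is taken here to be the Euclidean distance after scaling each variable by its standard deviation; check the notes for the exact definition used in the course.

# Two observations drawn from a simulated data matrix
set.seed(42)
X <- matrix(rnorm(100 * 2), ncol = 2)
a <- X[1, ]
b <- X[2, ]

# Euclidean and Manhattan distances via the built-in dist function
dist(rbind(a, b), method = "euclidean")
dist(rbind(a, b), method = "manhattan")

# One version of the Pearson distance: Euclidean distance on variables
# scaled by their standard deviations
s <- apply(X, 2, sd)
sqrt(sum(((a - b) / s)^2))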

End of section

Linear regression


Source: xkcd.com/1725

The basic model

Following the notation for data matrices, we have the following model:

\[ Y = X\beta + \epsilon, \]

where \(Y\) is the response vector, \(X\) is the data matrix, \(\beta\) is the vector of coefficients, and \(\epsilon\) is the error term.

In \(X\), it is usual to have a column of 1s to account for the intercept.


See notes
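For later reference, minimising the sum of squared errors (the least-squares approach used below) gives the familiar closed-form estimate

\[ \hat{\beta} = \left(X^\top X\right)^{-1} X^\top Y, \]

provided that \(X^\top X\) is invertible.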

Fitting in R (1)

Fitting in R (2)

# Fit a linear model
fit <- lm(y ~ x)

If we have multiple predictor variables stored alongside the response in a data frame (say df), we would use

# Fit a linear model using all other columns of df as predictors
fit <- lm(y ~ ., data = df)

Note that the function lm uses least-squares fitting, which means that we are implicitly working with the Euclidean distance metric.

Fitting in R (3)

# Summarise the fitted model
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14294 -0.72029  0.01953  0.70563  1.88121 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.55664    0.40287   3.864 0.000605 ***
## x            1.90701    0.06294  30.299  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9876 on 28 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:  0.9693 
## F-statistic:   918 on 1 and 28 DF,  p-value: < 2.2e-16

Fitting in R (4)

Some assumptions

Linear regression depends on the following assumptions:

  1. The errors are normally distributed.
  2. The errors are independent.
  3. The errors have constant variance.
  4. A linear relationship exists between the response and the predictors.


See notes
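These assumptions are usually checked informally through residual diagnostics. A minimal sketch using base R, applied to the fit object from the earlier slides:

# The standard diagnostic plots: residuals against fitted values, a normal
# Q-Q plot of the residuals, a scale-location plot and residual leverage
par(mfrow = c(2, 2))
plot(fit)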

Linear model with polynomial terms (1)

Linear model with polynomial terms (2)

Adding a polynomial term is easy in R (and can be done in multiple ways).

# Fit a linear model with a polynomial term
fit <- lm(y2 ~ poly(x1, 2), data = anscombe)

# Fit a linear model with a polynomial term (I() is needed so that ^2 is
# treated as squaring rather than as a formula operator)
fit <- lm(y2 ~ x1 + I(x1^2), data = anscombe)

# Fit a linear model with a polynomial term
anscombe$x1_2 <- anscombe$x1^2
fit <- lm(y2 ~ x1 + x1_2, data = anscombe)

Linear model with polynomial terms (3)

Changing the distance metric
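The details are in the notes. As one illustration (not necessarily the example used there) of what changing the distance metric can mean in practice, replacing the squared (Euclidean) loss with the absolute (Manhattan) loss gives least absolute deviations regression, which can be fitted by direct numerical optimisation. The x and y below are the same simple-regression data used for the lm fit above.

# Least absolute deviations: minimise the sum of absolute errors rather
# than the sum of squared errors
lad_loss <- function(beta, x, y) {
  sum(abs(y - beta[1] - beta[2] * x))
}

# Numerical minimisation, starting from the least-squares estimates
fit_lad <- optim(coef(lm(y ~ x)), lad_loss, x = x, y = y)
fit_lad$par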

Partitioning of variance

It is often useful to understand how much of the variance in the response is explained by the model and how much we can never hope to explain.

When we use the Euclidean distance, we are minimising the sum of the squared errors and this leads us quite naturally to a partitioning of the variance.


See notes
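For a least-squares fit with an intercept, the partition takes the standard form

\[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2, \]

where \(\hat{y}_i\) is the fitted value for the \(i\)th observation: the total sum of squares splits into an explained part and a residual part, and the R-squared reported by summary is the ratio of the explained part to the total.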

What did you expect? (1)

What did you expect? (2)

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27171 -0.18006  0.02867  0.25684  0.74620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.1490     0.1167   1.277   0.2047   
## x1            0.3952     0.1410   2.804   0.0061 **
## x2           -0.1196     0.1523  -0.786   0.4340   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3982 on 97 degrees of freedom
## Multiple R-squared:  0.0843, Adjusted R-squared:  0.06542 
## F-statistic: 4.465 on 2 and 97 DF,  p-value: 0.01396

What did you expect? (3)

What did you expect? (4)

What did you expect? (5)

End of section

Linear discriminant analysis


Source: Created using the Image Creator in Bing

The basic idea

Linear discriminant analysis (LDA) is a long-established technique for slicing the input space into regions that are likely to be associated with different classes.

The algorithm boils down to two steps:

  1. Find a low-dimensional (linear) projection of the data that maximises the separation between the classes.
  2. Slice up the space by finding the hyperplanes that separate the classes.


See notes

LDA in R (1)

We utilise the iris data set to illustrate LDA.

LDA in R (2)

We can use the lda function in the MASS package to fit an LDA model.

# Load MASS and fit the LDA model
library(MASS)
fit <- lda(Species ~ ., data = iris)

LDA in R (3)

fit
## Call:
## lda(Species ~ ., data = iris)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            5.006       3.428        1.462       0.246
## versicolor        5.936       2.770        4.260       1.326
## virginica         6.588       2.974        5.552       2.026
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.8293776 -0.02410215
## Sepal.Width   1.5344731 -2.16452123
## Petal.Length -2.2012117  0.93192121
## Petal.Width  -2.8104603 -2.83918785
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

LDA in R (4)

LDA in R (5)

Some assumptions

LDA depends on the following assumptions:

  1. The data within each class are normally distributed.
  2. The classes have the same covariance matrix.
  3. Each observation is independent of the others.
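Predictions from a fitted lda object can be compared with the true labels to produce confusion matrices like the ones on the following slides (which use a different data set). A minimal sketch with the iris fit from above:

# Predicted classes from the fitted LDA model
pred <- predict(fit, iris)$class

# Confusion matrix: true species (rows) against predicted species (columns)
table(iris$Species, pred)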

A massive failure (1)

A massive failure (2)

A massive failure (3)

We can also create the confusion matrix:

##    
##      1  2  3  4
##   1 15 10 14 11
##   2  5 35  2  8
##   3  0  5 30 15
##   4 11 15 22  4

A massive failure (4)

A massive failure (5)

A massive failure (6)

Again, it is useful to look at the confusion matrix:

##    
##      1  2  3  4
##   1 20  0  0 30
##   2  0  0  0  0
##   3  0  0  0  0
##   4 21  0  0 31

A massive failure (7)

A massive failure (8)

A massive failure (9)

##    
##      1  2  3  4
##   1  0  0  0  0
##   2  0 40 10  0
##   3  0  9 41  0
##   4  0  0  0  0

End of chapter