Prof. John Paul Gosling
Email: john-paul.gosling@durham.ac.uk
Office: MSC3068
Office hours: Tuesdays 1300-1500
This part of the course is brand new. There will be errors as we go along.
I have put the course notes on Ultra: they will be updated as we go, but I will lock them down after we have completed a chapter and I have made the necessary edits.
All slides will also appear as we go.
By the end of this term, I want you to be able to:
Practical sessions will be held in weeks 3, 5, 7 and 9.
You will need access to a laptop with R and RStudio installed.
The general structure for this half of the course will be as follows:
Throughout this course, we will be considering data in the form of a matrix:
\[ X = (x_{ij}) = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \]
where \(x_{ij}\) is the value of the \(j\)th variable for the \(i\)th observation. Here, we have \(n\) observations and \(p\) variables.
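As a minimal illustration in R (the dimensions and values here are arbitrary), a data matrix with \(n = 5\) observations and \(p = 3\) variables can be stored and indexed as follows:

```r
# A small data matrix: n = 5 observations (rows), p = 3 variables (columns)
set.seed(1)
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)
colnames(X) <- c("x1", "x2", "x3")

dim(X)   # returns c(n, p) = c(5, 3)
X[2, 3]  # the value of the 3rd variable for the 2nd observation
```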
We will typically be considering elements of the following sets:
\[ \mathcal{Z} = \mathcal{X} \times \mathcal{Y}, \]
where \(\mathcal{Z}\) is the set of all data points, \(\mathcal{X}\) is the set of all input data points (with individual elements usually encoded in a matrix), and \(\mathcal{Y}\) is the set of all output data points (with individual elements encoded in a vector).
We have a number of matrix-based formulae and concepts to keep in mind for later manipulations and derivations.
See notes
Consider our covariance matrix \(S\).
Its trace is a proxy for the total variability in the data.
The square root of its determinant is a proxy for the volume of the data cloud.
Both are easily computed in R.
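As a sketch, assuming a data matrix X like the one above:

```r
S <- cov(X)       # sample covariance matrix of the data matrix X
sum(diag(S))      # trace: proxy for the total variability
sqrt(det(S))      # square root of the determinant: proxy for the volume of the data cloud
```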
Consider two data sets that are generated from closely related distributions. For the first data set, we have:
\[\begin{align*} X_1&\sim N(0,1),\\ U&\sim N(0,1),\\ X_2&= 0.5X_1 + U. \end{align*}\]
For the second data set, we will simply multiply both variables by 0.5.
Data set | Total variance | Volume of data cloud |
---|---|---|
1 | 1.93 | 0.88 |
2 | 0.48 | 0.22 |
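The exact numbers above depend on the simulated sample; a sketch of the kind of simulation that produces them (the sample size and seed are assumptions) is:

```r
# Simulate the two data sets described above
set.seed(123)
n  <- 100
x1 <- rnorm(n)
u  <- rnorm(n)
x2 <- 0.5 * x1 + u

A <- cbind(x1, x2)   # first data set
B <- 0.5 * A         # second data set: both variables multiplied by 0.5

c(total = sum(diag(cov(A))), volume = sqrt(det(cov(A))))
c(total = sum(diag(cov(B))), volume = sqrt(det(cov(B))))
```

Note that scaling both variables by 0.5 multiplies the trace and the square root of the determinant by 0.25, which is consistent with the table.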
The trace of the covariance matrix is clearly the sum of the marginal variances.
The determinant is more subtle. But recall that, for a 2-by-2 covariance matrix, we have
\[ \det(S) = \sigma_1^2\sigma_2^2(1-\rho^2), \]
where \(\sigma_1^2\) and \(\sigma_2^2\) are the marginal variances and \(\rho\) is the correlation coefficient. If we take the square root of this, we get
\[ \sqrt{\det(S)} = \sigma_1\sigma_2\sqrt{1-\rho^2}, \] which is directly related to the area of an ellipse.
There are a number of manipulations that help with predictions and understanding the data.
See notes
When using machine learning models, we will need to consider how we measure the distance between data points.
There are infinitely many choices for distance measures, and our choice will depend on the context.
See notes
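As a sketch, R's built-in dist function already covers several common choices (assuming a data matrix X as above):

```r
# Pairwise distances between the rows (observations) of a data matrix X
dist(X, method = "euclidean")          # straight-line distance
dist(X, method = "manhattan")          # sum of absolute coordinate differences
dist(X, method = "minkowski", p = 3)   # general L^p distance
```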
Following the notation for data matrices, we have the following model:
\[ Y = X\beta + \epsilon, \]
where \(Y\) is the response vector, \(X\) is the data matrix, \(\beta\) is the vector of coefficients, and \(\epsilon\) is the error term.
In \(X\), it is usual to have a column of 1s to account for the intercept.
See notes
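For reference, minimising the sum of squared errors gives the familiar least-squares estimate (assuming \(X^{T}X\) is invertible):
\[ \hat{\beta} = (X^{T}X)^{-1}X^{T}Y. \]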
If we have multiple variables, we would list them all in the model formula (for example, `y ~ x1 + x2`). Note that the function `lm` utilises least-squares fitting, meaning that we use a Euclidean distance metric.
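A minimal sketch of the call that produces a summary like the output below (the vectors y and x are assumed to already exist in the workspace):

```r
fit <- lm(y ~ x)   # least-squares fit of the simple linear model
summary(fit)       # estimates, standard errors, R-squared and F-statistic
```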
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14294 -0.72029 0.01953 0.70563 1.88121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.55664 0.40287 3.864 0.000605 ***
## x 1.90701 0.06294 30.299 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9876 on 28 degrees of freedom
## Multiple R-squared: 0.9704, Adjusted R-squared: 0.9693
## F-statistic: 918 on 1 and 28 DF, p-value: < 2.2e-16
Linear regression depends on the following assumptions:
See notes
Adding a polynomial term is easy in R (and can be done in multiple ways).
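For example, assuming a response y and predictor x, two common options are:

```r
lm(y ~ x + I(x^2))    # add a raw quadratic term with I()
lm(y ~ poly(x, 2))    # use an orthogonal polynomial of degree 2
```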
It is often useful to understand how much of the variance in the response is explained by the model and how much we can never hope to explain.
When we use the Euclidean distance, we are minimising the sum of the squared errors and this leads us quite naturally to a partitioning of the variance.
See notes
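Concretely, for a least-squares fit that includes an intercept, with fitted values \(\hat{y}_i\) and mean response \(\bar{y}\),
\[ \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{n}(y_i-\hat{y}_i)^2, \]
so the total sum of squares splits into an explained part and a residual part, and \(R^2\) is the explained proportion.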
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27171 -0.18006 0.02867 0.25684 0.74620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1490 0.1167 1.277 0.2047
## x1 0.3952 0.1410 2.804 0.0061 **
## x2 -0.1196 0.1523 -0.786 0.4340
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3982 on 97 degrees of freedom
## Multiple R-squared: 0.0843, Adjusted R-squared: 0.06542
## F-statistic: 4.465 on 2 and 97 DF, p-value: 0.01396
Linear discriminant analysis (LDA) is an old technique for slicing up the input dimensions into regions that are likely to be associated with different classes.
The algorithm boils down to two steps:
See notes
We utilise the `iris` data set to illustrate LDA. We can use the `lda` function in the `MASS` package to fit an LDA model.
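The output below was produced by a call of this form (the object name is an assumption):

```r
library(MASS)                              # provides the lda function
iris_lda <- lda(Species ~ ., data = iris)  # fit LDA using all four measurements
iris_lda                                   # print priors, group means and discriminants
```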
## Call:
## lda(Species ~ ., data = iris)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 5.006 3.428 1.462 0.246
## versicolor 5.936 2.770 4.260 1.326
## virginica 6.588 2.974 5.552 2.026
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.8293776 -0.02410215
## Sepal.Width 1.5344731 -2.16452123
## Petal.Length -2.2012117 0.93192121
## Petal.Width -2.8104603 -2.83918785
##
## Proportion of trace:
## LD1 LD2
## 0.9912 0.0088
LDA depends on the following assumptions:
We can also create the confusion matrix:
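A minimal sketch, assuming hypothetical vectors pred and truth holding the predicted and true class labels (for the LDA fit above, pred could be obtained with predict(iris_lda)$class):

```r
# Cross-tabulate predicted labels against true labels
table(Predicted = pred, Actual = truth)
```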
##
## 1 2 3 4
## 1 15 10 14 11
## 2 5 35 2 8
## 3 0 5 30 15
## 4 11 15 22 4
Again, it is useful to look at the confusion matrix:
##
## 1 2 3 4
## 1 20 0 0 30
## 2 0 0 0 0
## 3 0 0 0 0
## 4 21 0 0 31
##
## 1 2 3 4
## 1 0 0 0 0
## 2 0 40 10 0
## 3 0 9 41 0
## 4 0 0 0 0