This computer practical consists of three exercises.
First load the following package into your workspace for visualisation.
library(factoextra)
In this first part, you will analyze data from the USArrests data set, which is built into R.
For each of the \(n = 50\) states in the United States, the data set contains the number of arrests per 100,000 residents made in 1973 for each of three types of violent crime: Assault, Murder, and Rape. The data set also records, for each state, the percentage of the population living in urban areas.
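A quick look at the first few rows gives a feel for the data:
head(USArrests)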
Perform principal component analysis on USArrests and answer the following questions:
PCA is justified by the fact that the crime-rate variables (especially Assault and Murder) are highly correlated, as shown by the correlation matrix:
cor(USArrests)
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
It is therefore appropriate to perform dimension reduction via PCA.
As we did in the workshop, we can perform PCA using the prcomp() command. We need to scale the data because the variables are not measured in the same units, and the variance of one variable (Assault) is substantially higher than that of the others. This makes sense, as assault is on average much more common than the other two types of crime (see the summary table).
summary(USArrests)       # per-variable summaries
apply(USArrests, 2, sd)  # the standard deviations differ widely across variables
pr.out <- prcomp(USArrests, scale = TRUE)
The loading vectors are the columns of the rotation matrix:
pr.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 -0.4181809 0.3412327 0.64922780
## Assault -0.5831836 -0.1879856 0.2681484 -0.74340748
## UrbanPop -0.2781909 0.8728062 0.3780158 0.13387773
## Rape -0.5434321 0.1673186 -0.8177779 0.08902432
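As a quick check, the loading vectors are orthonormal, so the cross-product of the rotation matrix is the identity (up to rounding):
round(crossprod(pr.out$rotation), 10)  # t(V) %*% V = I, since the loadings are orthonormal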
The scores are obtained as follows:
pr.out$x
The importance of the principal components can be computed from
pr.out$sdev
as in the workshop, or more easily from the
summary:
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
We see here that \(PC1\) alone explains 62% of the variance, which on its own does not retain enough information. However, \(PC1\) and \(PC2\) together explain 86.75% of the variance in the data, enough to give an informative picture.
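These proportions can also be reproduced directly from the standard deviations stored in pr.out$sdev (a small sketch; pve is just a helper name introduced here):
pve <- pr.out$sdev^2 / sum(pr.out$sdev^2)  # proportion of variance explained
pve
cumsum(pve)  # cumulative proportions, matching the summary above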
The library factoextra has a simple command that can be used to make a nice scree plot:
fviz_screeplot(pr.out, addlabels = TRUE)
This is an effective way to visualize the importance of the principal components.
Other useful functions for visualization in the factoextra package are fviz_pca_ind(), fviz_pca_var() and fviz_pca_biplot().
Explore the functions above (e.g. by typing
?fviz_pca_ind
) and describe the kind of plots that you can
obtain using them.
Let us start with fviz_pca_ind(). This command plots the scores of the observations on the first two principal components. This is informative, as we saw that \(PC1\) and \(PC2\) explain a large proportion (more than 80%) of the variance.
fviz_pca_ind(pr.out, axes = c(1, 2), repel = TRUE,
             col.ind = "blue")
We can compare this to the autoplot() function and see that it is just another way of creating the same visualization. The advantage of fviz_pca_ind() is that it is somewhat more flexible when you want to change options.
# install.packages('ggfortify')  # uncomment if ggfortify is not installed
library(ggfortify)
autoplot(pr.out, data = USArrests, label = TRUE)
The command fviz_pca_var() plots the loadings of the first two principal components.
fviz_pca_var(pr.out, axes = c(1, 2), repel = TRUE,
             col.var = "red")
In the case of USArrests, we can see that the lengths of the four vectors are very similar. This means that the first two principal components represent the original variables quite fairly. As expected, the crime variables are the ones that contribute most to the first principal component, while the UrbanPop variable contributes predominantly to the second principal component.
Finally, fviz_pca_biplot() plots the scores and loadings together.
fviz_pca_biplot(pr.out, axes = c(1, 2), repel = TRUE,
                col.var = "red")
This is very helpful for interpreting the data: based only on the plot of the scores, one might think that a higher \(PC1\) score corresponds to higher crime rates. The biplot shows instead that the states on the left of the plot are those with higher crime rates.
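We can check this directly: since all three crime variables load negatively on \(PC1\) (see the rotation matrix above), the states with the most negative \(PC1\) scores should have the highest crime rates:
head(USArrests[order(pr.out$x[, 1]), ])  # states ordered by PC1 score, most negative first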
You can plot the contribution of the variables to PC1 as follows:
fviz_contrib(pr.out, choice = "var", axes = 1, top = 10)
Modify the code to plot the contribution of the variables to PC2.
# Contributions of variables to PC2
fviz_contrib(pr.out, choice = "var", axes = 2, top = 10)
The factoextra library has commands to quickly extract PCA results for the variables. These results include the correlations between the variables and the principal components (cor), the quality of representation (cos2), and the contributions (contrib). They can be obtained, respectively, as follows:
var <- get_pca_var(pr.out)
var$cor
## Dim.1 Dim.2 Dim.3 Dim.4
## Murder -0.8439764 -0.4160354 0.2037600 0.27037052
## Assault -0.9184432 -0.1870211 0.1601192 -0.30959159
## UrbanPop -0.4381168 0.8683282 0.2257242 0.05575330
## Rape -0.8558394 0.1664602 -0.4883190 0.03707412
var$cos2
## Dim.1 Dim.2 Dim.3 Dim.4
## Murder 0.7122962 0.1730854 0.04151814 0.073100217
## Assault 0.8435380 0.0349769 0.02563817 0.095846950
## UrbanPop 0.1919463 0.7539938 0.05095143 0.003108430
## Rape 0.7324611 0.0277090 0.23845544 0.001374491
var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## Murder 28.718825 17.487524 11.643977 42.149674
## Assault 34.010315 3.533859 7.190358 55.265468
## UrbanPop 7.739016 76.179065 14.289594 1.792325
## Rape 29.531844 2.799553 66.876071 0.792533
If you have a hard time making sense of these numbers, don’t worry:
there will be more on this in the lectures.
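As a first pointer, the three quantities are closely related: cos2 is simply the squared correlation, and contrib rescales cos2 so that each column sums to 100%. A small sketch to verify this:
# cos2 is the squared correlation; contrib rescales each column of cos2 to sum to 100
all.equal(var$cos2, var$cor^2)
all.equal(var$contrib, 100 * sweep(var$cos2, 2, colSums(var$cos2), "/"))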
Generate \(n = 200\) observations of \(p = 30\) variables from a multivariate normal distribution with correlation matrix given by \(\mathrm{corr}(x_{i},x_{j}) = \rho^{|i-j|}\). You can create the covariance matrix using
covmat <- function(rho, p) {
rho^(abs(outer(seq(p), seq(p), "-")))
}
and use the mvrnorm function in the MASS package to generate the multivariate Gaussian data.
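As a quick sanity check of covmat(), for \(p = 3\) the entries decay geometrically with distance from the diagonal:
covmat(0.95, 3)  # entries are 1, 0.95 and 0.95^2 = 0.9025, depending on |i - j|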
Take \(\rho=0.95\) and carry out PCA. For reproducibility, set the pseudo-random number generator seed to 123. How many PCs are sufficient to retain at least 80% of the variance? Comment on your general observation. (Hint: use the following code to generate the data, called dd.)
library(MASS)
set.seed(123)
p = 30
n = 200
cov.e = covmat(0.95, p)
mean.e = rep(0, p)
dd <- mvrnorm(n, mean.e, cov.e)
prr <- prcomp(dd)  # no need to scale: the variables have unit variance by construction
summary(prr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.1415 2.2933 1.3464 1.00341 0.77935 0.63838 0.48520
## Proportion of Variance 0.6179 0.1895 0.0653 0.03627 0.02188 0.01468 0.00848
## Cumulative Proportion 0.6179 0.8073 0.8726 0.90886 0.93074 0.94542 0.95390
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.41615 0.40216 0.37139 0.29564 0.29060 0.26536 0.2469
## Proportion of Variance 0.00624 0.00583 0.00497 0.00315 0.00304 0.00254 0.0022
## Cumulative Proportion 0.96014 0.96597 0.97093 0.97408 0.97712 0.97966 0.9819
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.24380 0.22571 0.21583 0.20360 0.18959 0.18087 0.17804
## Proportion of Variance 0.00214 0.00184 0.00168 0.00149 0.00129 0.00118 0.00114
## Cumulative Proportion 0.98400 0.98583 0.98751 0.98900 0.99030 0.99148 0.99262
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.1744 0.16711 0.16001 0.15574 0.15240 0.14561 0.1390
## Proportion of Variance 0.0011 0.00101 0.00092 0.00087 0.00084 0.00076 0.0007
## Cumulative Proportion 0.9937 0.99472 0.99564 0.99652 0.99735 0.99812 0.9988
## PC29 PC30
## Standard deviation 0.12848 0.12825
## Proportion of Variance 0.00059 0.00059
## Cumulative Proportion 0.99941 1.00000
plot(summary(prr)$importance[2,], type="b", xlab="PCs", ylab="Variability explained")
The summary shows that the first two components are enough to retain at least 80% of the variability in the data.
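Rather than reading this off the table, the smallest number of components reaching 80% can be computed directly (the same line works unchanged for the \(\rho = 0.2\) data below):
# smallest k such that the first k PCs explain at least 80% of the variance
which(cumsum(prr$sdev^2) / sum(prr$sdev^2) >= 0.8)[1]
## [1] 2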
Take \(\rho=0.2\) and carry out PCA. As before, set the pseudo-random number generator seed to 123 for reproducibility. How many PCs are sufficient to retain at least 80% of the variance? Compare your results with the case \(\rho=0.95\).
library(MASS)
set.seed(123)
p = 30
n = 200
cov.e = covmat(0.2, p)
mean.e = rep(0, p)
dd <- mvrnorm(n, mean.e, cov.e)
prr <- prcomp(dd)
summary(prr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.46267 1.33775 1.32938 1.28976 1.22917 1.17781 1.17183
## Proportion of Variance 0.07223 0.06042 0.05966 0.05616 0.05101 0.04683 0.04636
## Cumulative Proportion 0.07223 0.13265 0.19231 0.24847 0.29948 0.34631 0.39267
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.16283 1.13430 1.1074 1.05365 1.03779 1.01332 0.99861
## Proportion of Variance 0.04565 0.04344 0.0414 0.03748 0.03636 0.03467 0.03367
## Cumulative Proportion 0.43832 0.48176 0.5232 0.56065 0.59701 0.63167 0.66534
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.96032 0.92220 0.91021 0.88108 0.85584 0.83619 0.81532
## Proportion of Variance 0.03114 0.02871 0.02797 0.02621 0.02473 0.02361 0.02244
## Cumulative Proportion 0.69648 0.72519 0.75316 0.77937 0.80410 0.82770 0.85015
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.78841 0.75942 0.74078 0.71840 0.70589 0.69243 0.66375
## Proportion of Variance 0.02099 0.01947 0.01853 0.01742 0.01682 0.01619 0.01487
## Cumulative Proportion 0.87113 0.89060 0.90913 0.92655 0.94338 0.95956 0.97444
## PC29 PC30
## Standard deviation 0.62502 0.60544
## Proportion of Variance 0.01319 0.01238
## Cumulative Proportion 0.98762 1.00000
plot(summary(prr)$importance[2,], type="b", xlab="PCs", ylab="Variability explained")
The summary shows that 19 components are needed to retain 80% of the variability in the data (the one-line check above indeed returns 19 here). This illustrates that the higher the correlation between variables, the greater the redundancy in the data; it is in this high-correlation setting that PCA is most useful and gives the clearest interpretation.
One of the many uses of PCA is principal component regression (PCR). We will get a taste of how this works on the Iris data.
We have seen that Sepal length is highly correlated with Petal length and Petal width. Suppose that we want to carry out a regression analysis with Sepal length as the response variable and the three remaining variables as covariates. Then we can use PCA to improve our analysis.
Perform the PCA and find the scores of the principal components of
the covariates.
iris.pca <- prcomp(iris[, 2:4], scale = TRUE)  # PCA of the three covariates
iris.pca$x  # scores of the principal components
Run a linear regression using the lm() command. This carries out the principal component regression (PCR, not PCA!). Use just the first two principal components as covariates in the linear regression model. As hinted before, the response variable should be Sepal length.
Z <- iris.pca$x[, 1:2]  # select the first two PCs
iris.lm <- lm(iris$Sepal.Length ~ Z)
iris.lm
##
## Call:
## lm(formula = iris$Sepal.Length ~ Z)
##
## Coefficients:
## (Intercept) ZPC1 ZPC2
## 5.8433 0.4230 -0.4348
Now you can transform this coefficient vector back to the scale of the original variables as follows (this might look mysterious for now: come back to it after the lecture!). This gives the final PCR estimator, whose dimension equals the total number of original covariates.
iris.pca$rotation[, 1:2] %*% matrix(coef(iris.lm)[-1], ncol = 1)
## [,1]
## Sepal.Width 0.2173767
## Petal.Length 0.3853085
## Petal.Width 0.4150270
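As a quick sanity check (a small sketch; beta and X.scaled are helper names introduced here), these back-transformed coefficients reproduce the fitted values of the regression when applied to the scaled covariates:
beta <- iris.pca$rotation[, 1:2] %*% matrix(coef(iris.lm)[-1], ncol = 1)
X.scaled <- scale(iris[, 2:4])  # same centring and scaling as prcomp(..., scale = TRUE)
all.equal(as.numeric(coef(iris.lm)[1] + X.scaled %*% beta),
          as.numeric(fitted(iris.lm)))
## [1] TRUE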
In practice, we do not need to carry out PCR ourselves in this way, as it is implemented in the pls R package. Carry out PCR using the pcr function (see the documentation of the pls package) and compare the results.
library(pls)
iris.pcr <- pcr(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                ncomp = 2,  # number of principal components used in the regression
                scale = TRUE, data = iris)
coef(iris.pcr)
## , , 2 comps
##
## Sepal.Length
## Sepal.Width 0.2173767
## Petal.Length 0.3853085
## Petal.Width 0.4150270
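These agree with the manual PCR estimator beta computed in the sketch above:
all.equal(drop(coef(iris.pcr)), drop(beta))
## [1] TRUE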