In this session, you are going to get some hands-on practice of how
clustering works in R
.
In this exercise, we will simulate data, and then perform PCA and K-means clustering on the data.
We are required to generate a data set with 20 observations in each of three classes (i.e. 60 observations in total), and 50 variables. Read and run the code given below which will help you to start:
# Part (a) generate data:
#
K = 3 # the number of classes
n = 20 # the number of samples per class
p = 50 # the number of variables
set.seed(123)
# Create data for class 1:
X_1 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
X_1[row,] = X_1[row,] + rep( 1, p )
}
# Create data for class 2:
X_2 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
X_2[row,] = X_2[row,] + rep( -1, p )
}
# Create data for class 3:
X_3 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
X_3[row,] = X_3[row,] + c( rep( 1, p/2 ), rep( -1, p/2 ) )
}
X <- rbind(X_1, X_2, X_3)
dim(X)
## [1] 60 50
Perform K-means clustering on the observations with \(K = 3\). How well do the clusters that you
obtained in K-means clustering compare to the true class labels? Use
fviz_cluster
function for visualisation.
# Kmeans on the original data
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Xs <- scale(X)
kmean.out = kmeans(Xs, centers=3, nstart=25)
fviz_cluster(kmean.out, Xs, ellipse.type = "norm")
What is clear from this result is that the clusters are well separated.
Indeed we generated non-overlapping clusters.
To check the optimal number of clusters, we perform the silhouette analysis. First run \(K\)-means clustering for different sizes of \(K = 2, 3, \ldots, 6\).
kmean1 = kmeans(Xs, centers=1, nstart=25)
kmean2 = kmeans(Xs, centers=2, nstart=25)
kmean3 = kmeans(Xs, centers=3, nstart=25)
kmean4 = kmeans(Xs, centers=4, nstart=25)
kmean5 = kmeans(Xs, centers=5, nstart=25)
kmean6 = kmeans(Xs, centers=6, nstart=25)
Or there is another way of doing this.
set.seed(2023)
for(i in 1:6){
assign(paste("kmean", i, sep=""), kmeans(Xs, centers=i, nstart=25))
}
Now we compute the silhouette width for each clustering output and plot it to find the optimal \(K\) where the silhouette width is maximised.
Here we can use the for loop again.
library(cluster)
sil.width <- rep(NA, 6)
sil.width[1] <- 0
for(i in 2:6){
kmean.out <- get(paste("kmean", i, sep=""))
sil <- silhouette(kmean.out$cluster, dist(Xs))
sil.width[i] <- summary(sil)$avg.width
}
plot(1:6, sil.width, type="b", pch=19, xlab="k", ylab="silhouette width")
abline(v=which.max(sil.width), col=2, lty=2)
It seems that \(k=3\) is a reasonable choice. We can get the similar plot by running below.
library(mclust)
## Package 'mclust' version 6.0.1
## Type 'citation("mclust")' for citing this R package in publications.
#K-means
fviz_nbclust(Xs, kmeans, method = "silhouette")+labs(title = "K-means")
However, the silhoutte width is maximised at \(k=2\) here. Why? because the function fviz_nbclust() use the kmeans() inside but with the default inputs e.g. nstart is set to 1 and iter.max is set to 10. If we run the above code built on silhouette() function after tweaking the objects kmean1, kmean2, …, kmean6 again without set.seed() and using different nstart e.g. nstart=1, then it will give different results whenever you run the code. To get a robust result, it is important to try different initial clusters and select the best results which you can control by changing nstart option in kmeans() function.
Repeat the previous Tasks with the updated data shown below (only the variables \(X_2\) and \(X_3\) are updated). Are the two optimal \(K\)s obtained as the same?
# Part (a) generate data:
#
K = 3 # the number of classes
n = 20 # the number of samples per class
p = 50 # the number of variables
# Create data for class 1:
X_1 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
X_1[row,] = X_1[row,] + rep( 1, p )
}
# Create data for class 2:
X_2 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
X_2[row,] = X_2[row,] + rep( -5, p )
}
# Create data for class 3:
X_3 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
X_3[row,] = X_3[row,] + c( rep( 5, p/2 ), rep( -5, p/2 ) )
}
X <- rbind(X_1, X_2, X_3)
Xs <- scale(X)
kmean.out = kmeans(Xs, centers=3, nstart=25)
fviz_cluster(kmean.out, Xs, ellipse.type = "norm")
It is highly likely that the optimal \(K\)s are obtained as the same due to stronger signal (i.e. clusters are more clearly detected even by eyes).
Start by reading and exploring the UK elections
dataset
ele <- read.csv('UK_elections.csv')
The number of variables in this data set (58!) is too high to explore them all at once (unless they all belonged to the same low dimensional vector space). It is also too diverse, containing a range of different socio-economic variables (e.g. car ownership, rental status, SeC, work seeking) and demographic variables (age, cohabitation, lone parents, household size) on top of election results. The most obvious thing is to do a cluster analysis on a subset of variables, for example the results in the 2017 election.
# We will create an object that contains only 'Name' (of the constituency)
# and number of votes per party
ele_sub <- ele[,c(1,6:13)]
For each party in each constituency, convert raw numbers of votes into a percentage share of votes, to take account of differences in population between constituencies.
Solution to Task 2a
# Calculate how many total votes were cast in each constituency
ele2017_total <- rowSums(ele_sub[,2:9])
# Calculate what percentage of votes went to each party in each constituency
percent2017 <- as.data.frame(apply(ele_sub[,2:9], 2, function(x) (x/ele2017_total)*100))
The resulting data shows that two parties (Conservative and Labour) dominate the elections in most constituencies. It makes sense to start our cluster analysis using only these two numerical variables.
Compute the optimal number of clusters and perform k-means cluster analysis on the data set where only the Labour and Conservative votes are considered.
Solution to Task 2b
We start by analyzing the optimal number of clusters:
library(factoextra)
perc17.scaled <- scale(percent2017[,(1:2)])
par(mfrow=c(1, 2))
fviz_nbclust(perc17.scaled, kmeans, method = "wss")
fviz_nbclust(perc17.scaled, kmeans, method = "silhouette")+
labs(title = "K-means")
There seems to be little doubt that 2 is an ideal number of clusters. Let us see how they look like in the data space.
km.res <- kmeans(perc17.scaled, 2, nstart = 25)
# Visualise kmeans clustering
fviz_cluster(km.res, perc17.scaled, ellipse.type = "norm") +
geom_abline(intercept = 0.3, slope = 1.1, col="blue")
Interpretation is not too difficult here: cluster 1 corresponds to
constituencies where Conservative votes are higher, while cluster 2
corresponds to constituencies where Labour votes are higher.
Clustering on Labour and Conservative votes is not particularly useful, as there is quite a simple relationship between them: when one goes up, the other will usually go down. So before we assess our results further, we will include the other political parties.
library(factoextra)
percent17.scaled <- scale(percent2017)
par(mfrow=c(1, 2))
fviz_nbclust(percent17.scaled, kmeans, method = "wss")
fviz_nbclust(percent17.scaled, kmeans, method = "silhouette")+
labs(title = "K-means")
In this case, there does not seem to be an optimal number of clusters as above. No clear elbow, a faint indication that 8 is the number that optimizes the algorithm. This is an example of a situation where hyerarchical clustering might be useful, but we haven’t seen how to perform that, so just keep this in mind for the next workshop.
Perform k-medoids cluster analysis on the data set where all the parties are considered for 4, 5 and 6 clusters. For each output, compare size and centers.
Solution to Task 2c
library(cluster)
pam.res4 <- clara(percent17.scaled, k=4)
pam.res5 <- clara(percent17.scaled, k=5)
pam.res6 <- clara(percent17.scaled, k=6)
To find the size of clusters, you should look into the
clusinfo
category of output:
pam.res4$clusinfo
## size max_diss av_diss isolation
## [1,] 205 10.322825 2.025727 5.466321
## [2,] 202 16.371215 2.041803 10.930066
## [3,] 165 9.391773 1.763984 6.270316
## [4,] 1 0.000000 0.000000 0.000000
pam.res5$clusinfo
## size max_diss av_diss isolation
## [1,] 196 10.227159 1.876164 4.348961
## [2,] 272 16.399774 1.682666 6.973782
## [3,] 68 4.090901 1.873592 1.282469
## [4,] 36 7.489162 1.927639 1.620978
## [5,] 1 0.000000 0.000000 0.000000
pam.res6$clusinfo
## size max_diss av_diss isolation
## [1,] 194 6.518499 1.786001 2.7719036
## [2,] 270 16.399774 1.616775 6.9737821
## [3,] 68 4.090901 1.873592 1.2824688
## [4,] 7 5.205576 3.089669 0.5089953
## [5,] 33 2.794454 1.586896 0.5651787
## [6,] 1 0.000000 0.000000 0.0000000
To look at the centers, look at the i.med
category of
output:
pam.res4$i.med
## [1] 569 203 450 76
pam.res5$i.med
## [1] 166 351 511 106 76
pam.res6$i.med
## [1] 166 351 511 8 116 76
At five clusters the situation seems to get stable (4 out of 5 of the centers are in common with the six clusters analysis). To understand the clusters, we can look at the centers as prototypical constituencies for the corresponding cluster.
pam.res5$medoids
## X17CON X17LAB X17LIB X17UKIP X17Green X17NAT
## [1,] -1.0488208 1.2798079 -0.5524193 -0.1774475 -0.10718192 -0.191713
## [2,] 0.6327855 -0.2765234 -0.4204159 0.3319728 -0.04995112 -0.191713
## [3,] 0.8549990 -0.9475646 0.3063692 0.2632802 0.37454950 -0.191713
## [4,] -0.3887219 -1.4057111 3.8959074 -0.9937119 -0.27287256 -0.191713
## [5,] -1.6666415 -0.9340065 -0.8322616 -0.4853175 19.58942683 -0.191713
## X17MIN X17OTH
## [1,] -0.215237 -0.6053822
## [2,] -0.215237 -0.6053822
## [3,] -0.215237 2.3883482
## [4,] 0.179807 0.2555066
## [5,] -0.215237 0.9016834
From this information we gather that:
To plot the result of our cluster analysis, fviz_cluster() is not an ideal command (PC1 and PC2 do not encode enough variability). With a little effort, we can figure out how to do this by hand:
df <- as.data.frame(percent17.scaled)
# Plot results
plot(df$X17CON, df$X17LAB, type="n", xlab="Conservative", ylab="Labour")
text(x=df$X17CON, y=df$X17LAB, labels=ele$Name, col=pam.res5$cluster+1, cex=0.75)
legend("topright", c("1","2","3","4","5"), col=c(2:6), lwd=5, cex=0.5)
abline(v=0, col=1, lwd=2)
abline(h=0, col=1, lwd=2)
There are too many data points to make this readable, so an appropriate selection might be a wiser choice:
df <- as.data.frame(percent17.scaled)[(76:111),]
# Plot results
plot(df$X17CON, df$X17LAB, type="n", xlab="Conservative", ylab="Labour")
text(x=df$X17CON, y=df$X17LAB, labels=ele$Name[76:111], col=pam.res5$cluster[76:111]+1, cex=0.75)
legend("topright", c("1","2","3","4","5"), col=c(2:6), lwd=5, cex=0.5)
abline(v=0, col=1, lwd=2)
abline(h=0, col=1, lwd=2)
Finding clusters is a good things in itself: it allows to uncover hidden pattern in a data set as long as it is done in a reasonable way (e.g. encoding enough variance or using prior knowledge that the observations are naturally split into groups). However, once appropriate clusters are discovered, they can be used for additional data exploration. In this exercise you will see how to apply the results of cluster analysis obtained in Exercise 2 to analyze your data in a more structured way.
Perform the k-medoids cluster analysis shown above using the 2019 election data and comment on possible differences with respect to the the 2017 data.
Solution to Task 3a
You first prepare the data:
ele.19 <- ele[,c(1,15:22)]
ele2019_total <- rowSums(ele.19[,2:9])
# Calculate what percentage of votes went to each party in each constituency
percent2019 <- as.data.frame(apply(ele.19[,2:9], 2, function(x) (x/ele2019_total)*100))
Then scale it and perform k-medoid clustering
library(factoextra)
library(cluster)
percent19.scaled <- as.data.frame(scale(percent2019))
par(mfrow=c(1, 2))
fviz_nbclust(percent19.scaled, kmeans, method = "wss")
fviz_nbclust(percent19.scaled, clara, method = "silhouette")+
labs(title = "K-medoids")
Again, any number from two to five seems to be a reasonable choice (seven clusters look fine too, but the analysis risks to be cumbersome).
pam.res2 <- clara(percent19.scaled, k=2)
pam.res3 <- clara(percent19.scaled, k=3)
pam.res4 <- clara(percent19.scaled, k=4)
pam.res5 <- clara(percent19.scaled, k=5)
Size of clusters and centers
pam.res2$clusinfo
## size max_diss av_diss isolation
## [1,] 277 11.86485 2.080532 3.95653
## [2,] 296 17.49336 1.869738 5.83345
pam.res3$clusinfo
## size max_diss av_diss isolation
## [1,] 298 11.87253 2.083716 4.419493
## [2,] 274 13.07943 1.728457 4.868756
## [3,] 1 0.00000 0.000000 0.000000
pam.res4$clusinfo
## size max_diss av_diss isolation
## [1,] 269 11.872530 2.068149 4.706319
## [2,] 194 13.055427 1.497947 6.761947
## [3,] 109 9.383266 1.586106 4.859983
## [4,] 1 0.000000 0.000000 0.000000
pam.res5$clusinfo
## size max_diss av_diss isolation
## [1,] 264 9.000684 1.921623 3.5679077
## [2,] 194 13.055427 1.497947 6.7619474
## [3,] 109 9.383266 1.586106 4.8599829
## [4,] 5 4.849306 2.438202 0.4084476
## [5,] 1 0.000000 0.000000 0.0000000
pam.res2$i.med
## [1] 192 73
pam.res3$i.med
## [1] 460 298 76
pam.res4$i.med
## [1] 460 454 349 76
pam.res5$i.med
## [1] 460 454 349 164 76
This situation looks even stabler than the one for 2017: from three clusters on, adding one cluster essentially splits a big cluster into two. By contrast, from two to three, a new cluster containing a mix of the previous two seem to emerge (hinting that a third party with a profile between Labour and Conservative was not accounted for in the “two clusters scenario”).
Now it is reasonable to make a choice of clusters (from 3 to 5) and explore their features. A good choice is 4, since in the 5 clusters scenario there is a cluster of size 5 that we might want to ignore (it might be a good idea to remove the observation n.76 too, but you have to do it manually and doesn’t change the final outcome).
pam.res4$medoids
## X19CON X19LAB X19LIB X19Brexit X19Green X19NAT
## [1,] -0.6460132 0.8111299 -0.3596244 0.2243878 0.1197174 -0.1771745
## [2,] 1.0283159 -0.7612310 0.0335357 -0.5994159 0.3101897 -0.1771745
## [3,] 1.0500274 -0.7673488 -0.1296122 -0.5994159 0.1141432 -0.1771745
## [4,] -1.7810188 -0.7287781 -1.1196533 -0.2901117 17.3546892 -0.1771745
## X19MIN X19OTH
## [1,] -0.1498742 -0.2407367
## [2,] -0.1498742 -0.7083936
## [3,] -0.1498742 1.2052729
## [4,] -0.1498742 0.4192323
According to this analysis:
Plotting some of the labelled observation gives an idea of the positioning of these clusters.
df <- as.data.frame(percent19.scaled)[(76:111),]
# Plot results
plot(df$X19CON, df$X19LAB, type="n", xlab="Conservative", ylab="Labour")
text(x=df$X19CON, y=df$X19LAB, labels=ele.19$Name[76:111], col=pam.res4$cluster[76:111]+1, cex=0.75)
legend("topright", c("1","2","3","4","5"), col=c(2:6), lwd=5, cex=0.5)
abline(v=0, col=1, lwd=2)
abline(h=0, col=1, lwd=2)
Several things can be said about the difference with 2017. The most interesting might be the different divide between conservatives. This is mainly due to Brexit Party in 2019 being much less popular than UKIP in 2017.
A word of caution: the cluster associated with Labour votes in 2019 is much bigger than the one in 2017. This doesn’t mean that Labour performed better, just that the constituencies that had a higher percentage of labour votes were more similar to each other, so the algorithm clustered them together. But those red labels don’t correspond to seats won by labour by any means!Clustering gives you a way to associate a new categorical variable
(the cluster type) to your observations. This, in turn, allows to use
Correspondence Analysis to check association of the cluster type with
other categorical variables. Unfortunately, the original
Elections_UK
data set contains only numerical variables.
Fortunately, we can easily turn these into categorical ones.
Choose a variable that you want to know if it is associated with any
particular cluster (The solution will use X.seekingwork
associated with people seeking work, and X.Students
associated with student population). Change the value of each
observation for that variable from numerical to categorical. A good way
to do this is to put observations into buckets (e.g. “Low”, “Average”
,“High” number of students) according to their numerical values.
You need to add a new variable to the ele
data frame. You
can do this using the within() command.
Here is an example that uses the X.seekingwork
variable:
ele <- within(ele, {
seekingwork.cat <- NA # need to initialize variable
seekingwork.cat[X.seekingwork < quantile(ele$X.seekingwork)[2]] <- "Low"
seekingwork.cat[X.seekingwork >= quantile(ele$X.seekingwork)[2] & X.seekingwork < quantile(ele$X.seekingwork)[3]] <- "Below Average"
seekingwork.cat[X.seekingwork >= quantile(ele$X.seekingwork)[3] & X.seekingwork < quantile(ele$X.seekingwork)[4]] <- "Above Average"
seekingwork.cat[X.seekingwork >= quantile(ele$X.seekingwork)[4]] <- "High"
} )
Same can be done with the X.Students
variable
ele <- within(ele, {
Students.cat <- NA # need to initialize variable
Students.cat[X.Students < quantile(ele$X.Students)[2]] <- "Low"
Students.cat[X.Students >= quantile(ele$X.Students)[2] & X.Students < quantile(ele$X.Students)[3]] <- "Below Average"
Students.cat[X.Students >= quantile(ele$X.Students)[3] & X.Students < quantile(ele$X.Students)[4]] <- "Above Average"
Students.cat[X.Students >= quantile(ele$X.Students)[4]] <- "High"
} )
We are now almost ready for the correspondence analysis.
Create a contingency table where rows are clusters and columns the buckets (e.g. “Very Low”, “Low”, “Average” ,“High”, “Very High” number of students) created earlier. Finally, perform CA to analyze association between cluster type and your chosen variable.
Solution to Task 3c
First, we
table(ele$Students.cat, pam.res4$cluster)
##
## 1 2 3 4
## Above Average 78 42 23 0
## Below Average 49 68 26 0
## High 119 16 8 1
## Low 23 68 52 0
Then we perform CA and plot the results.
library(FactoMineR)
library(factoextra)
tab <- table(ele$Students.cat, pam.res4$cluster)
student.ca <- CA(tab[,-4])
The correspondence analysis highlights an association between conservative votes (especially Cluster 3 consisting of socially conservative votes) and low number of students registered in the area, while progressive votes live in areas with a high student population.
You can do the same with any variable of your choice, of course:
tab2 <- table(ele$seekingwork.cat, pam.res4$cluster)
seekingwork.ca <- CA(tab2[,-4])
Similarly, progressive votes are higher in areas with high unemployment
rates, while conservative votes are higher in areas with low employment
rates. Interestingly, there is not a big difference between socially
conservative and conservative votes here.