In this session, you will get some hands-on practice with how clustering works in R.


Exercise 1

In this exercise, we will simulate data, and then perform PCA and K-means clustering on the data.

We are required to generate a data set with 20 observations in each of three classes (i.e. 60 observations in total) and 50 variables. Read and run the code given below, which will help you get started:

# Part (a) generate data:
#
K = 3  # the number of classes
n = 20 # the number of samples per class
p = 50 # the number of variables

set.seed(123)
# Create data for class 1:
X_1 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
  X_1[row,] = X_1[row,] + rep( 1, p )
}

# Create data for class 2:
X_2 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
  X_2[row,] = X_2[row,] + rep( -1, p )
}

# Create data for class 3:
X_3 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
  X_3[row,] = X_3[row,] + c( rep( 1, p/2 ), rep( -1, p/2 ) )
}

X <- rbind(X_1, X_2, X_3)
dim(X)
## [1] 60 50


Task 1a

Perform K-means clustering on the observations with \(K = 3\). How well do the clusters obtained from K-means clustering compare to the true class labels? Use the fviz_cluster() function for visualisation.

Solution to Task 1a
Click for solution


# Kmeans on the original data
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Xs <- scale(X) 

kmean.out = kmeans(Xs, centers=3, nstart=25)
fviz_cluster(kmean.out, Xs, ellipse.type = "norm")

What is clear from this result is that the clusters are well separated. Indeed we generated non-overlapping clusters.
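The task also asks how the clusters compare with the true class labels. Here is a minimal check (a sketch, not part of the original code); it assumes the true labels follow the row order used in rbind(), i.e. 20 rows per class:

# True labels implied by how X was stacked: class 1, then class 2, then class 3
true.labels <- rep(1:3, each = n)
# Cross-tabulate true labels against the K-means assignment; a perfect recovery
# shows a single non-zero count per row (the cluster numbers themselves are arbitrary)
table(true = true.labels, cluster = kmean.out$cluster)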


Task 1b

To check the optimal number of clusters, we perform a silhouette analysis. First, run \(K\)-means clustering for different values of \(K = 2, 3, \ldots, 6\).


Solution to Task 1b
Click for solution
kmean1 = kmeans(Xs, centers=1, nstart=25)
kmean2 = kmeans(Xs, centers=2, nstart=25)
kmean3 = kmeans(Xs, centers=3, nstart=25)
kmean4 = kmeans(Xs, centers=4, nstart=25)
kmean5 = kmeans(Xs, centers=5, nstart=25)
kmean6 = kmeans(Xs, centers=6, nstart=25)

Alternatively, the same objects can be created in a loop:

set.seed(2023)
for(i in 1:6){
  assign(paste("kmean", i, sep=""), kmeans(Xs, centers=i, nstart=25))
}
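If you prefer to avoid assign() and get(), the same fits can be stored in a list instead (a sketch, not part of the original solution):

set.seed(2023)
kmean.list <- vector("list", 6)
for(i in 1:6){
  kmean.list[[i]] <- kmeans(Xs, centers=i, nstart=25)
}
# e.g. kmean.list[[3]] is the K = 3 fit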


Task 1c

Now we compute the silhouette width for each clustering output and plot it to find the optimal \(K\) where the silhouette width is maximised.


Solution to Task 1c
Click for solution

Here we can use a for loop again. Note that the silhouette width is not defined for \(K = 1\), so we store 0 as a placeholder for that entry.

library(cluster)
sil.width <- rep(NA, 6)
sil.width[1] <- 0

for(i in 2:6){
  kmean.out <- get(paste("kmean", i, sep=""))
  sil <- silhouette(kmean.out$cluster, dist(Xs))
  sil.width[i] <- summary(sil)$avg.width
}

plot(1:6,  sil.width, type="b", pch=19, xlab="k", ylab="silhouette width")
abline(v=which.max(sil.width), col=2, lty=2)

It seems that \(k=3\) is a reasonable choice. We can get a similar plot by running the code below.

library(mclust)
## Package 'mclust' version 6.0.1
## Type 'citation("mclust")' for citing this R package in publications.
#K-means
fviz_nbclust(Xs, kmeans, method = "silhouette")+labs(title = "K-means")

However, the silhouette width is maximised at \(k=2\) here. Why? Because fviz_nbclust() calls kmeans() internally with its default arguments, e.g. nstart = 1 and iter.max = 10. If we rebuild the objects kmean1, kmean2, …, kmean6 with the silhouette() code above, but without set.seed() and with nstart = 1, the results will change every time the code is run. To get a robust result, it is important to try several sets of initial clusters and keep the best solution, which you control through the nstart argument of kmeans().
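One way around this, assuming (as the factoextra documentation indicates) that fviz_nbclust() forwards extra arguments to the clustering function, is to pass nstart directly:

# Sketch: forward nstart = 25 to the kmeans() calls made inside fviz_nbclust()
fviz_nbclust(Xs, kmeans, method = "silhouette", nstart = 25) +
  labs(title = "K-means, nstart = 25")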


Task 1d

Repeat the previous tasks with the updated data shown below (only the class matrices \(X_2\) and \(X_3\) are updated). Are the two optimal \(K\)s the same?

# Part (a) generate data:
#
K = 3  # the number of classes
n = 20 # the number of samples per class
p = 50 # the number of variables

# Create data for class 1:
X_1 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
  X_1[row,] = X_1[row,] + rep( 1, p )
}

# Create data for class 2:
X_2 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
  X_2[row,] = X_2[row,] + rep( -5, p )
}

# Create data for class 3:
X_3 = matrix( rnorm(n*p), nrow=n, ncol=p )
for( row in 1:n ){
  X_3[row,] = X_3[row,] + c( rep( 5, p/2 ), rep( -5, p/2 ) )
}

X <- rbind(X_1, X_2, X_3)
Xs <- scale(X) 

kmean.out = kmeans(Xs, centers=3, nstart=25)
fviz_cluster(kmean.out, Xs, ellipse.type = "norm")



Solution to Task 1d
Click for solution

It is highly likely that the two optimal \(K\)s coincide, because the signal is now stronger (i.e. the clusters are clearly separated, even by eye).
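A minimal way to check this (a sketch, not part of the original solution) is to recompute the average silhouette width for \(K = 2, \ldots, 6\) on the new scaled data:

library(cluster)
sil.width <- sapply(2:6, function(k){
  km <- kmeans(Xs, centers=k, nstart=25)
  # average silhouette width for this value of k
  mean(silhouette(km$cluster, dist(Xs))[, "sil_width"])
})
plot(2:6, sil.width, type="b", pch=19, xlab="k", ylab="silhouette width")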


Exercise 2

Start by reading and exploring the UK elections dataset:

ele <- read.csv('UK_elections.csv')

The number of variables in this data set (58!) is too high to explore them all at once (unless they all belonged to the same low-dimensional vector space). The data set is also very diverse, containing a range of socio-economic variables (e.g. car ownership, rental status, SeC, work seeking) and demographic variables (age, cohabitation, lone parents, household size) on top of the election results. The most obvious first step is to perform a cluster analysis on a subset of variables, for example the results of the 2017 election.

# We will create an object that contains only 'Name' (of the constituency) 
# and number of votes per party
ele_sub <- ele[,c(1,6:13)] 

Task 2a

For each party in each constituency, convert raw numbers of votes into a percentage share of votes, to take account of differences in population between constituencies.

Solution to Task 2a
Click for solution


# Calculate how many total votes were cast in each constituency 
ele2017_total <- rowSums(ele_sub[,2:9])
# Calculate what percentage of votes went to each party in each constituency
percent2017 <- as.data.frame(apply(ele_sub[,2:9], 2, function(x) (x/ele2017_total)*100))

The resulting data shows that two parties (Conservative and Labour) dominate the elections in most constituencies. It makes sense to start our cluster analysis using only these two numerical variables.
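A quick way to check this claim (a sketch, not part of the original solution) is to look at the average vote share per party across constituencies:

# Average share of the vote per party, across all constituencies
sort(round(colMeans(percent2017), 1), decreasing = TRUE)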

Task 2b

Compute the optimal number of clusters and perform k-means cluster analysis on the data set where only the Labour and Conservative votes are considered.

Solution to Task 2b
Click for solution


We start by analyzing the optimal number of clusters:

library(factoextra)
perc17.scaled <- scale(percent2017[,(1:2)])
par(mfrow=c(1, 2))
fviz_nbclust(perc17.scaled, kmeans, method = "wss")

fviz_nbclust(perc17.scaled, kmeans, method = "silhouette")+
  labs(title = "K-means")

There seems to be little doubt that 2 is the ideal number of clusters. Let us see what they look like in the data space.

km.res <- kmeans(perc17.scaled, 2, nstart = 25)
# Visualise kmeans clustering
fviz_cluster(km.res, perc17.scaled, ellipse.type = "norm") + 
  geom_abline(intercept = 0.3, slope = 1.1, col="blue")
Interpretation is not too difficult here: cluster 1 corresponds to constituencies where Conservative votes are higher, while cluster 2 corresponds to constituencies where Labour votes are higher.
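Keep in mind that kmeans() numbers its clusters arbitrarily, so which cluster ends up labelled 1 or 2 can change between runs. A quick look at the cluster centers and sizes (a minimal sketch) makes the interpretation concrete:

# Centers on the scaled vote shares: one cluster is high on the Conservative
# share, the other on the Labour share
km.res$centers
km.res$size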

Clustering on Labour and Conservative votes is not particularly useful, as there is quite a simple relationship between them: when one goes up, the other will usually go down. So before we assess our results further, we will include the other political parties.

library(factoextra)
percent17.scaled <- scale(percent2017)
par(mfrow=c(1, 2))
fviz_nbclust(percent17.scaled, kmeans, method = "wss")

fviz_nbclust(percent17.scaled, kmeans, method = "silhouette")+
  labs(title = "K-means")

In this case, there does not seem to be a clear optimal number of clusters as above: there is no clear elbow, and only a faint indication that 8 is the number that optimises the criterion. This is an example of a situation where hierarchical clustering might be useful, but we have not seen how to perform that yet, so keep this in mind for the next workshop.

Task 2c

Perform k-medoids cluster analysis on the data set where all the parties are considered, using 4, 5 and 6 clusters. For each output, compare the cluster sizes and centers.

Solution to Task 2c
Click for solution


library(cluster)
pam.res4 <- clara(percent17.scaled, k=4)
pam.res5 <- clara(percent17.scaled, k=5)
pam.res6 <- clara(percent17.scaled, k=6)

To find the size of the clusters, look at the clusinfo component of the output:

pam.res4$clusinfo
##      size  max_diss  av_diss isolation
## [1,]  205 10.322825 2.025727  5.466321
## [2,]  202 16.371215 2.041803 10.930066
## [3,]  165  9.391773 1.763984  6.270316
## [4,]    1  0.000000 0.000000  0.000000
pam.res5$clusinfo
##      size  max_diss  av_diss isolation
## [1,]  196 10.227159 1.876164  4.348961
## [2,]  272 16.399774 1.682666  6.973782
## [3,]   68  4.090901 1.873592  1.282469
## [4,]   36  7.489162 1.927639  1.620978
## [5,]    1  0.000000 0.000000  0.000000
pam.res6$clusinfo
##      size  max_diss  av_diss isolation
## [1,]  194  6.518499 1.786001 2.7719036
## [2,]  270 16.399774 1.616775 6.9737821
## [3,]   68  4.090901 1.873592 1.2824688
## [4,]    7  5.205576 3.089669 0.5089953
## [5,]   33  2.794454 1.586896 0.5651787
## [6,]    1  0.000000 0.000000 0.0000000

To look at the centers, use the i.med component of the output, which gives the row index of each medoid:

pam.res4$i.med
## [1] 569 203 450  76
pam.res5$i.med
## [1] 166 351 511 106  76
pam.res6$i.med
## [1] 166 351 511   8 116  76

At five clusters the situation seems to stabilise (four of the five centers are shared with the six-cluster analysis). To understand the clusters, we can look at the centers as prototypical constituencies for the corresponding cluster.

pam.res5$medoids
##          X17CON     X17LAB     X17LIB    X17UKIP    X17Green    X17NAT
## [1,] -1.0488208  1.2798079 -0.5524193 -0.1774475 -0.10718192 -0.191713
## [2,]  0.6327855 -0.2765234 -0.4204159  0.3319728 -0.04995112 -0.191713
## [3,]  0.8549990 -0.9475646  0.3063692  0.2632802  0.37454950 -0.191713
## [4,] -0.3887219 -1.4057111  3.8959074 -0.9937119 -0.27287256 -0.191713
## [5,] -1.6666415 -0.9340065 -0.8322616 -0.4853175 19.58942683 -0.191713
##         X17MIN     X17OTH
## [1,] -0.215237 -0.6053822
## [2,] -0.215237 -0.6053822
## [3,] -0.215237  2.3883482
## [4,]  0.179807  0.2555066
## [5,] -0.215237  0.9016834
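Since the medoids are actual constituencies, we can also pull out their names to see which areas are prototypical for each cluster (a sketch; it assumes the rows of percent2017 are still aligned with ele$Name, which holds because no rows were dropped):

# Names of the five medoid constituencies
ele$Name[pam.res5$i.med]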

From this information we gather that:

  • Cluster 1 seems to be associated with Labour voters
  • Cluster 2 seems to be associated with a mix of Conservative and UKIP voters
  • Cluster 3 seems to be associated with Conservative voters not counted in cluster 2 (likely “moderate conservatives”)
  • Cluster 4 seems to be associated with Liberal Democrat votes
  • Cluster 5 seems to be associated with Green votes (a peculiar one, because it consists of a single point and is therefore much less informative than the others).

To plot the result of our cluster analysis, fviz_cluster() is not an ideal command (PC1 and PC2 do not encode enough variability). With a little effort, we can figure out how to do this by hand:

df <- as.data.frame(percent17.scaled)

# Plot results
plot(df$X17CON, df$X17LAB, type="n", xlab="Conservative", ylab="Labour")
text(x=df$X17CON, y=df$X17LAB, labels=ele$Name,    col=pam.res5$cluster+1, cex=0.75)
legend("topright", c("1","2","3","4","5"), col=c(2:6), lwd=5, cex=0.5)
abline(v=0, col=1, lwd=2)
abline(h=0, col=1, lwd=2)

There are too many data points to make this readable, so an appropriate selection might be a wiser choice:

df <- as.data.frame(percent17.scaled)[(76:111),]

# Plot results
plot(df$X17CON, df$X17LAB, type="n", xlab="Conservative", ylab="Labour")
text(x=df$X17CON, y=df$X17LAB, labels=ele$Name[76:111], col=pam.res5$cluster[76:111]+1, cex=0.75)
legend("topright", c("1","2","3","4","5"), col=c(2:6), lwd=5, cex=0.5)
abline(v=0, col=1, lwd=2)
abline(h=0, col=1, lwd=2)

Exercise 3

Finding clusters is a good thing in itself: it allows us to uncover hidden patterns in a data set, as long as it is done in a reasonable way (e.g. encoding enough variance or using prior knowledge that the observations are naturally split into groups). However, once appropriate clusters are discovered, they can be used for additional data exploration. In this exercise you will see how to apply the results of the cluster analysis obtained in Exercise 2 to analyze your data in a more structured way.

Task 3a

Perform the k-medoids cluster analysis shown above using the 2019 election data and comment on possible differences with respect to the 2017 data.

Solution to Task 3a

Click for solution


You first prepare the data:

ele.19 <- ele[,c(1,15:22)] 
ele2019_total <- rowSums(ele.19[,2:9])
# Calculate what percentage of votes went to each party in each constituency
percent2019 <- as.data.frame(apply(ele.19[,2:9], 2, function(x) (x/ele2019_total)*100))

Then scale it and perform k-medoids clustering:

library(factoextra)
library(cluster)
percent19.scaled <- as.data.frame(scale(percent2019))
par(mfrow=c(1, 2))
fviz_nbclust(percent19.scaled, kmeans, method = "wss")

fviz_nbclust(percent19.scaled, clara, method = "silhouette")+
  labs(title = "K-medoids")

Again, any number from two to five seems to be a reasonable choice (seven clusters look fine too, but the analysis risks becoming cumbersome).

pam.res2 <- clara(percent19.scaled, k=2)
pam.res3 <- clara(percent19.scaled, k=3)
pam.res4 <- clara(percent19.scaled, k=4)
pam.res5 <- clara(percent19.scaled, k=5)

Size of clusters and centers

pam.res2$clusinfo
##      size max_diss  av_diss isolation
## [1,]  277 11.86485 2.080532   3.95653
## [2,]  296 17.49336 1.869738   5.83345
pam.res3$clusinfo
##      size max_diss  av_diss isolation
## [1,]  298 11.87253 2.083716  4.419493
## [2,]  274 13.07943 1.728457  4.868756
## [3,]    1  0.00000 0.000000  0.000000
pam.res4$clusinfo
##      size  max_diss  av_diss isolation
## [1,]  269 11.872530 2.068149  4.706319
## [2,]  194 13.055427 1.497947  6.761947
## [3,]  109  9.383266 1.586106  4.859983
## [4,]    1  0.000000 0.000000  0.000000
pam.res5$clusinfo
##      size  max_diss  av_diss isolation
## [1,]  264  9.000684 1.921623 3.5679077
## [2,]  194 13.055427 1.497947 6.7619474
## [3,]  109  9.383266 1.586106 4.8599829
## [4,]    5  4.849306 2.438202 0.4084476
## [5,]    1  0.000000 0.000000 0.0000000
pam.res2$i.med
## [1] 192  73
pam.res3$i.med
## [1] 460 298  76
pam.res4$i.med
## [1] 460 454 349  76
pam.res5$i.med
## [1] 460 454 349 164  76

This situation looks even more stable than the one for 2017: from three clusters onwards, adding one cluster essentially splits a big cluster into two. By contrast, going from two to three clusters, a new cluster containing a mix of the previous two seems to emerge (hinting that a third party with a profile between Labour and Conservative was not accounted for in the “two clusters scenario”).

Now it is reasonable to choose a number of clusters (from 3 to 5) and explore their features. A good choice is 4, since in the five-cluster scenario there is a cluster of size 5 that we might want to ignore (it might be a good idea to remove observation 76 too, but you have to do that manually and it doesn't change the final outcome; see the sketch below).
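If you do want to drop that single-constituency cluster, a minimal sketch (not part of the original solution) is to remove observation 76 before re-running clara():

# Remove observation 76 and refit the four-cluster solution
percent19.reduced <- percent19.scaled[-76, ]
pam.res4b <- clara(percent19.reduced, k = 4)
pam.res4b$clusinfo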

pam.res4$medoids
##          X19CON     X19LAB     X19LIB  X19Brexit   X19Green     X19NAT
## [1,] -0.6460132  0.8111299 -0.3596244  0.2243878  0.1197174 -0.1771745
## [2,]  1.0283159 -0.7612310  0.0335357 -0.5994159  0.3101897 -0.1771745
## [3,]  1.0500274 -0.7673488 -0.1296122 -0.5994159  0.1141432 -0.1771745
## [4,] -1.7810188 -0.7287781 -1.1196533 -0.2901117 17.3546892 -0.1771745
##          X19MIN     X19OTH
## [1,] -0.1498742 -0.2407367
## [2,] -0.1498742 -0.7083936
## [3,] -0.1498742  1.2052729
## [4,] -0.1498742  0.4192323

According to this analysis:

  • Cluster 1 seems to be associated with Labour voters
  • Cluster 2 seems to be associated with a mix of Conservative, Liberal and Green voters
  • Cluster 3 seems to be associated with Conservative voters not counted in cluster 2 (possibly “social conservatives”, since they are farther away from the Liberal Democrats)
  • Cluster 4 seems to be associated with Green votes (the usual Brighton Pavilion constituency).

Plotting some of the labelled observation gives an idea of the positioning of these clusters.

df <- as.data.frame(percent19.scaled)[(76:111),]

# Plot results
plot(df$X19CON, df$X19LAB, type="n", xlab="Conservative", ylab="Labour")
text(x=df$X19CON, y=df$X19LAB, labels=ele.19$Name[76:111], col=pam.res4$cluster[76:111]+1, cex=0.75)
legend("topright", c("1","2","3","4","5"), col=c(2:6), lwd=5, cex=0.5)
abline(v=0, col=1, lwd=2)
abline(h=0, col=1, lwd=2)

Several things can be said about the differences with respect to 2017. The most interesting might be the different divide among conservative voters. This is mainly due to the Brexit Party being much less popular in 2019 than UKIP was in 2017.

A word of caution: the cluster associated with Labour votes in 2019 is much bigger than the one in 2017. This doesn't mean that Labour performed better, just that the constituencies with a higher percentage of Labour votes were more similar to each other, so the algorithm clustered them together. Those red labels do not correspond to seats won by Labour by any means!
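One way to see this concretely (a sketch, not part of the original solution) is to cross-tabulate each constituency's cluster against the party with the largest vote share in 2019:

# Party with the largest 2019 vote share in each constituency
winner19 <- colnames(percent2019)[max.col(as.matrix(percent2019))]
table(winner = winner19, cluster = pam.res4$cluster)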

Clustering gives you a way to associate a new categorical variable (the cluster type) with your observations. This, in turn, allows us to use Correspondence Analysis to check the association of the cluster type with other categorical variables. Unfortunately, the original UK_elections data set contains only numerical variables. Fortunately, we can easily turn some of these into categorical ones.

Task 3b

Choose a variable whose association with particular clusters you would like to examine (the solution will use X.seekingwork, which relates to people seeking work, and X.Students, which relates to the student population). Change the value of each observation for that variable from numerical to categorical. A good way to do this is to put observations into buckets (e.g. "Low", "Average", "High" number of students) according to their numerical values.

Solution to Task 3b
Click for solution


You need to add a new variable to the ele data frame. You can do this using the within() command.

Here is an example that uses the X.seekingwork variable:

ele <- within(ele, {   
  seekingwork.cat <- NA # need to initialize variable
  seekingwork.cat[X.seekingwork < quantile(ele$X.seekingwork)[2]] <- "Low"
  seekingwork.cat[X.seekingwork >= quantile(ele$X.seekingwork)[2] & X.seekingwork < quantile(ele$X.seekingwork)[3]] <- "Below Average"
  seekingwork.cat[X.seekingwork >= quantile(ele$X.seekingwork)[3] & X.seekingwork < quantile(ele$X.seekingwork)[4]] <- "Above Average"
  seekingwork.cat[X.seekingwork >= quantile(ele$X.seekingwork)[4]] <- "High"
   } )
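A more compact alternative (a sketch, not part of the original solution) is cut() with quantile breaks, which produces the same four buckets in one step; note that the boundary handling differs slightly from the manual version (intervals here are closed on the right):

# Hypothetical helper column seekingwork.cat2, built with cut() and quartile breaks
ele$seekingwork.cat2 <- cut(ele$X.seekingwork,
  breaks = quantile(ele$X.seekingwork, probs = c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("Low", "Below Average", "Above Average", "High"),
  include.lowest = TRUE)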

The same can be done with the X.Students variable:

ele <- within(ele, {   
  Students.cat <- NA # need to initialize variable
  Students.cat[X.Students < quantile(ele$X.Students)[2]] <- "Low"
  Students.cat[X.Students >= quantile(ele$X.Students)[2] & X.Students < quantile(ele$X.Students)[3]] <- "Below Average"
  Students.cat[X.Students >= quantile(ele$X.Students)[3] & X.Students < quantile(ele$X.Students)[4]] <- "Above Average"
  Students.cat[X.Students >= quantile(ele$X.Students)[4]] <- "High"
   } )

We are now almost ready for the correspondence analysis.

Task 3c

Create a contingency table that cross-tabulates the clusters with the buckets created earlier (e.g. "Low", "Below Average", "Above Average", "High" number of students). Finally, perform CA to analyze the association between the cluster type and your chosen variable.

Solution to Task 3c
Click for solution


First, we create the contingency table of student buckets against cluster membership:

table(ele$Students.cat, pam.res4$cluster)
##                
##                   1   2   3   4
##   Above Average  78  42  23   0
##   Below Average  49  68  26   0
##   High          119  16   8   1
##   Low            23  68  52   0

Then we perform CA and plot the results.

library(FactoMineR)
library(factoextra)
tab <- table(ele$Students.cat, pam.res4$cluster)
student.ca <- CA(tab[,-4])

The correspondence analysis highlights an association between conservative votes (especially cluster 3, the socially conservative constituencies) and a low number of students registered in the area, while progressive votes are concentrated in areas with a high student population.

You can do the same with any variable of your choice, of course:

tab2 <- table(ele$seekingwork.cat, pam.res4$cluster)
seekingwork.ca <- CA(tab2[,-4])
Similarly, progressive votes are higher in areas with high unemployment rates, while conservative votes are higher in areas with low unemployment rates. Interestingly, there is not a big difference between socially conservative and conservative votes here.