In this lecture we’ll focus on expanding our techniques to find structure among multiple variables. Specifically, we’ll consider the first of the following aspects of working with multivariate data:
For example, let’s consider the Weight and Height of the 10,384 athletes competing in the London 2012 Olympics:
library(VGAMdata)
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:AER':
##
## tobit
## The following object is masked from 'package:lmtest':
##
## lrtest
## The following object is masked from 'package:car':
##
## logit
##
## Attaching package: 'VGAMdata'
## The following object is masked _by_ '.GlobalEnv':
##
## crime.us
data(oly12)
We can draw a scatterplot of these variables using the
plot
function:
plot(x=oly12$Height, y=oly12$Weight)
Note the choice of which variable is drawn as the horizontal coordinate and which is the vertical.
The plot
function accepts the usual arguments to
customise the graphic.
xlab
, ylab
, main
- axis
labels and main titlexlim
, ylim
- axis ranges, a vector of
length twopch
- changes the plot character, integercol
- changes the point colour, either specifying one
colour for all points or one colour for each pointcex
- relative point size, defaults to 1In particular, the plot character (pch
) can be changed
to something more solid to give a clearer picture.
plot(x=oly12$Height,y=oly12$Weight,xlab='Height', ylab='Weight',main='',pch=20)
So, what do features do we see in this plot:
Overplotting is a problem in a scatterplot where the drawn data points overlap one another. This typically occurs when there are a large number of data points and/or a small number of unique values in the dataset.
As multiple stacked points look the same as a single point, this makes it difficult to identify areas of high density. In the scatterplot above, we cannot tell if there is one person with a weight over 200 or one hundred.
Possible solutions include:
We can apply transparency to the scatterplot of the Olympic athletes as follows:
library(scales)
plot(x=oly12$Height,y=oly12$Weight,xlab='Height', ylab='Weight',main='',pch=20,
col=alpha('black',0.2)) ## this tells R to use 20% transparent black for each point
Now the areas of high density in the main cloud of data stand out as darker, and the more unusual values fade out.
library(MASS)
data(geyser)
The Old Faithful geyser in Yellowstone National Park, Wyoming, USA which a very regular pattern of eruption.
Consider the duration of the eruptions and the waiting time until the next eruption.
plot(y=geyser$waiting,x=geyser$duration, pch=16, xlab='duration',ylab='waiting')
What can we see?
Download data: movies
Most of the data sets we’ve seen so far are not very big. Consider instead the movies data set, which contained 24 variables and 58 788 cases corresponding to various attributes of different movies taken from IMDB. Let’s focus on two of the variables: the average IMDB user rating, and the number of users who rated that movie on IMDB.
plot(x=movies$votes, y=movies$rating, ylab="Rating", xlab="Votes")
What do we see here?
We can extract a lot of information from a scatterplot!
Patterns and trends observed within a scatterplot can often be explained by the action of other variables.
In particular, other categorical variables may induce different behaviours for each group (e.g. male and female).
Consider the average values of fertility (number of children per mother) versus percentage of contraceptors among women of childbearing age in developing nations around 1990.
library(carData)
data(Robey)
plot(y=Robey$tfr, x=Robey$contraceptors, pch=20,
ylab='Total fertility rate (children per woman)',
xlab='Percent of contraceptors among married women of childbearing age')
The fertility data set also contains a third variable representing the region of the world, which we can indicate in colour.
plot(y=Robey$tfr, x=Robey$contraceptors, pch=20, col=Robey$region,
ylab='Total fertility rate (children per woman)',
xlab='Percent of contraceptors among married women of childbearing age')
legend(x='topright',pch=20,lwd=NA, col=1:nlevels(Robey$region), legend=levels(Robey$region))
We could have used a different symbol for each region, but this is generally less effective than colour:
plot(y=Robey$tfr, x=Robey$contraceptors, pch=as.numeric(Robey$region),
ylab='Total fertility rate (children per woman)',
xlab='Percent of contraceptors among married women of childbearing age')
legend(x='topright',lwd=NA, pch=1:nlevels(Robey$region), legend=levels(Robey$region))
The standard measure of the strength of linear
relationship between two variables is the correlation
coefficient . Assuming that we have \(n\) pairs of observations \((x_1,y_1)\), …,\((x_n,y_n)\), the correlation coefficient is
defined to be: \[
r=\frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
\] In R, we can use the cor
function to compute
this.
The correlation can take any value between \(-1\) and \(+1\), inclusive, with positive values representing positive correlation (as one variable increases, so does the other), and negative values representing negative correlation (as one variable increases, the other decreases). A correlation of \(0\) means uncorrelated: no association (as one variable increases, the other doesn’t consistently increase or decrease).
Thus the value of \(r\) can range from \(r=-1\) (perfect negative correlation) through weaker negative correlation until \(r=0\) (no correlation) through weak positive correlation to \(r=1\) (perfect +ve correlation).
The correlation tells us two things:
For the fertility data:
cor(y=Robey$tfr, x=Robey$contraceptors)
## [1] -0.9203109
The value here is negative reflecting the negative aassociation - the increase in contraception corresponds to a reduction in numbers of children. The value itself is quite large (close to -1) which indicates a strong association - this is reflected by the fact the data are roughly organised along a straight line.
Difficulties with correlation:
Recall, for example, Anscombe’s quartet of data points:
library(datasets)
data(anscombe)
cor(anscombe$x1, anscombe$y1)
## [1] 0.8164205
cor(anscombe$x2, anscombe$y2)
## [1] 0.8162365
cor(anscombe$x3, anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4, anscombe$y4)
## [1] 0.8165214
Their correlations are almost the same (up to 2 decimal places), but the actual patterns of the points are quite different:
par(mfrow=c(2,2),mai = c(0.3, 0.3, 0.3, 0.3))
plot(x=anscombe$x1, y=anscombe$y1,xlab='',ylab='',pch=20,xlim=c(0,20),ylim=c(0,14))
plot(x=anscombe$x2, y=anscombe$y2,xlab='',ylab='',pch=20,xlim=c(0,20),ylim=c(0,14))
plot(x=anscombe$x3, y=anscombe$y3,xlab='',ylab='',pch=20,xlim=c(0,20),ylim=c(0,14))
plot(x=anscombe$x4, y=anscombe$y4,xlab='',ylab='',pch=20,xlim=c(0,20),ylim=c(0,14))
The Olympic athletes data set also could be investigated by using colour to represent Gender. This shows an expected differentiation between the two:
plot(x=oly12$Height,y=oly12$Weight,xlab='Height', ylab='Weight',main='',pch=16,
col=alpha(c('#000000','#ff0000'),0.2)[oly12$Sex])
legend(x='topleft', legend=levels(oly12$Sex), pch=16, col=c(1,2))
Colouring the 42 sport categories is less helpful.
plot(x=oly12$Height,y=oly12$Weight,xlab='Height', ylab='Weight',main='',pch=16,
col=oly12$Sport)
Unfortunately, with many categories the differences between 42 colours become too subtle to detect easily. Colour is only useful when indicating a small number of groups.
In this case, a better solution is to break the data up and draw separate mini scatterplots for each of the 42 sports. This is called a lattice, grid or trellis of plots: