To begin with, we’ll focus on using standard techniques to explore a single variable at a time. Specifically:
These techniques are quite simple, so our focus is on using them effectively to learn about data features and how to do so using R.
There are a number of graphics packages in R that we could use to visualise our data:
* base R - covers most standard statistical visualisations
* ggplot2 - GGPlot and related packages provide more modern graphics, but with an unusual syntax that can be more difficult to learn
* plotly - similar to ggplot, but a bit easier to use
We will focus on the base R functions, as those are always available. You should experiment with the other packages too, but be aware that they work differently.
library(MASS)
data(Boston)
This data set contains various information on housing values and related quantities for the 506 suburban areas in Boston. The main interest is in the ‘median values of owner-occupied homes’, but there are 14 variables to explore here.
Looking first at the median housing value, we can produce the histogram below. Some obvious features of note are:
Choosing the width of the bars can substantially affect the detail of the histogram. If we choose too few bins, we can obscure key features by over-smoothing the data. Alternatively, if we have too many bins, we can introduce so much noise that it obscures the more general features and patterns. Unfortunately, the only way to find a good compromise is to experiment: R and other software will default to a ‘best guess’, but this often needs adjusting. The histograms below show the same data, but using bar widths of 5, 2.5, and 1 unit respectively. We’ll see more sophisticated methods for smoothing the data later.
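The three bar widths can be tried in base R along the following lines. This is only a sketch: it assumes that the median housing value is the `medv` column of `Boston` and that the range 0 to 50 covers it, and the side-by-side layout is our own choice.

```r
library(MASS)   # provides the Boston data
data(Boston)

# Assumption: medv (median home value) lies between 0 and 50
stopifnot(min(Boston$medv) >= 0, max(Boston$medv) <= 50)

# Three panels side by side, one per bar width
op <- par(mfrow = c(1, 3))
for (w in c(5, 2.5, 1)) {
  hist(Boston$medv,
       breaks = seq(0, 50, by = w),   # bar edges every w units
       main = paste("Bar width", w),
       xlab = "Median value")
}
par(op)  # restore the previous layout
```

Passing a vector of equally spaced edges to `breaks` fixes the bar width exactly, whereas a single number is only treated as a suggestion.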
With 14 variables in the data set, we could inspect the histograms of each of the variables. We could do this one-at-a-time, or arrange them in a grid or matrix as below:
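A grid of histograms can be sketched in base R with `par(mfrow = ...)`. This is not necessarily how the figure in these notes was produced; the grid size and the reduced margins are our own assumptions.

```r
library(MASS)
data(Boston)

# Choose a square grid big enough for every column (14 variables -> 4 x 4)
n_side <- ceiling(sqrt(ncol(Boston)))
op <- par(mfrow = c(n_side, n_side), mar = c(2, 2, 2, 1))  # small margins

# One histogram per variable, titled with the variable name
for (v in names(Boston)) {
  hist(Boston[[v]], main = v, xlab = "", ylab = "")
}
par(op)  # restore the previous layout
```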
The other variables in the data set show a variety of the features discussed above that we may want to investigate further:
In a full analysis, we would want to look at the relationships between the variables to determine how the features observed relate to each other. We’ll return to this later…
library(MASS)
data(geyser)
The geyser data set from the MASS package contains 299 observations of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The variables are:
* duration - Length of eruption in mins
* waiting - Waiting time for this eruption in mins
To draw a histogram, we use the hist function:
hist(geyser$waiting)
Here we can see that the data appear to come in two groups: one group with smaller values (shorter waiting times), and one group with larger values (longer waiting times).
We can add a rug plot to an existing histogram to add more detail and to show where the individual data values fall inside the bars:
hist(geyser$waiting)
rug(geyser$waiting)
Now, each individual mark on the horizontal axis shows where we have observed a data value.
Histograms can be customised in many ways, but the most important one is changing the configuration of the bars drawn. This is controlled by the breaks parameter. We can set breaks to a number to indicate an approximate number of bars to draw:
hist(geyser$waiting,breaks=20)
Or we can be more specific and state where each individual bar should begin and end by listing the breakpoints of the bars as a vector:
# breakpoints must span the full range of the data (43 to 108)
hist(geyser$waiting, breaks=c(30,45,47,50,52,55,60,75,80,85,95,110))
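When `breaks` is a single number, R treats it as a suggestion and picks "pretty" breakpoints near it. A small sketch of how to check what was actually used is to ask `hist` for its result without drawing; the variable name `h` is our own.

```r
library(MASS)
data(geyser)

# plot = FALSE returns the histogram object without drawing it
h <- hist(geyser$waiting, breaks = 20, plot = FALSE)

h$breaks          # the bar edges R actually chose
length(h$counts)  # the number of bars actually drawn
sum(h$counts)     # every observation falls in exactly one bar
```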
Additionally, almost all plot functions can take the following arguments to customise the plot:
* xlab, ylab - sets the x and y axis labels
* main - sets the main title
* xlim, ylim - set x and y axis limits, e.g. c(0,10)
* col - sets the plot colour(s)
In R, the fivenum function gives the standard 5-number summary:
fivenum(geyser$waiting)
## [1] 43 59 76 83 108
The summary function adds a 6th number (the mean):
summary(geyser$waiting)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   59.00   76.00   72.31   83.00  108.00
A boxplot (or box-and-whisker plot) is constructed from the 5-number summary by:
* Drawing a box with lower boundary at \(Q_1\) and upper boundary at \(Q_3\).
* Drawing a line inside the box at the median (\(Q_2\)).
* Drawing lines (whiskers) from the edges of the box to the most extreme data point that is within \(1.5\times IQR\) of the edge of the box (often the minimum and maximum values).
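The whisker rule can be checked numerically. The sketch below uses the hinges returned by `fivenum` as \(Q_1\) and \(Q_3\); the names `lower_fence`, `upper_fence`, `whiskers` and `outliers` are our own and are not part of any R function.

```r
library(MASS)
data(geyser)
x <- geyser$waiting

q   <- fivenum(x)            # min, Q1, median, Q3, max
iqr <- q[4] - q[2]           # height of the box
lower_fence <- q[2] - 1.5 * iqr
upper_fence <- q[4] + 1.5 * iqr

# Whiskers stop at the most extreme points inside the fences
whiskers <- c(min(x[x >= lower_fence]), max(x[x <= upper_fence]))
# Points beyond the fences would be drawn individually
outliers <- x[x < lower_fence | x > upper_fence]

whiskers
length(outliers)
```

For the waiting times no values fall outside the fences, so the whiskers reach the minimum and maximum of the data.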
We can draw a boxplot by using the boxplot function on a vector of data values:
boxplot(geyser$waiting)
Or we can pass all the columns of a data set to boxplot to draw everything at once:
boxplot(geyser)
Now, all variables are shown together on a common axis scale. This can be useful if all the variables take values of a similar size, but - as we see here - when the variables are quite different it can obscure the features of some variables by drawing everything together.
Optional arguments for boxplot include:
* horizontal - if TRUE the boxplots are drawn horizontally rather than vertically.
* varwidth - if TRUE the boxplot widths are drawn proportional to the square root of the sample sizes, so wider boxplots represent more data. Usually, though, every column of the data set has the same number of data points, so this is not often very helpful.
To show the relationship between a histogram and a boxplot, we have a data set comprising the lengths (mm) of 100 cuckoo eggs. Drawing the histogram and boxplot together, we can see that the histogram gives far more detail on the shape of the distribution, while the boxplot is more of a summary visualisation. The median is indicated in red and the upper/lower quartiles in green, which shows how these quantities align between the plots. Note that the smallest value in the data is flagged as an outlier and drawn separately as a circle on the boxplot.
One advantage of the boxplot is that, because it is a simple summary plot, it is much easier to use when comparing many variables at once. For example, the plots below are boxplots of the heights in inches of the singers in the New York Choral Society in 1979. The data are grouped according to the voice part each singer sings in the choir. The vocal range for each voice part increases in pitch according to the following order: Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1. We can see immediately that the lowest pitch voices are associated with the taller singers and the higher pitches with shorter singers, which makes a lot of intuitive sense. There will also be a strong correspondence with gender here, though that information is not recorded.
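Grouped boxplots like these can be drawn with the formula interface of `boxplot`. The sketch below assumes that the `singer` data set in the lattice package (columns `height` and `voice.part`) is the choir data described here; the notes do not say which data set was used.

```r
# Assumption: lattice's `singer` data is the choir data described above
library(lattice)
data(singer)

# One box per voice part, all on a common height axis
boxplot(height ~ voice.part, data = singer,
        xlab = "", ylab = "Height (inches)",
        las = 2)   # rotate the axis labels so all eight names fit
```

The formula `height ~ voice.part` splits the heights by the levels of the grouping factor, so the boxes appear in the order of those levels.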
We should examine a single boxplot for the following features:
We should examine several boxplots for the following features:
Many statistical methods require our data be approximately Normally distributed - but how can we tell whether the approximation is reasonable?
Normal-quantile plots provide a simple and informal way of doing this, without having to do a formal hypothesis test — though that may be a natural next step.
A Normal quantile plot (or Q-Q plot) can be used to informally assess the normality of a data set. We construct the plot as follows:
The basic idea is that these plots have the property that plotted points for Normally distributed data should fall roughly on a straight line.
This is because \(x_{(k)}\) are the sample quantiles of our data, and \(Z_k\) are the theoretical quantiles of a Normal distribution. If our sample distribution is approximately normal, these pairs of values will be in agreement.
Systematic deviations from the straight line indicate non-normality.
We don’t need the points to lie on a perfect straight line. We often use the ‘fat pen test’: if the points are covered by placing a fat pen over the top, then that’s enough to conclude approximate normality!
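The construction can be sketched by hand in R. The choice of plotting positions is an assumption on our part: we use `ppoints`, which is what `qqnorm` uses internally, and the variable names are our own.

```r
x <- mtcars$mpg            # any numeric vector would do
n <- length(x)

x_sorted <- sort(x)        # sample quantiles x_(1) <= ... <= x_(n)
p <- ppoints(n)            # assumed plotting positions in (0, 1)
z <- qnorm(p)              # theoretical standard Normal quantiles Z_k

# Plot each theoretical quantile against the matching sample quantile
plot(z, x_sorted,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# The same coordinates are computed by qqnorm
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), z)
```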
Let’s inspect the cuckoo egg data for normality. We can draw a histogram first, and compare its shape to a normal curve (blue line). While there looks to be approximate agreement (and we only ever need approximate Normality…), we see some divergence for small values of the data.
Drawing the Normal quantile plot confirms these observations - things look reasonably close to a straight line, but deviate from Normality for small values. However, this would probably be enough to pass our ‘fat pen test’.
In R, we can use the qqnorm function to draw a Normal quantile plot of a single variable. Calling qqline afterwards adds the theoretical straight line for comparison:
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)
For this variable (miles per gallon of various cars) the points follow the line through the middle of the data but drift away from it in the tails, most visibly for the largest values, which suggests some departure from Normality.
To illustrate what happens when our data are very non-Normal, consider this data set of ‘monthly mean relative sunspot numbers’ recorded from 1749 to 1983. The data are counts of a quantity that is usually quite small in magnitude, so the values are concentrated near zero with an ‘invisible wall’ at 0, since we cannot observe negative counts! This gives rise to a heavily skewed distribution, as we see in the histogram and boxplot below. The Normal quantile plot (right) now shows strong curvature, not the straight-line feature we would expect if the data were Normally distributed.
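The three plots can be reproduced along the following lines. We assume here that R's built-in `sunspots` series is the data being described, and the one-row layout is our own choice.

```r
# Assumption: R's built-in `sunspots` series is the data described above
x <- as.numeric(sunspots)

op <- par(mfrow = c(1, 3))   # three panels in one row
hist(x, main = "Histogram", xlab = "Sunspot number")
boxplot(x, main = "Boxplot")
qqnorm(x)                    # Normal quantile plot
qqline(x)                    # reference straight line
par(op)                      # restore the previous layout
```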
A stem and leaf plot presents the numerical values of the data in a similar form to a histogram.
stem(geyser$waiting)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    4 | 3
##    4 | 577888889999999
##    5 | 00000000000011111222223333333444444444
##    5 | 5556677777777788888999
##    6 | 0000001112222234
##    6 | 5555555668889999
##    7 | 01111122222233333344444444
##    7 | 5555555556666666677777777778888888888888888899999999999
##    8 | 00000000000001111111111112222222233333344444444444
##    8 | 5555555666667777777777777788888889999999
##    9 | 0011222333333334
##    9 | 668
##   10 | 
##   10 | 8
This plot resembles a sideways histogram, only it shows extra information on the values within each bar.
A stripplot is similar again, but displays the data as points rather than as numerical values:
stripchart(geyser$waiting, method='stack')
By default, it draws a hollow box for each data point, which can make it difficult to read. Changing the shape of the points with the pch (plot character) argument can help improve readability:
stripchart(geyser$waiting, method='stack', pch=16)
A beeswarm plot is similar to a stripplot, but uses various techniques to separate nearby points so that each point remains visible. It also draws the plot symmetrically about a central axis, rather than stacking up points from a baseline.
library(beeswarm)
beeswarm(geyser$waiting,horizontal = TRUE)