R functions used: hist, boxplot, qqnorm, summary and fivenum.
To begin with, let’s see how to use R to produce standard plots of a variable, namely histograms, boxplots, and quantile (or QQ) plots.
For illustration, let us use the mtcars data set, which contains information on the characteristics of 32 cars.
data(mtcars)
A histogram consists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable. The area of each bar is proportional to the frequency of items found in each class. A histogram is useful when we want to see more detail on the full distribution of the data, and features relating to its shape.
To plot a histogram, we apply the hist function to the data vector. We can extract the mpg (miles-per-gallon) variable from the mtcars data set using the $ operator, and so we can draw a histogram as follows:
hist(mtcars$mpg)
This seems to suggest that we have a peak somewhere between 15 and 20 mpg, and potentially another peak between 30 and 35 mpg - perhaps suggesting groups of ‘fuel efficient’ and ‘fuel inefficient’ cars.
One way to assess if the number of bars in the histogram is appropriate is to show the location of the data points on the horizontal axis. We can add a ‘rug plot’ to our histogram, which marks the positions of the data with lines on the axis:
hist(mtcars$mpg)
rug(mtcars$mpg) ## Note: the 'rug' function draws on top of an existing histogram
Now we can also see where the data fall within the bars!
The default settings of hist
will determine the number
of bars to display algorithmically, and in this case it has drawn only
5. In general, this is probably too few to show any detail, but we don’t
have many data points here. Fortunately, the display of the histogram
can be adjusted by a number of arguments:
* breaks - allows us to control the number of bars in the histogram. breaks can take a variety of different inputs:
    * If breaks is set to a single number, this will be used to suggest the number of bars in the histogram.
    * If breaks is set to a vector, the values will be used to indicate the endpoints of the bars of the histogram.
* freq - if TRUE the histogram shows the simple frequencies or counts within each bar; if FALSE then the histogram shows probability densities rather than counts.

Use the hist function to draw histograms of miles-per-gallon which match those shown above (don’t worry about the labels).
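As a rough illustration of the breaks and freq arguments, the sketch below draws the mpg histogram twice: once with a suggested number of bars, and once with explicit bar endpoints and densities. The object names (h_suggest, edges, h_density) and the bin width of 2.5 are our own choices for illustration, not values taken from the text.

```r
data(mtcars)

# A single number is only a suggestion: R picks 'pretty' breakpoints near it
h_suggest <- hist(mtcars$mpg, breaks = 10)

# A vector gives the exact bar endpoints; it must span the range of the data
edges <- seq(10, 35, by = 2.5)

# freq = FALSE plots densities, so the bar areas sum to 1
h_density <- hist(mtcars$mpg, breaks = edges, freq = FALSE)
rug(mtcars$mpg)
```

hist also returns its breaks, counts and densities invisibly, so the stored objects can be inspected if the picture is unclear.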
R Help: hist
A five-number numerical summary can be computed with the
fivenum
function, which takes a vector of numbers as input.
Here, we compute a five-number summary of the mpg
data in
the mtcars
dataset.
fivenum(mtcars$mpg)
## [1] 10.40 15.35 19.20 22.80 33.90
The values returned are the sample minimum, lower quartile, median, upper quartile and maximum. We can see that the median across all the cars in the dataset is about 20 miles per gallon. This is pretty terrible by modern standards, but the data are from 1974 and the USA.
To add a little more information, the summary function includes the mean of the data for a 6-number summary:
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
The inclusion of the mean can be quite helpful. If the data have an approximately symmetric distribution then the mean and median values should be close, which can be used as a quick check for any potential skewness in the data. Given that the mean is fairly close to the median, there doesn’t appear to be a dramatic amount of skewness in the distribution of MPG.
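A quick way to apply this check is to compute the two values directly. This is a minimal sketch, with mpg as a local copy of the variable:

```r
data(mtcars)
mpg <- mtcars$mpg

# Mean and median side by side
c(mean = mean(mpg), median = median(mpg))

# A positive difference hints at a longer right-hand tail
mean(mpg) - median(mpg)
```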
A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. In many ways, it is simply a direct visualisation of the five number summary constructed above. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers (see the figure below).
Boxplots are created for single variables using the
boxplot
function, but can be used to easily compare many
variables or groups within the data. To draw a boxplot of a single
variable, multiple variables, or all variables in a data frame, we
simply pass the data directly to the boxplot
function:
boxplot(mtcars$mpg)
As the boxplot is based on the simple 5-number summary, it lacks the detail of a histogram. However, we can inspect it for features such as symmetry or skewness - a symmetric distribution will give a boxplot with whiskers of equal length and a centrally-positioned box evenly divided by the median line.
Here, we see the box is slightly off-centre, suggesting some slight skewness. We also note that there are no obvious outliers.
We can also draw boxplots of all the variables in a data set by passing the entire data frame to the boxplot function. However, if the scales of our variables differ substantially then it can be difficult to detect much useful information.
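For instance, passing mtcars itself to boxplot illustrates the problem: variables measured on large scales (such as disp and hp) dominate the axis. The second call, which picks out three columns with closer ranges, is just one possible workaround and is our own suggestion.

```r
data(mtcars)

# One boxplot per column; the large-valued columns squash the rest
boxplot(mtcars)

# Restricting to variables on closer scales is easier to read
similar <- c("drat", "wt", "qsec")
boxplot(mtcars[, similar])
```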
The boxplot is most useful for comparing how a variable behaves in different groups (i.e., the levels of a categorical variable). For example, we can compare the MPG with the number of engine cylinders:
cyl <- factor(mtcars$cyl) # make the 'cyl' variable categorical
boxplot(mtcars$mpg ~ cyl)
Optional arguments for boxplot include:
* horizontal - if TRUE the boxplots are drawn horizontally rather than vertically.
* varwidth - if TRUE the boxplot widths are drawn proportional to the square root of the sample sizes, so wider boxplots represent more data.

R Help: boxplot
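Both options can be combined in a single call. The sketch below is one way to do so; the name b for the returned object is our own.

```r
data(mtcars)

# Horizontal boxplots of mpg by cylinder count, widths reflecting group sizes
b <- boxplot(mpg ~ cyl, data = mtcars, horizontal = TRUE, varwidth = TRUE)

# The invisibly returned list records the number of cars in each group
b$n
```

Inspecting b$n alongside the plot helps confirm that the widest box really is the largest group.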
Histograms leave much to the interpretation of the viewer. A better graphical way in R to tell whether the data is distributed normally is to look at a so-called Normal quantile (also known as a quantile-quantile, or QQ) plot.
With this technique, we plot the quantiles of the data (i.e. the
ordered data values) against the quantiles of a normal distribution. If
the data are normally distributed, then the points of the QQ plot will
lie on a straight line. Deviations from a straight line suggest
departures from the normal distribution. This technique can be applied
to any distribution, though R supports Normal quantile plots
with the qqnorm
function:
qqnorm(mtcars$mpg)
qqline(mtcars$mpg) ## add the straight line for reference
While the middle chunk of the data lie close to the line, the values away from the middle start to deviate. This suggests that the tails of the distribution (i.e. the extremes) don’t match the Normal distribution shape - in this case, as both tails are pointing the same way, we suspect skewness.
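To make the construction concrete, the sketch below builds the same plot by hand. It assumes the plotting positions given by ppoints, which is what qqnorm uses; the names n, theoretical and observed are ours.

```r
data(mtcars)
n <- length(mtcars$mpg)

# Theoretical quantiles of a standard Normal at n evenly spread probabilities
theoretical <- qnorm(ppoints(n))

# Sample quantiles are simply the ordered data values
observed <- sort(mtcars$mpg)

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(mtcars$mpg)  # same reference line as drawn by the earlier call
```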
R Help: qqnorm
Transform the mpg data to make a variable skewed in the opposite direction to mpg - check its histogram and quantile plot. Does it look like you expected?
Download data: galton
Francis Galton famously developed his ideas on correlation and regression using data which included the heights of parents and their children. This galton data set includes data on heights for 928 children and their 205 ‘midparents’. Each ‘midparent’ height is the average of the father’s height and 1.08 times the mother’s height (to adjust for the usual gender differences). Similarly, the daughters’ heights have also been multiplied by 1.08. Note that we have one midparent height for each child, so that many midparent heights are repeated.
The variables are child
and parent
for the
different heights recorded in inches.
Load the data by calling the load function in the console and specifying the path to the file as the argument to the function.

R makes it easy to combine multiple plots into one overall graph, using either the par or layout functions.
With the par
function, we specify the argument
mfrow=c(nr, nc)
to split the plot window into a grid of
\(nr \times nc\) plots that are filled
in by row. For example, to divide the plot window into a 2x2 grid we
call par(mfrow=c(2,2)). The next four plots we draw will
then fill the respective quarters of the plot.
Similarly, for 3 plots in a single column we would call par(mfrow=c(3,1)).
To return to the usual single-plot display, we must call par(mfrow=c(1,1)).
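A 2x2 grid might be filled as in the sketch below. The choice of four plots is ours, and saving the old settings in old is just one convenient way to restore the layout; it assumes the window showed a single plot beforehand.

```r
data(mtcars)

# Split the plot window into 2 rows and 2 columns, saving the old settings
old <- par(mfrow = c(2, 2))

hist(mtcars$mpg)                          # top-left
boxplot(mtcars$mpg)                       # top-right
qqnorm(mtcars$mpg); qqline(mtcars$mpg)    # bottom-left
plot(mtcars$wt, mtcars$mpg)               # bottom-right

# Restore the previous layout (a single plot, under the assumption above)
par(old)
```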
When we don’t want to arrange plots in a simple regular grid, we can
use the layout
function. See the R help for more
details.
One of the difficulties of comparing multiple independent plots is that we need to do more work to ensure consistency of presentation. In particular, we should ensure that our histogram intervals and axis ranges are the same for both plots, as the default presentation will change from plot to plot.
Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include additional options to customise how the plot is drawn (as well as other graphical parameters). We have seen examples of these already with the axis label arguments xlab and ylab. However, we can also customise the following plot features for finer control of how a plot is drawn.
Axis limits
To control the ranges of the horizontal and vertical axes, we can add the xlim and ylim arguments to our original plotting function. To set the horizontal axis limits, we pass a vector of two numbers to represent the lower and upper limits, xlim = c(lower, upper), specifying numerical values for lower and upper, and repeat the same for ylim to customise the vertical axis.
Axis labels
To specify a label for the x- and y-axes we can supply a string to
the xlab
and ylab
arguments. To give a plot a
title, we pass the title as a string to the main
argument.
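Putting the limit and label arguments together might look like the sketch below. The particular limits are arbitrary choices of ours that happen to contain all of the mtcars values.

```r
data(mtcars)

plot(mtcars$wt, mtcars$mpg,
     xlim = c(0, 6),     # horizontal axis from 0 to 6
     ylim = c(0, 40),    # vertical axis from 0 to 40
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "MPG vs Weight")
```

Points falling outside the chosen limits are simply not drawn, so it is worth checking the range of the data first.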
It is easier to compare the shape of distributions in histograms when they are arranged vertically, they use equal horizontal axis limits, and the same binwidths.
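A sketch of this arrangement is given below. It uses the built-in mtcars data rather than galton, and the split into automatic and manual cars, the bin width of 2.5 and all object names are assumptions made purely for illustration.

```r
data(mtcars)

# Two groups to compare: automatic (am == 0) and manual (am == 1) cars
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Common bar endpoints and axis limits shared by both histograms
edges <- seq(10, 35, by = 2.5)

old <- par(mfrow = c(2, 1))   # two plots stacked in one column
hist(auto,   breaks = edges, xlim = range(edges), main = "Automatic", xlab = "MPG")
hist(manual, breaks = edges, xlim = range(edges), main = "Manual",    xlab = "MPG")
par(old)                      # restore the previous layout
```

Because both calls share edges, any difference between the two pictures reflects the data rather than the binning.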
* Use par to set up a column of two plots.
* Draw histograms of child and then parent from galton using:
In interpreting the data, it is worth noting that:
You might expect that if we had the individual un-adjusted heights and the genders of the parents and children we would find that the height data distributions would be neatly bimodal with one peak for females and one for males. They are not. Apparently, height distributions are rarely like that.
Download data: movies
Data science inevitably involves working with large data sets. The effort involved in preparing and making a large dataset usable for analysis should not be underestimated, but thankfully we’re going to look at a dataset “prepared earlier”. This movies data set is reasonably large(ish), containing 24 different attributes of 28819 movies gathered from IMDB. One of the variables is the movie length in minutes, and it is interesting to look at this variable in some detail.
* Load the movies data set and draw a histogram of the data.
* length variable.

Although it’s tempting to dismiss these as simple errors, it is worth checking if possible.
* The variables r1 to r10 give the percentage of reviews which rated the movie as a 1 up to a 10 out of 10. Are these movies particularly popular?

Incidentally, this data set is no longer up-to-date and there are some even longer films now (though it’s a mystery why.)
Clearly, the extreme outliers should be ignored, and for exploring the main distribution of movie lengths it makes sense to set some kind of upper limit. Over 99% of the data are less than three hours in length, so let’s restrict ourselves to those.
Useful context:
Using colour in a plot can be very effective, for example to highlight different groups within the data. Colour is adjusted by setting the optional col argument of the plotting function, and what R does with that information depends on the value we supply.
* If col is assigned a single value: all points on a scatterplot, all bars of a histogram, all boxplots are coloured with the new colour.
* If col is a vector:
    * in a scatterplot, if col is a vector of the same length as the number of data points then each data point is coloured individually
    * in a histogram, if col is a vector of the same length as the number of bars then each bar is coloured individually
    * in a boxplot, if col is a vector of the same length as the number of boxplots then each boxplot is coloured individually

Now that we know how the col argument works, we need to know how to specify colours. Again, there are a number of ways and you can mix and match as appropriate:
* The integers 1:8 are interpreted as colours (black, red, green, blue, …) and can be used as a quick shorthand for a common colour. Type palette() to see the sequence of colours R uses.
* Colour names, such as "steelblue", "darkorange". You can see the list of recognised names by typing colors(), and a document showing the actual colors is available here.
* Hexadecimal colour codes, for example red as "#ff0000" and cyan as "#00ffff".
* Colour functions such as rainbow, heat.colors, and terrain.colors, which all take the number of desired colours as argument.

## Colour example
## 3 plots in one row
par(mfrow=c(1,3))
## colour the cars data by number of gears
plot(x=mtcars$wt, y=mtcars$mpg, col=mtcars$gear, xlab="Weight", ylab="MPG",
main="MPG vs Weight")
## manually colour boxplots
boxplot(mpg~cyl, data=mtcars, col=c("orange","violet","steelblue3"),
main="Car Mileage Data", xlab="Number of Cylinders",
ylab="Miles Per Gallon")
## use a colour function to shade histogram bars
hist(mtcars$mpg,col=rainbow(5),main='MPG')
Use the col argument to add colour to your histograms.
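To check how R interprets a given colour specification, one option is col2rgb, which converts any of the forms above to red, green and blue values. The sketch below is illustrative only; the name shades is ours.

```r
# Each specification can be converted to red/green/blue values with col2rgb
col2rgb("steelblue")   # a named colour
col2rgb("#00ffff")     # a hexadecimal code (cyan)
col2rgb(2)             # an integer index into palette()

# Colour functions return a vector of hex codes of the requested length
shades <- heat.colors(4)
shades

# A name and its hex code describe the same colour
all(col2rgb("cyan") == col2rgb("#00ffff"))
```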
‘Stripplots’ or ‘Stripcharts’ are very similar to the rugplot we applied to our histograms, and display the individual data points along a single axis. They can be used in much the same way as a boxplot, but rather than showing the data summaries they display everything!
The built-in faithful
data set contains measurements on
the waiting times between the eruptions of the Old Faithful
geyser.
data(faithful)
stripchart(faithful$waiting, xlab='Waiting Time', pch=16)
Plotting symbols
The symbols used for points in plots can be changed by specifying a value for the argument pch (which stands for plot character). Specifying values for pch works in the same way as col, though pch accepts integers between 0 and 25 (or a single character) to represent different point types. The default is usually pch=1, which is a hollow circle; in the plot above we changed it to 16, which is a filled circle.
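A simple way to see what each symbol looks like is to draw them all. The sketch below plots the integer codes 0 to 25 documented in ?points, each labelled with its number; the layout choices are ours.

```r
# Draw symbols 0 to 25 in a row, each labelled with its pch number
codes <- 0:25
plot(codes, rep(1, length(codes)), pch = codes,
     ylim = c(0.5, 1.5), yaxt = "n", xlab = "pch", ylab = "")
text(codes, 0.8, labels = codes, cex = 0.7)
```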
However, when we have a lot of data points concentrated in a small interval, the stripplot suffers from problems of ‘overplotting’ where many points with similar values are drawn on top of each other.
A partial solution to this is to add random noise (known as ‘jittering’) to spread out the points.
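In stripchart, jittering is requested through the method argument. A minimal sketch follows; the jitter amount of 0.2 and the seed are arbitrary choices of ours.

```r
data(faithful)
waiting <- faithful$waiting

# method = 'jitter' adds random noise so that coincident points separate
set.seed(1)  # optional: makes the random jitter reproducible
stripchart(waiting, method = 'jitter', jitter = 0.2,
           pch = 16, xlab = 'Waiting Time')
```

The noise is added perpendicular to the data axis, so the waiting times themselves are not altered.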
A better solution is to stack the dots that fall close together, producing an alternative plot to a histogram - sometimes called a ‘dotplot’ or ‘stacked dotplot’:
stripchart(faithful$waiting, method='stack',pch=16)
An evolution of the stripplot is the ‘beeswarm’ plot, available from the beeswarm package. A beeswarm plot is similar to a stripplot, but uses various methods to separate nearby points so that each point is visible.
library(beeswarm)
beeswarm(faithful$waiting)
One limitation of the beeswarm plot is that the computations to arrange all the points do not scale well with large data sets - do not try this with the movies data, or you will be waiting for a very long time!
Download data: bundestag
These data contain the results of the 2009 elections for the German Bundestag, the first chamber of the German parliament. The data set contains the number of votes cast for the various political parties, for each state (“Bundesland”). Amongst the German political parties there are two on the left of the political spectrum, the SPD - similar to the UK’s Labour party - and Die Linke (“The Left”), a party even further to the left. Suppose we’re interested in the support for this “Die Linke” party.
LINKE1
variable using:
horizontal=TRUE
A stem and leaf plot is a technique for displaying the data in a similar fashion to a histogram, while preserving the information of the individual numerical values. Where the histogram summarises the data by the counts in its various intervals, the stem and leaf plot retains the original data values up to two significant figures.
To see how this works, let’s look at the Old Faithful data, sorted from smallest to largest.
sort(faithful$waiting)
## [1] 43 45 45 45 46 46 46 46 46 47 47 47 47 48 48 48 49 49 49 49 49 50 50 50 50 50 51 51 51 51 51 51 52 52
## [35] 52 52 52 53 53 53 53 53 53 53 54 54 54 54 54 54 54 54 54 55 55 55 55 55 55 56 56 56 56 57 57 57 58 58
## [69] 58 58 59 59 59 59 59 59 59 60 60 60 60 60 60 62 62 62 62 63 63 63 64 64 64 64 65 65 65 66 66 67 68 69
## [103] 69 70 70 70 70 71 71 71 71 71 72 73 73 73 73 73 73 73 74 74 74 74 74 74 75 75 75 75 75 75 75 75 76 76
## [137] 76 76 76 76 76 76 76 77 77 77 77 77 77 77 77 77 77 77 77 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
## [171] 79 79 79 79 79 79 79 79 79 79 80 80 80 80 80 80 80 80 81 81 81 81 81 81 81 81 81 81 81 81 81 82 82 82
## [205] 82 82 82 82 82 82 82 82 82 83 83 83 83 83 83 83 83 83 83 83 83 83 83 84 84 84 84 84 84 84 84 84 84 85
## [239] 85 85 85 85 85 86 86 86 86 86 86 87 87 88 88 88 88 88 88 89 89 89 90 90 90 90 90 90 91 92 93 93 94 96
Note that the smallest value is 43, followed by three values of 45. A stem and leaf plot of these data looks like this
stem(faithful$waiting)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 4 | 3
## 4 | 55566666777788899999
## 5 | 00000111111222223333333444444444
## 5 | 555555666677788889999999
## 6 | 00000022223334444
## 6 | 555667899
## 7 | 00001111123333333444444
## 7 | 555555556666666667777777777778888888888888889999999999
## 8 | 000000001111111111111222222222222333333333333334444444444
## 8 | 55555566666677888888999
## 9 | 00000012334
## 9 | 6
Each row of this plot is called a ‘stem’ and the values to the right of the ‘|’ symbol are the leaves. Be sure to read where R places the decimal point for the output. For this result, the decimal is placed one digit to the right of the vertical bar. Thus, the first row of the table then consists of data values of the form \(4x\), and the only leaf is a \(3\) corresponding to the value \(43\) in the data. The next stem groups the values \(45-49\), and we notice the three observations of \(45\) are represented by the \(555\) at the start of the second stem.
Notice that each stem part is representing an interval of width 5, much like a histogram. As usual, R figures out how best to increment the stem part unless you specify otherwise. Finally, notice how the shape of the stem and leaf plot mirrors that of a histogram with interval width 5 - the only difference is that here we can see the values inside the bars.
As with the beeswarm plot, the stem and leaf plot is only suitable for relatively modestly sized data sets due to the fact it is literally writing out all of the data values on the screen!
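The stem width can be adjusted through the scale argument of stem, which controls the length of the plot. The values 0.5 and 2 below are arbitrary; as a rough guide, smaller values tend to give fewer, wider stems and larger values more, narrower ones.

```r
data(faithful)

# Default display
stem(faithful$waiting)

# A shorter plot: fewer stems, each covering a wider interval
stem(faithful$waiting, scale = 0.5)

# A longer plot: more stems, each covering a narrower interval
stem(faithful$waiting, scale = 2)
```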
The data come from an old survey of 237 students taking their first statistics course. The dataset is called survey in the package MASS.
data(survey, package='MASS')
Download data: diamonds
The diamonds data set includes information on the weight in carats (carat) and price of 53,940 diamonds.
Download data: zuni
The zuni dataset seems quite simple. There are three pieces of information about each of 89 school districts in the US State of New Mexico: the name of the district, the average revenue per pupil in dollars, and the number of pupils. The apparent simplicity hides an interesting story. The data were used to determine how to allocate substantial amounts of money and there were intense legal disagreements about how the law should be interpreted and how the data should be used. Gastwirth was heavily involved and has written informatively about the case from a statistical point of view (Gastwirth, 2006 and Gastwirth, 2008).
One statistical issue was the rule that before determining whether district revenues were sufficiently equal, the largest and smallest 5% of the data should first be deleted.
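To see what such a rule involves, here is a sketch on made-up numbers. The revenue vector is simulated and is not the zuni data, and rounding 5% of 89 districts down to 4 is only one possible reading of the rule, which is exactly the kind of interpretation that was disputed.

```r
set.seed(42)
# Hypothetical revenues per pupil for 89 districts (simulated, not the real data)
revenue <- round(rnorm(89, mean = 3000, sd = 400))

n <- length(revenue)
k <- floor(0.05 * n)   # districts dropped from each end (assumed rounding rule)

# Sort, then discard the k smallest and k largest values
trimmed <- sort(revenue)[(k + 1):(n - k)]

range(revenue)
range(trimmed)
```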
Download data: engine
These data record the amounts of three pollutants - carbon monoxide CO, hydrocarbons HC, and nitrogen oxide NO - in grammes emitted per mile by 46 light-duty engines.
Download data: chlorph
These data come from a semi-automated process for measuring the actual amount of chlorpheniramine maleate in tablets which are supposed to contain a 4mg dose.
The tablets used for the study were made by two different manufacturers. For each manufacturer, a composite was produced by grinding together a number of tablets. Each composite was split into seven pieces each of the same weight as a tablet and the pieces were sent to seven different laboratories. Each laboratory made 10 separate measurements on each composite.
The data contain three variables:
* chlorpheniramine - the amount measured
* manufacturer - the tablet manufacturer as a factor (A or B)
* laboratory - the laboratory which performed the measurement as a factor (1 to 7)
boxplot(chlorpheniramine~laboratory*manufacturer, data=chlorph)
Try this (or some abbreviated version of it).