Today we will look at the techniques for producing some standard visualisations of single continuous data variables:
- Plotting a histogram with hist
- Plotting a boxplot with boxplot
- Examining normality using a Normal quantile plot with qqnorm
- Extracting simple numerical summaries with summary and fivenum
- Customising plots with labels, titles, colour, etc.

1 Standard Plots in R

To being with, let’s first see how to use R to produce standard plots of a variable, namely histograms, boxplots, and quantile (or QQ) plots.

For illustration, let us use the mtcars data set which contains information on the characteristics of 23 cars.

data(mtcars)

1.1 Histograms

A histogram consists of parallel vertical bars that graphically shows the frequency distribution of a quantitative variable. The area of each bar is proportional to the frequency of items found in each class. A histogram is useful to look at when we want to see more detail on the full distribution of the data, and features relating to its shape.

To plot a histogram, we use apply the hist function to the data vector. We can extract the mpg (miles-per-gallon) variable from the mtcars data set using the $ operator, and so we can draw a histogram as follows:

hist(mtcars$mpg)

This seems to suggest that we have a peak somewhere between 15 and 20 mpg, and potentially another peak between 30 and 35 mpg - perhaps suggesting groups of ‘fuel efficient’ and ‘fuel inefficient’ cars.

One way to assess if the number of bars in the histogram is appropriate is to show the location of the data points on the horizontal axis. We can add a ‘rug plot’ to our histogram, which marks the positions of the data with lines on the axis:

hist(mtcars$mpg)
rug(mtcars$mpg) ## Note: the 'rug' function draws on top of an existing histogram

Now we can also see where the data fall within the bars!

The default settings of hist will determine the number of bars to display algorithmically, and in this case it has drawn only 5. In general, this is probably too few to show any detail, but we don’t have many data points here. Fortunately, the display of the histogram can be adjusted by a number of arguments:

breaks - allows us to control the number of bars in the histogram. breaks can take a variety of different inputs:
- If breaks is set to a single number, this will be used to (suggest) the number of bars in the histogram.
- If breaks is set to a vector, the values will be used to indicate the endpoints of the bars of the histogram.
freq - if TRUE the histogram shows the simple frequencies or counts within each bar; if FALSE then the histogram shows probability densities rather than counts.

Try using the hist function to draw histograms of miles-per-gallon which match those shown above (don’t worry about the labels).

R Help: hist

1.2 Numerical summaries

A five-number numerical summary can be computed with the fivenum function, which takes a vector of numbers as input. Here, we compute a five-number summary of the mpg data in the mtcars dataset.

fivenum(mtcars$mpg)

## [1] 10.40 15.35 19.20 22.80 33.90

The values returned are the sample minimum, lower quartile, median, upper quartile and maximum. We can see that the median across all the cars in the dataset is about 20 miles per gallon. This is pretty terrible by modern standards, but the data are from 1974 and the USA.

To add a little more information, the summary function includes the mean of the data for a 6-number summary

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

The inclusion of the mean can be quite helpful. If the data have an approximately symmetric distribution then the mean and median values should be close, which can be used as a quick check for any potential skewness in the data. Given that the mean is fairly close to the median, there doesn’t appear to be a dramatic amount of skewness in the distribution of MPG.

1.3 Boxplots

A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. In many ways, it is simply a direct visualisation of the five number summary constructed above. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers (see the figure below).

Boxplots are created for single variables using the boxplot function, but can be used to easily compare many variables or groups within the data. To draw a boxplot of a single variable, multiple variables, or all variables in a data frame, we simply pass the data directly to the boxplot function:

boxplot(mtcars$mpg)

As the boxplot is based on the simple 5-number summary, it lacks the detail of a histogram. However, we can inspect it for features such as symmetry or skewness - a symmetric distribution will give a boxplot with a whiskers of equal length, a centrally-positioned box evenly divided by the median line.

Here, we see the box is slightly off-centre, suggesting some slight skewness. We also note that there are no obvious outliers.

We can also draw boxplots of all the variables in a data set by passing the entire data frame to the boxplot function. However, if the scales of our variables differ substantially then it can be difficult to detect much useful information.

Try it now! Can you see anything useful in the plot?

The boxplot is most useful when for comparing how a variable behaves in different groups (i.e., the levels of a categorical variable). For example, we can compare the MPG with the number of engine cylinders

cyl <- factor(mtcars$cyl) # make the 'cyl' variable categorical
boxplot(mtcars$mpg ~ cyl)

What do you conclude about fuel efficiency in cars with more engine cylinders?

Optional arguments for boxplot include:

horizontal - if TRUE the boxplots are drawn horizontally rather than vertically.
varwidth - if TRUE the boxplot widths are drawn proportional to the square root of the samples sizes, so wider boxplots represent more data.

R Help: boxplot

1.4 Quantile plot

Histograms leave much to the interpretation of the viewer. A better graphical way in R to tell whether the data is distributed normally is to look at a so-called Normal quantile (also known as a quantile-quantile, or QQ) plot.

With this technique, we plot the quantiles of the data (i.e. the ordered data values) against the quantiles of a normal distribution. If the data are normally distributed, then the points of the QQ plot will lie on a straight line. Deviations from a straight line suggest departures from the normal distribution. This technique can be applied to any distribution, though R supports Normal quantile plots with the qqnorm function:

qqnorm(mtcars$mpg)
qqline(mtcars$mpg) ## add the straight line for reference

While the middle chunk of the data lie close to the line, the values away from the middle start to deviate. This suggests that the tails of the distribution (i.e. the extremes) don’t match the Normal distribution shape - in this case, as both tails are pointing the same way, we suspect skewness.

R Help: qqnorm

Suppose you had data that were skewed in the opposite direction. What do you think the quantile plot would look like?
Use the mpg data to make a variable skewed in the opposite direction to mpg - check its histogram and quantile plot. Does it look like you expected?

2 Data Set: Galton’s Heights

Download data: galton

Francis Galton famously developed his ideas on correlation and regression using data which included the heights of parents and their children. This galton data set include data on heights for 928 children and their 205 ‘midparents’. Each ‘midparent’ height is the average of the father’s height and 1.08 times the mother’s height (to adjust for the usual gender differences). Similarly, the daughter’s heights have also been multiplied by 1.08. Note that we have one midparent height for each child, so that many midparent heights are repeated.

The variables are child and parent for the different heights recorded in inches.

Download the data from the link above. Then either load the file by either:
- Using the Files tab in RStudio to locate the file and clicking it to load
- In the Environment tab in RStudio, clicking the “Load Workspace” button, finding the file, and clicking “Open”
- Using the load function in the console and specifying the path to the file as the argument to the function
Draw a boxplot of both variables in the data set.
What features do you see? How do the heights compare in terms of location and spread?
Does this agree with what you expected?
Draw and compare histograms of the two variables - can you detect any noticeable similarities or differences?
Redraw your histograms and add a rugplot to each. Does this give you more information?

2.1 Combining Plots

R makes it easy to combine multiple plots into one overall graph, using either the par or layout functions.

With the par function, we specify the argument mfrow=c(nr, nc) to split the plot window into a grid of $nr \times nc$ plots that are filled in by row. For example, to divide the plot window into a 2x2 grid we call par(mfrow=c(2,2)). The next four plots we draw will then fill the respective quarters of the plot. For example,

Similarly, for 3 plots in a single column we would call par(mfrow=c(3,1)).

To return to the usual single-plot display, we must call par(mfrow=c(1,1)).

When we don’t want to arrange plots in a simple regular grid, we can use the layout function. See the R help for more details.

R Help: par, layout

Use the technique above to draw Normal quantile plots of both heights, again side-by-side - do the data look reasonably normal?
Now draw both of the histograms side-by-side on the same plot (you may need to adjust the width of your plot window.)
Do your plots reveal anything interesting?

One of the difficulties of comparing multiple independent plots is we need to do more work to ensure consistency of presentation. In particular, we should ensure that our histogram intervals and axis ranges are the same for both plots, as the default presentation will change from plot to plot.

2.2 Customising Plots

Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include additional options to customise how the plot is drawn (as well as other graphical parameters). We have seen examples of these already with the axis label arguments xlab and ylab, however we can customise the following plot features for finer control of how a plot is drawn.

Axis limits

To control the ranges of the horizontal and vertical axes, we can add the xlim and ylim arguments to our original plotting function To set the horixontal axis limits, we pass a vector of two numbers to represent the lower and upper limits, xlim = c(lower, upper) specifying numerical values for upper and lower, and repeat the same for ylim to customise the vertical axis.

Axis labels

To specify a label for the x- and y-axes we can supply a string to the xlab and ylab arguments. To give a plot a title, we pass the title as a string to the main argument.

It is easier to compare the shape of distributions in histograms when they are arranged vertically, they use equal horizontal axis limits, and the same binwidths.

Use par to setup a column of two plots.
Plot a histogram of child and then parent from galton using:
- x-axis limits of 60 to 75
- A bar width of 1 unit
- An appropriate x-axis label and plot title
Does the comparison yield any new information that wasn’t conveyed in the boxplot?
Try reducing your bar widths - what do you find?

In interpreting the data, it is worth noting that:

Galton obtained this data “through the offer of prizes” for the “bext Extracts from their own Family Records”, so the sample is hardly a random one
the data are clearly heavily rounded for tabulation
family sizes vary from 1 child up to 15, so that there is a lot of repetition in the midparent heights.

You might expect that if we had the individual un-adjusted heights and the genders of the parents and children we would find that the height data distributions would be neatly bimodal with one peak for females and one for males. They are not. Apparently, height distributions are rarely like that.

2.3 Data Set 2: How Long is a Movie?

Download data: movies

Data science inevitably involves working with large data sets. The effort involved in preparing and making a large dataset usable for analysis should not be underestimated, but thankfully we’re going to look at a dataset “prepared earlier”. This movies data set is reasonably large(ish), containing 24 different attribues of 28819 movies gathered from IMDB. One of the variables is the movie length in minutes, and it is interesting to look at this variable in some detail.

Load the movies data set and draw a histogram of the data.
Plot a histogram of the length variable.
What features can you see?
Add a rugplot to the histogram - what problems does this highlight?
Let’s try a boxplot of the data - do we learn anything more?
Are there any obvious outliers?

Clearly, our data are distorted by some particularly egregious outliers. Let’s look at these a little more closely.

What are the lengths of the two longest movies?
Does that seem sensible?

Although it’s tempting to dismiss these as simple errors, it is worth checking if possible.

Extract the subset of the dataframe containing all the variables for both of these movies
Inspect the variable values:
- The variables r1 to r10 give the percentage of reviews which rated the movie as a 1 up to a 10 out of 10. Are these movies particularly popular?
- What are the names of the movies? Do a quick Google search to see if you can find out more.

Incidentally, this data set is no longer up-to-date and there are some even longer films now (though it’s a mystery why.)

Clearly, the extreme outliers should be ignored, and for exploring the main distribution of movie lengths it makes sense to set some kind of upper limit. Over 99% of the data are less than three hours in length, so let’s restrict ourselve to those.

Extract the all the movies of length at most three hours.
Draw a histogram - what do you find?

Useful context:

The Oscars define a “short film” as anything under 40 minutes.
Animated shorts are typically between 5 and 8 minutes long, and are counted as individual movies (so, e.g., every ‘Tom and Jerry’ cartoon has its own entry).

Redraw your histogram using a bin-width of 1 minute (you may need to enlarge your plot window).
What do you see? How does the information above help explain the data?
Is there any heaping in the data? At what values?

2.4 Using Colour

Using colour in a plot can be very effective, for example to highlight different groups within the data. Colour is adjusted by setting the col optional arugment to the plotting function, and what R does with that information depends on the value we supply.

col is assigned a single value: all points on a scatterplot, all bars of a histogram, all boxplots are coloured with the new colour
col is a vector:
- in a scatterplot, if col is a vector of the same length as the number of data points then each data point is coloured individually
- in a histogram, if col is a vector of the same length as the number of bars then each bar is coloured individually
- in a boxplot, if col is a vector of the same length as the number of boxplots then each boxplot is coloured individually
- if the vector is not of the correct length, it will be replicated until it is and the above rules apply

Now that we know how the col argument works, we need to know how to specify colours. Again, there are a number of ways and you can mix and match as appropriate

Integers: The integers 1:8 are interpreted as colours (black, red, green, blue, …) and can be used as a quick shorthand for a common colour. Type palette() to see the sequence of colours R uses.
Names: R recognises over 650 named colours is specified as text, e.g."steelblue", "darkorange". You can see the list of recognised names by typing colors(), and a document showing the actual colors is available here
Hexadecimal: R can recognise colours specified as hexadecimal RGB codes (as used in HTML etc), so pure red can be specified as "#ff0000" and cyan as "#00ffff".
Colour functions: R has a number of functions that will generate a number of colours for use in plotting. These functions include rainbow, heat.colors, and terrain.colors and all take the number of desired colours as argument.

R Help: colors, palette, color functions

## Colour example
## 3 plots in one row
par(mfrow=c(1,3))
## colour the cars data by number of gears
plot(x=mtcars$wt, y=mtcars$mpg, col=mtcars$gear, xlab="Weight", ylab="MPG", 
     main="MPG vs Weight")
## manually colour boxplots
boxplot(mpg~cyl, data=mtcars, col=c("orange","violet","steelblue3"),
        main="Car Milage Data", xlab="Number of Cylinders", 
        ylab="Miles Per Gallon")
## use a colour function to shade histogram bars
hist(mtcars$mpg,col=rainbow(5),main='MPG')

Show the histograms of length of short movies next to that for ‘non-short’ movies, using a different colour for each histogram.
Experiment with using the col argument to add colour to your histograms.

3 Variations of standard plots

3.1 Stripplots or Stripcharts

‘Stripplots’ or ‘Stripcharts’ are very similar to the rugplot we applied to our histograms, and display the individual data points along a single axis. They can be used in much the same way as a boxplot, but rather than showing the data summaries they display everything!

The built-in faithful data set contains measurements on the waiting times between the eruptions of the Old Faithful geyser.

data(faithful)
stripchart(faithful$waiting,ylab='Waiting Time', pch=16)

Plotting symbols

The symbols used for points in plots can be changed by specifying a value for the argument pch {#pch} (which stands for plot character). Specifying values for pch works in the same way as col, though pch only accepts integers between 1 and 20 to represent different point types. The default is usually pch=1 which is a hollow circle, in the plot above we changed it to 15 which is a filled circle.

However, when we have a lot of data points concentrated in a small interval, the stripplot suffers from problems of ‘overplotting’ where many points with similar values are drawn on top of each other.

A partial solution to this is to add random noise (known as ‘jittering’) to spread out the points.

Make a stripplot of the movie length data - how does the overplotting problem manifest here? You may want to compare to your histogram.

A better solution is to stack the dots that fall close together, producing an alternative plot to a histogram - sometimes called a ‘dotplot’ or ‘stacked dotplot’

stripchart(faithful$waiting, method='stack',pch=16)

Try this out with the movies data.

3.2 Beeswarm plots

An evolution of the stripplot is the ‘beeswarm’ plot, available from the beeswarm package. A bee swarm plot is similar to stripplot, but with various methods to separate nearby points such that each point is visible.

library(beeswarm)
beeswarm(faithful$waiting)

One limitation of the beeswarm plot is that the computations to arrange all the points do not scale well with large data sets - do not try this with the movies data, or you will be waiting for a very long time!

3.3 Data Set 3: German Opinion Polls

Download data: bundestag

These data contain the results of the 2009 elections for the German Bundestag, the first chamber of the German parliament. The contains the number of votes cast for the various political parties, for each state (“Bundesland”). Amongst the German political parties there are two on the left of the political spectrum, the SPD - similar to the UK’s Labour party - and Die Linke (“The Left”), a party even further to the left. Suppose we’re interested in the support for this “Die Linke” party.

Explore the election support for ‘Die Linke’ party by examining the LINKE1 variable using:
- A histogram, with a rugplot
- A stacked dotplot
- A beewswarm plot, setting horizontal=TRUE

3.4 Stem and leaf plots

A stem and leaf plot is a technique for displaying the data in a similar fashion to a histogram, while preserving the information of the individual numerical values. Where the histogram summarises the data by the counts in its various intervals, the stem and leaf plot retains the original data values up to two significant figures.

To see how this works, let’s look at the Old Faithful data, sorted from smallest to largest.

sort(faithful$waiting)

##   [1] 43 45 45 45 46 46 46 46 46 47 47 47 47 48 48 48 49 49 49 49 49 50 50 50 50 50 51 51 51 51 51 51 52 52
##  [35] 52 52 52 53 53 53 53 53 53 53 54 54 54 54 54 54 54 54 54 55 55 55 55 55 55 56 56 56 56 57 57 57 58 58
##  [69] 58 58 59 59 59 59 59 59 59 60 60 60 60 60 60 62 62 62 62 63 63 63 64 64 64 64 65 65 65 66 66 67 68 69
## [103] 69 70 70 70 70 71 71 71 71 71 72 73 73 73 73 73 73 73 74 74 74 74 74 74 75 75 75 75 75 75 75 75 76 76
## [137] 76 76 76 76 76 76 76 77 77 77 77 77 77 77 77 77 77 77 77 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
## [171] 79 79 79 79 79 79 79 79 79 79 80 80 80 80 80 80 80 80 81 81 81 81 81 81 81 81 81 81 81 81 81 82 82 82
## [205] 82 82 82 82 82 82 82 82 82 83 83 83 83 83 83 83 83 83 83 83 83 83 83 84 84 84 84 84 84 84 84 84 84 85
## [239] 85 85 85 85 85 86 86 86 86 86 86 87 87 88 88 88 88 88 88 89 89 89 90 90 90 90 90 90 91 92 93 93 94 96

Note that the smallest value is 43, followed by three values of 45. A stem and leaf plot of these data looks like this

stem(faithful$waiting)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   4 | 3
##   4 | 55566666777788899999
##   5 | 00000111111222223333333444444444
##   5 | 555555666677788889999999
##   6 | 00000022223334444
##   6 | 555667899
##   7 | 00001111123333333444444
##   7 | 555555556666666667777777777778888888888888889999999999
##   8 | 000000001111111111111222222222222333333333333334444444444
##   8 | 55555566666677888888999
##   9 | 00000012334
##   9 | 6

Each row of this plot is called a ‘stem’ and the values to the right of the ‘|’ symbol are the leaves. Be sure to read where R places the decimal point for the output. For this result, the decimal is placed one digit to the right of the vertical bar. Thus, the first row of the table then consists of data values of the form $4x$, and the only leaf is a $3$ corresponding to the value $43$ in the data. The next stem groups the values $45-49$, and we notice the three observations of $45$ are represented by the $555$ at the start of the second stem.

Notice that each stem part is representing an interval of width 5, much like a histogram. As usual, R figures out how best to increment the stem part unless you specify otherwise. Finally, notice how the shape of the stem and leaf plot mirrors that of a histogram with interval width 5 - the only difference is that here we can see the values inside the bars.

Draw a stem and leaf plot of the German election support for ‘Die Linke’ data.
How does the stem and leaf plot represent these data? You may want to look at the data value to help understand.
Now draw a histogram, adjusting the histogram to have axis range and bar width to match the stem and leaf plot.

As with the beeswarm plot, the stem and leaf plot is only suitable for relatively modestly sized data sets due to the fact it is literally writing out all of the data values on the screen!

4 More Practice

4.1 Student survey

The data come from an old survey of 237 students taking their first statistics course. The dataset is called survey in the package MASS.

Load the data with data(survey, package='MASS')
Draw a histogram of student heights - do you see evidence of bimodality?
Experiment with different binwidths for the histogram. Which choice do you think is the best for conveying the information in the data?
Compare male and femal heights using separate histograms with a common scale and binwidths.

4.2 Diamonds

Download data: diamonds

The set diamonds includes information on the weight in carats (carat) and price of 53,940 diamonds.

Is there anything unusual about the distribution of diamond weights? Which plot do you think shows it best? How might you explain the pattern you find?
What about the distribution of prices? With a bit of detective work you ought to be able to uncover at least one unexpected feature. How you discover it, whether with a histogram, a dotplot, or whatever, is unimportant, the important thing is to find it. Having found it, what plot would you draw to present your results to someone else? Can you think of an explanation for the feature?

4.3 Zuni educational funding

Download data: zuni

The zuni dataset seems quite simple. There are three pieces of information about each of 89 school districts in the US State of New Mexico: the name of the district, the average revenue per pupil in dollars, and the number of pupils. The apparent simplicity hides an interesting story. The data were used to determine how to allocate substantial amounts of money and there were intense legal disgreements about how the law should be interpreted and how the data should be used. Gastwirth was heavily involved and has written informatively about the case from a statistical point of view Gastwirth, 2006 and Gastwirth, 2008.

One statistical issue was the rule that before determining whether district revenues were sufficiently equal, the largest and smallest 5% of the data should first be deleted.

Are the lowest and highest 5% of the revenue values extreme? Do you prefer a histogram or boxplot for showing this?
Remove the lowest and highest 5% of the cases, draw a plot of the remaining data and discuss whether the resulting distribution looks symmetric.
Draw a Normal quantile plot of the data after removal of the 5% at each end and comment on whether you would regard the remaining distribution as normal.

4.4 Pollutants from engines

Download data: engine

These data record the amounts of three pollutants - carbon monoxide CO, hydrocarbons HC, and nitrogen oxide NO - in grammes emitted per mile by 46 light-duty engines.

Are the distributions for the three pollutants similar? To make this an easier question to answer, try to produce histograms of the variables, using the same class intervals and range for the horizontal axis in each case.

4.5 Dosage of chlorpheniramine maleate

Download data: chlorph

These data come from a semi-automated process for measuring the actual amount of chlorpheniramine maleate in tablets which are supposed to contain a 4mg dose.

The tablets used for the study were made by two different manufacturers. For each manufacturer, a composite was produced by grinding together a number of tablets. Each composite was split into seven pieces each of the same weight as a tablet and the pieces were sent to seven different laboratories. Each laboratory made 10 separate measurements on each composite.

The data contain three variables: * chlorpheniramine - the amount measured * manufacturer - the tablet manufacturer as a factor (A or B) * laboratory - the laboratory which performed the measurement as a factor (1 to 7)

Produce box-plots of the chlorpheniramine measurements split by laboratory. What, if anything, do they suggest?
Now produce box-plots by manufacturer. Anything noticeable?
The problem here is that we really need a separate box-plot for each combination of manufacturer and laboratory. We can do this by boxplot(chlorpheniramine~laboratory*manufacturer, data=chlorph) Try this (or some abbreviated version of it).
Try reversing the order of laboratory and manufacturer. Which way round is better? Try colouring the box-plots by manufacturer. Does that help?
Do the data suggest that the two manufacturers actually put different amounts of the drug into supposed 4mg tablets?
Are there obvious differences between different laboratories? If so, what kind of differences do you observe?
If you had to choose a single laboratory to make some measurements for you, which would you choose and why?

Workshop 2 - Exploring Continuous Variables