This workshop is intended for self-study after completing Lectures 1-10 and Workshops 1-6.

The goal is to use the techniques we’ve seen so far on some unseen data sets, and to introduce some more advanced visualisation techniques, such as animation.

New techniques:

  • Use plot_ly to produce interactive plots
  • Use the manipulate package to create interactive plots

You will need to install the following packages for today’s workshop:

  • corrplot for drawing correlation plots
  • gplots for drawing heatmaps (note this is different to ggplot2)
  • aplpack for drawing Chernoff faces (this has a ridiculous number of dependencies, so you may want to skip it)
  • manipulate and plotly for creating interactive plots
install.packages(c("corrplot","gplots","aplpack","manipulate","plotly"))

1 Case Study 1: Gapminder Data and Interactive Plots

Download data: gapminder

The data contain values for life expectancy, GDP per capita, population, fertility and infant mortality, every year, from 1800 to 2020. The variables are:

  • country - the country, factor with 142 levels
  • continent - the continent, factor with 5 levels
  • year - year of observation, ranging from 1800 to 2020
  • lifeExp - life expectancy at birth, in years
  • pop - population
  • gdpPercap - GDP per capita (US$, inflation-adjusted)
  • infantMort - Deaths of children under 5 years of age per 1000 live births
  • fertility - The number of children that would be born to each woman with prevailing age-specific fertility rates

This data set is quite complex, as we have five time series observed for multiple countries (i.e. a categorical variable with many levels). To get a feeling for the data, let’s focus on a particular picture of the state of the world in one year - let’s take 2015.

  • Load the data, and have a look to see how it is laid out.
  • Extract the subset of the data corresponding to the year of 2015, and save this as a new variable.
  • Try to get a quick overview of the data, looking at both the behaviour of the different variables with respect to each other and the behaviour of the different cases in relation to each other.
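One possible way through these first steps is sketched below. It is not the only approach, and it makes some assumptions: so that the code runs on its own, it builds a small synthetic stand-in called gap_toy (an invented name, with made-up values and only some of the columns) rather than loading the real file. Swap in your own gapminder data frame when working through the exercise.

```r
# Synthetic stand-in for the gapminder data (all values are invented)
set.seed(1)
gap_toy <- expand.grid(country = factor(paste0("Country", 1:12)),
                       year = c(1945, 2015))
cont_names <- c("Africa", "Americas", "Asia", "Europe")
gap_toy$continent <- factor(cont_names[(as.integer(gap_toy$country) - 1) %% 4 + 1])
gap_toy$gdpPercap <- exp(rnorm(nrow(gap_toy), mean = 9, sd = 1))
gap_toy$lifeExp   <- 35 + 4.5 * log(gap_toy$gdpPercap) + rnorm(nrow(gap_toy), sd = 2)
gap_toy$pop       <- round(exp(rnorm(nrow(gap_toy), mean = 16, sd = 1.5)))

# How is the data laid out?
str(gap_toy)
head(gap_toy)

# Extract the 2015 snapshot and save it as a new variable
gap2015 <- gap_toy[gap_toy$year == 2015, ]

# Variables against each other: scatterplot matrix and correlations
num_vars <- c("lifeExp", "pop", "gdpPercap")
pairs(gap2015[, num_vars], col = as.integer(gap2015$continent), pch = 16)
round(cor(gap2015[, num_vars]), 2)

# Cases against each other: heatmap of the standardised variables,
# with the rows (countries) reordered by hierarchical clustering
heatmap(scale(as.matrix(gap2015[, num_vars])), scale = "none",
        labRow = as.character(gap2015$country))
```

The base heatmap function is used here only to keep the sketch free of extra packages; the gplots and corrplot packages installed above offer richer versions of the same ideas.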

Let’s focus on a single variable:

  • Look at the distribution of life expectancy. What features do you see?
  • How does the distribution of life expectancy change with different continents? Try a variety of different techniques to visualise your results.
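As a rough illustration of "a variety of different techniques", the sketch below draws a histogram, boxplots and a jittered strip chart side by side. The data frame snap_toy is a synthetic stand-in for your 2015 subset (invented name and values), so treat the pictures as a template rather than as results.

```r
# Synthetic stand-in for a one-year snapshot (all values are invented)
set.seed(2)
continents <- c("Africa", "Americas", "Asia", "Europe", "Oceania")
snap_toy <- data.frame(
  continent = factor(rep(continents, each = 10)),
  lifeExp   = rnorm(50, mean = rep(c(60, 74, 72, 79, 76), each = 10), sd = 4)
)

op <- par(mfrow = c(1, 3))   # three plots in one row
# Overall distribution
hist(snap_toy$lifeExp, breaks = 12, main = "All countries",
     xlab = "Life expectancy")
# Distribution within each continent
boxplot(lifeExp ~ continent, data = snap_toy, las = 2, xlab = "",
        ylab = "Life expectancy", main = "Boxplots")
# The individual points, jittered to reduce overplotting
stripchart(lifeExp ~ continent, data = snap_toy, method = "jitter",
           vertical = TRUE, pch = 16, las = 2,
           ylab = "Life expectancy", main = "Strip chart")
par(op)                      # restore the previous layout

# A numerical summary to set alongside the plots
med <- tapply(snap_toy$lifeExp, snap_toy$continent, median)
med
```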

If we had more time, we would explore all the variables like this. However, let’s start scrutinising the behaviour in different countries:

  • How does the life expectancy for the United Kingdom change over time? How about the GDP? Draw both plots side-by-side.
  • Take a look at some other countries. Note that for many countries the data is not available before 1900.
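Side-by-side plots can be produced with par(mfrow = ...). The sketch below uses a synthetic single-country series, uk_toy (invented name and values); with the real data you would first extract the rows for one country, for example gapminder[gapminder$country == "United Kingdom", ], assuming that is how the country is spelled in your file.

```r
# Synthetic stand-in for one country's time series (all values are invented)
set.seed(3)
yrs <- 1800:2020
uk_toy <- data.frame(
  year      = yrs,
  lifeExp   = 40 + 40 * plogis((yrs - 1930) / 25) + rnorm(length(yrs), sd = 0.8),
  gdpPercap = 2000 * exp(0.014 * (yrs - 1800)) * exp(rnorm(length(yrs), sd = 0.05))
)

op <- par(mfrow = c(1, 2))   # one row, two columns
plot(lifeExp ~ year, data = uk_toy, type = "l",
     xlab = "Year", ylab = "Life expectancy")
plot(gdpPercap ~ year, data = uk_toy, type = "l",
     xlab = "Year", ylab = "GDP per capita")
par(op)                      # restore the previous layout
```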

What we really want to do is get a complete picture of all the time series in all the countries! For this we’ll need to use a loop to draw the time series for each country.

library(scales)
cols <- alpha(c("#F44336", "#2196F3", "#5fd53f", "#9C27B0", "#FF9800"),0.75)
plot(x=range(gapminder$year),y=range(gapminder$lifeExp),
     xlab='Year',ylab='Life Expectancy', pch=NA)
for(co in levels(gapminder$country)){
  lines(x=gapminder$year[gapminder$country==co], 
        y=gapminder$lifeExp[gapminder$country==co], 
        col=cols[gapminder$continent[gapminder$country==co]])
}
legend(x='topleft',legend=levels(gapminder$continent),col=cols,lty=1)
  • Run the code above. Do you understand what each line is doing?
  • What general patterns can you see?
  • Modify the code to plot the time series of GDP. What do you see?
  • With the substantial variation in scale in the GDP variable, using a logarithmic axis may help here. Add log='y' to the plot command and redraw the plot - does this help?

Given we have multiple associated time series evolving at the same time, a scatterplot of one against the other would give a useful snapshot of a slice through all the time series.

  • Draw a scatterplot of life expectancy (vertical axis) against GDP (horizontal axis) for the 2015 data. Colour your points by continent. What features do you see?
  • Let’s add the population size to the plot using the plot symbol size. Use the symbols function to draw a bubbleplot of the same data, with the radius of the circles determined by the country’s population. You may want to adjust the maximum circle size with the inches argument.
  • Redraw the bubbleplot for 1945 - what differences do you see?
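A minimal sketch of the coloured scatterplot and the bubbleplot follows, again on a synthetic snapshot (toy, with invented values). One design choice here is an assumption rather than part of the exercise: the radius is set to the square root of the population, so that the circle area, rather than the radius, is proportional to population.

```r
# Synthetic stand-in for a one-year snapshot (all values are invented)
set.seed(4)
n <- 40
continents <- c("Africa", "Americas", "Asia", "Europe", "Oceania")
toy <- data.frame(
  continent = factor(sample(continents, n, replace = TRUE), levels = continents),
  gdpPercap = exp(rnorm(n, mean = 9, sd = 1)),
  pop       = exp(rnorm(n, mean = 16, sd = 1.5))
)
toy$lifeExp <- 35 + 4.5 * log(toy$gdpPercap) + rnorm(n, sd = 3)

# One colour per continent, using the palette from the time series plot
cols   <- c("#F44336", "#2196F3", "#5fd53f", "#9C27B0", "#FF9800")
pt_col <- cols[as.integer(toy$continent)]

# Plain scatterplot, coloured by continent
plot(toy$gdpPercap, toy$lifeExp, col = pt_col, pch = 16,
     xlab = "GDP per capita", ylab = "Life expectancy")
legend(x = "bottomright", legend = levels(toy$continent), col = cols, pch = 16)

# Bubbleplot: circle radius taken as sqrt(population);
# 'inches' sets the size of the largest circle
radius <- sqrt(toy$pop)
symbols(x = toy$gdpPercap, y = toy$lifeExp, circles = radius, inches = 0.3,
        bg = adjustcolor(pt_col, alpha.f = 0.6), fg = "grey30",
        xlab = "GDP per capita", ylab = "Life expectancy")
legend(x = "bottomright", legend = levels(toy$continent), col = cols, pch = 16)
```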

1.1 Animation with ‘plotly’

The Gapminder video used animation to show how the bubbleplot evolved over time. There are a variety of options to allow us to animate our graphics. However, the techniques we have seen so far are designed only for drawing single static images. So, we’ll need to use a different method - thankfully the plotly package provides great support for animation.

To make this work, we’ll need to use plotly’s own plotting function, plot_ly. It is a fairly sensible function, but it takes longer than usual to produce a plot, so you may need to wait a few seconds.

library(plotly)
p <- plot_ly(x = gapminder$gdpPercap, # x value
             y = gapminder$lifeExp, # y value
             size = sqrt(gapminder$pop),  # bubble size grows with the square root of population
             color = gapminder$continent,  # colours
             frame = gapminder$year, # each year is a separate animation frame
             text = gapminder$country, # text labels will show when hovering
             type = 'scatter', # type of plot
             mode = 'markers', marker = list(sizemode = 'diameter'),
             fill = ~'', # commonly used workaround to silence a plotly warning about line widths
             hoverinfo = "text")
p

The first command creates the plot ‘object’, here called p. To draw the plot, we then evaluate p at the console.

  • After a short delay, the plot will appear in the plot window, initially showing the data for 1800.
  • The slider at the bottom governs which year of data is being shown. Experiment with moving the slider to find different years.
  • The Play button will animate the plot, moving smoothly from one year’s data to the next.
  • You can hover the mouse over individual data points to see the text label, which we configured to be the country.
  • There are a variety of other controls in the top right of the window, allowing you to highlight points or zoom in.
  • The plot used in the YouTube video uses a log scale for its GDP axis. To add a log-scale x axis, we modify the existing plot object p with the layout function and save the result back to p, like so: p <- layout(p, xaxis = list(type = "log")).
  • Similarly, we can add axis labels using the layout function: p <- layout(p, xaxis = list(title='GDP',type = "log"), yaxis=list(title='Life expectancy')).
  • Now re-evaluate p.
  • Which version of the plot gives the best differentiation among the countries?
  • Explore and experiment with the plot, and see if you can identify the same key features highlighted in the video.
  • Try plotting some of the other variables on the axes instead of GDP or life expectancy.

Unfortunately, modifying the individual frames of the animation to, e.g. add a smoothed trend, is not quite so easy. However, plotly can be very effective at animating a relatively simple graphic such as this.

1.2 Using the ‘manipulate’ package

An alternative, albeit rather less fancy, approach is to use the manipulate package. While this doesn’t animate the plots, it does offer a similar slider control to plotly and allows us more customisation of what is graphed.

  • Return to the bubbleplot you created before we looked into plotly.
  • Create a custom R function called bubble which takes a single argument year and runs your code to draw the bubbleplot for that year.
  • Try it out by running bubble(1945) and bubble(2010).
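One way such a function might look is sketched below. The data frame toy is a synthetic stand-in with invented values, and the second argument data (with toy as its default) is an addition for this sketch only; the exercise asks for a single argument, year, with your real data used inside the function.

```r
# Synthetic stand-in for the gapminder data (all values are invented)
set.seed(5)
toy <- expand.grid(country = paste0("Country", 1:10), year = 1800:2020,
                   stringsAsFactors = FALSE)
n <- nrow(toy)
toy$gdpPercap <- 1000 * exp(0.012 * (toy$year - 1800) + rnorm(n, sd = 0.1))
toy$lifeExp   <- 30 + 0.2 * (toy$year - 1800) + rnorm(n, sd = 2)
toy$pop       <- 1e6 * as.integer(sub("Country", "", toy$country))

# Draw the bubbleplot for a single year
bubble <- function(year, data = toy) {
  d <- data[data$year == year, ]          # rows for the requested year
  symbols(x = d$gdpPercap, y = d$lifeExp,
          circles = sqrt(d$pop), inches = 0.3,
          bg = adjustcolor("#2196F3", alpha.f = 0.6), fg = "grey30",
          # axis limits come from the full data, so the axes do not
          # jump around when the year changes
          xlim = range(data$gdpPercap), ylim = range(data$lifeExp),
          xlab = "GDP per capita", ylab = "Life expectancy",
          main = paste("Year:", year))
  invisible(d)                            # return the subset quietly
}

bubble(1945)
bubble(2010)
```

Fixing xlim and ylim across years is a deliberate choice: it makes plots for different years directly comparable, which matters once a slider is attached.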

Given a pre-defined plotting function, we can use the manipulate package (which works only within RStudio) to add a slider to our plot.

manipulate(bubble(year), year=slider(1800,2020))

This will draw the default plot at 1800, and add a small cog icon in the top left. Click this to reveal the slider. Now, when we move the slider, it will change the value of year, pass this into the bubble function and re-draw the plot. Similarly, you can use the left/right arrows to step through the years.

Unfortunately, manipulate doesn’t support all of the features that plotly did - there are no animations, hover labels, or zooming here. It is comparatively basic, but does allow us to have a bit more control over what is drawn.

  • Modify your bubble function to add text labels on the points for the UK, USA, and China.
  • Add a loess smoothed trend for each continent.
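The two additions could be sketched roughly as follows, on a synthetic snapshot (toy, with invented values). The highlighted names are placeholders standing in for the UK, USA and China; with the real data the names must match the spelling used in the country column. The wide span and linear fit passed to loess are choices made for this small toy data set, not recommendations.

```r
# Synthetic stand-in for a one-year snapshot (all values are invented)
set.seed(6)
n <- 60
toy <- data.frame(
  country   = paste0("Country", seq_len(n)),
  continent = factor(rep(c("Africa", "Asia", "Europe"), each = 20)),
  lgdp      = rnorm(n, mean = 3.9, sd = 0.4)      # log10 of GDP per capita
)
toy$lifeExp <- 20 + 13 * toy$lgdp + rnorm(n, sd = 3)

cols <- c("#F44336", "#2196F3", "#5fd53f")
plot(toy$lgdp, toy$lifeExp, col = cols[as.integer(toy$continent)], pch = 16,
     xlab = "log10(GDP per capita)", ylab = "Life expectancy")

# Text labels for a few chosen countries (placeholder names)
highlight <- c("Country1", "Country25", "Country50")
lab <- toy[toy$country %in% highlight, ]
text(lab$lgdp, lab$lifeExp, labels = lab$country, pos = 3)

# One loess fit per continent, drawn over the range of that continent's data
by_cont <- split(toy, toy$continent)
fits <- lapply(by_cont, function(d) {
  loess(lifeExp ~ lgdp, data = d, span = 1, degree = 1)
})
for (k in seq_along(fits)) {
  xs <- seq(min(by_cont[[k]]$lgdp), max(by_cont[[k]]$lgdp), length.out = 50)
  lines(xs, predict(fits[[k]], newdata = data.frame(lgdp = xs)),
        col = cols[k], lwd = 2)
}
legend(x = "topleft", legend = levels(toy$continent), col = cols, lty = 1)
```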

2 Case Study 2: Historical Weather in Durham

Download data: weather

The data set above comprises the Met Office’s historical weather station data for Durham, containing monthly observations from 1880 onwards:

  • Mean daily maximum temperature (tmax)
  • Mean daily minimum temperature (tmin)
  • Average daily temperature (tave) - defined as the mean of tmax and tmin
  • Total rainfall (rain)

Like our previous data set, we have some long time series with a potentially interesting structure. An obvious question to investigate is whether there have been any long-term changes in the weather, as possible signs of the effects of climate change.

A graphical exploratory analysis might proceed as follows:
  • Inspect the data, and have a quick look at summaries of the individual variables. Does anything stand out?
  • Plot the time series in a column of plots.
    • Are there any obvious features?
    • Investigate a log transformation of the rainfall values. Why might this be appropriate?
  • Try fitting a loess smoother to the time series:
    • Add the smoothed trend to your plots.
    • What long-term behaviour is suggested?
  • Try a time series decomposition on the data:
    • Which series would be suitable for this technique?
    • How does the extracted trend compare to the loess smooth?
    • Look at the seasonal component - which month is, on average, Durham’s hottest and coldest?
    • Do the trend and seasonal components explain most of the variation in the data, or do the data remain noisy?
  • Given that seasonality is an important factor for the temperature series:
    • Draw a new plot which plots the time series for the average temperature, but showing each month as a separate series (i.e. plot all the January data, then February, etc)
    • Add a loess smoothed trend.
    • How large is the long-term change in trend relative to the average differences in temperature in the different months of the year?
  • One alternative way of dealing with the seasonality within the year is to compute the yearly average.
    • The aggregate function can be used to combine our data in this way. To take the yearly mean of the average monthly temperatures, run the code aggregate(weather$tave, by=list(weather$year), FUN=mean). Save this to a new variable.
    • Have a look at the object returned by aggregate, and plot its contents.
    • How does this compare to (a) the original time series, (b) your smoothed loess trend, (c) the components of the decomposed time series?
  • Use manipulate to make an interactive plot with a slider for the year, which draws the monthly average temperature series for the chosen year.
    • Investigate the changes in the annual temperature profiles over time.
    • Try modifying your plot to:
      • Add the time series for previous years to the plot in a different colour.
      • Use transparency to make the historical data fade away, with more distant years being more transparent.
      • Check and indicate whether the chosen year has recorded the highest historical temperature up to that point.
      • Use the picker function from the manipulate package to change which variable you’re plotting.
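Three of the steps above (the loess smooth, the time series decomposition, and the yearly averages) can be sketched as follows. The data frame weather_toy is a synthetic monthly series with a built-in seasonal cycle and a small upward drift (invented name and values), and the columns month and year are assumptions about how the real data are laid out. The real series may also contain missing values, which this sketch does not handle.

```r
# Synthetic monthly temperatures, 1880-2019 (all values are invented)
set.seed(7)
weather_toy <- expand.grid(month = 1:12, year = 1880:2019)
n <- nrow(weather_toy)
weather_toy$time <- weather_toy$year + (weather_toy$month - 0.5) / 12
weather_toy$tave <- 8.5 + 0.008 * (weather_toy$year - 1880) +
  5.5 * cos(2 * pi * (weather_toy$month - 7) / 12) + rnorm(n, sd = 1)

# 1. Loess smooth added to the monthly series
fit <- loess(tave ~ time, data = weather_toy, span = 0.3)
plot(weather_toy$time, weather_toy$tave, type = "l", col = "grey60",
     xlab = "Year", ylab = "Average temperature")
lines(weather_toy$time, fitted(fit), col = "red", lwd = 2)

# 2. Classical decomposition into trend, seasonal and random components
tave_ts <- ts(weather_toy$tave, start = c(1880, 1), frequency = 12)
dec <- decompose(tave_ts)
plot(dec)
# dec$figure holds the 12 estimated seasonal effects, January first
hottest <- month.abb[which.max(dec$figure)]
coldest <- month.abb[which.min(dec$figure)]

# 3. Yearly means via aggregate; the result has columns 'year' and 'x'
yearly <- aggregate(weather_toy$tave, by = list(year = weather_toy$year),
                    FUN = mean)
plot(yearly$year, yearly$x, type = "l",
     xlab = "Year", ylab = "Yearly mean of average temperature")
```

Comparing the loess curve, the trend panel of plot(dec), and the yearly means on the same axes is a useful check: all three are estimates of the same slow movement, each with a different degree of smoothing.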