[]
barplot
to show the
distribution of a categorical variablepie
to draw pie charts to represent proportions
For these workshops, we will provide RData files which will be indicated as below:
Download data: Marriage
Files with an extension “Rda” are R DAta files, and can be loaded directly into R. First, download the file and save it somewhere easily accessible (you can always move it later). You may want to save it to the same place as your R code and set that place to be R’s working directory. Once saved, there are various options to load the data file:
Session
menu and click
Load Workspace...
. Select the data file to load and click
Open.File
tab in the bottom right panel
of the screen. Use the file browser to locate the file, and simply click
it to load. If you can’t find your saved file, you may need to move it
to a place RStudio can find (usually this excludes external drives,
cloud storage, or similar.) load(file.choose())
C:/data/file.Rda
), you can simply use the load
function and type the file path:load("C:/data/file.Rda")
Advanced: To make life easier in the long-term, you can set your computer to automatically open .Rda files using RStudio. This is a more advanced solution, so you’ll have to figure out the details from the appropriate link below. Once set up, you’ll be able to load .Rda files into RStudio just by double-clicking.
The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama from the late 1990s. Looking at the first few rows of the data, we can see the data contains a lot of variables, some categorical, some numerical, some dates, and some other junk.
head(Marriage)
bookpageID | appdate | ceremonydate | delay | officialTitle | person | dob | age | race | prevcount | prevconc | hs | college | dayOfBirth | sign |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B230p539 | 1996-10-29 | 1996-11-09 | 11 | CIRCUIT JUDGE | Groom | 2064-04-11 | 32.60274 | White | 0 | NA | 12 | 7 | 102 | Aries |
B230p677 | 1996-11-12 | 1996-11-12 | 0 | MARRIAGE OFFICIAL | Groom | 2064-08-06 | 32.29041 | White | 1 | Divorce | 12 | 0 | 219 | Leo |
B230p766 | 1996-11-19 | 1996-11-27 | 8 | MARRIAGE OFFICIAL | Groom | 2062-02-20 | 34.79178 | Hispanic | 1 | Divorce | 12 | 3 | 51 | Pisces |
B230p892 | 1996-12-02 | 1996-12-07 | 5 | MINISTER | Groom | 2056-05-20 | 40.57808 | Black | 1 | Divorce | 12 | 4 | 141 | Gemini |
B230p994 | 1996-12-09 | 1996-12-14 | 5 | MINISTER | Groom | 2066-12-14 | 30.02192 | White | 0 | NA | 12 | 0 | 348 | Saggitarius |
B230p1209 | 1996-12-26 | 1996-12-26 | 0 | MARRIAGE OFFICIAL | Groom | 1970-02-21 | 26.86301 | White | 1 | NA | 12 | 0 | 52 | Pisces |
While we usually always start looking at all the data in a data set
at once, we very often will want to focus on a particular set of data
points (such as marriages after a particular date, or only the brides’
records) from a data frame. We can do so using the [
and
]
characters.
The data set (in R these objects are called data frames
)
is effectively a matrix - in the mathematical sense - so it can be
treated as a grid of data values, where each row is a data point and
each column a variable. We can use this idea to extract parts of the
data set by selecting particular rows or columns. For example, to
extract the 10th row in the data, we can type
Marriage[10,]
## bookpageID appdate ceremonydate delay officialTitle person dob age race prevcount
## 10 B231p48 1997-03-12 1997-03-22 10 MINISTER Groom 2051-07-02 45.75342 White 3
## prevconc hs college dayOfBirth sign
## 10 Divorce 12 6 183 Cancer
The comma here is important - anything after the comma indicates which columns we want to extract. Here we leave it blank, which returns all the columns. Alternatively, we could specify specific columns by name or number. Extending this idea, typing
y = Marriage[c(3,6,9,12),5:10]
will create a new data frame consisting of columns 5 to 10 of the
3rd, 6th, 9th and 12th rows of the Marriage
data frame, all
saved to a new variable called y
. This can be useful when
you want to regularly work with a particular subset of data and want to
save typing.
We can also use a logical test to extract only those rows which match a particular condiation. For example, to get only the marriage records of those with more than 1 previous marriage we could type
Marriage[Marriage$prevcount>1,]
Or all the marriages officiated by a judge:
Marriage[Marriage$officialTitle=="CIRCUIT JUDGE",]
Note the use of ==
with two =
signs - this
indicates a test for equality, rather than an
assignment of one variable to another.
We can apply the same techniques to extract elements of a vector,
only now we drop the ,
in the []
as a vector
only has a length. So, the 10th recorded value of age
could
be found by typing:
Marriage$age[10]
## [1] 45.75342
The barplot
is useful for summarizing categorical data.
Continuing with the Marriage
data, let’s look at who
officiated at these weddings - as indicated by the
officialTitle
variable. If we look at the first few values
in the officialTitle
variable, we find:
Marriage$officialTitle[1:10]
## [1] CIRCUIT JUDGE MARRIAGE OFFICIAL MARRIAGE OFFICIAL MINISTER MINISTER
## [6] MARRIAGE OFFICIAL MARRIAGE OFFICIAL MARRIAGE OFFICIAL MARRIAGE OFFICIAL MINISTER
## 9 Levels: BISHOP CATHOLIC PRIEST CHIEF CLERK CIRCUIT JUDGE ELDER MARRIAGE OFFICIAL MINISTER ... REVEREND
We can see that there are a variety of different officials. Notice
also that below the data is a line labelled Levels
- here R
is being helpful and listing all of the possibilities for this
particular categorical variable. To visualise the distribution we must
find the counts (or proportions) of each in the data set.
Before doing so, which do you think would be most (least) common?
Now we use the xtabs
function to do the counting of the
different genders before plotting:
xtabs(~officialTitle, data=Marriage) ## this makes the table of counts for the officialTitle variable in the Marriage data set
## officialTitle
## BISHOP CATHOLIC PRIEST CHIEF CLERK CIRCUIT JUDGE ELDER MARRIAGE OFFICIAL
## 2 2 2 2 2 44
## MINISTER PASTOR REVEREND
## 20 22 2
The syntax is a little odd here, note the use of the ~
(tilde) sign, which you may recognise from fitting linear regressions
with lm
. The variable name to the right of the
~
is the one we want summarising as counts.
Did this agree with what you expected?
We can then pass this summary table to the barplot
function for plotting. Since we’ll be working with this table for a
while, we can also save it to a variable to save repetition.
offs <- xtabs(~officialTitle, data=Marriage) ## save the table to `offs`
barplot(offs, col='skyblue')
Unfortunately, not all of our category labels are displayed due to
the restricted space. We can make the labels perpendicular to the axis
with the optional argument las=2
, which improves things
enough to be readable:
barplot(offs, las=2, col='skyblue')
Note that some data sets may give you the counts of the categories
directly and you can skip the call to xtabs
, so it is
always important to look at your data before plotting.
Like the other plotting functions we have seen, we can customise this
with axis labels and colours to fill the bars. The barplot
also supports the following additional arguments to customise the
plot:
names
- use this to pass a vector of labels for each
barhorizontal
- set to TRUE
to show a
horizontal barplot. Defaults to FALSE
and a vertical
presentationwidth
- use this to supply a vector of values to
specify the widths of the barsspace
- use this to supply a vector of values to
specify the spacing between barsThe many issues
with pie charts notwithstanding, generating a pie chart is
relatively easy with the function pie
. Note that the pie
chart emphasises showing the data as proportions of the whole,
rather than separate counts.
pie
takes the same data input as the barplot, and can be
supplied with labels
, and given custom col
ours
for the wedges. We’ll stick with the defaults for now, but feel free to
experiment.
pie(offs)
If your goal is compare each category with the the whole (e.g., what portion of weddings are officiated by a Bishop compared to all participants), and the number of categories is small, then pie charts may work for you. However, the best alternative to a piechart is the barchart, but if you really want to show proportions of a whole via area, then a treemap is a better choice (see below).
R makes it easy to combine multiple plots into one overall
graph, using either the par
or layout
functions.
With the par
function, we specify the argument
mfrow=c(nr, nc)
to split the plot window into a grid of nr
x nc plots that are filled in by row. For example, to divide the plot
window into a 2x2 grid we call par(mfrow=c(2,2))
as
below
Similarly, for 3 plots in a single column we use we
par(mfrow=c(3,1))
The plot window will remain divided like this until you reset it, and
any new plots draw will fit into your grid. To return to the usual
single-plot display, we must call par(mfrow=c(1,1))
.
If we don’t want to arrange plots in a simple regular grid and have
something more complex in mind, we can use the layout
function.
Download data: alzheimer
Some simple practice to get started.
Three groups of patients (one with Alzheimer’s, one with other forms
of dementia, and a control group with other diagnoses) were studied.
Counts are given in the alzheimer data linked above. Create a graphical
summary of barplots of each of the three variables, smoking
(cigarettes per day), disease
, and gender
, in
a single window.
Download data: beps
In many surveys there is a string of questions to which respondents give answers on integer scales from, say, 1 to 5. This is so commonplace that one such scale even has a name - the Likert scale.
The beps
dataset includes seven questions put to 1525
voters in the British Election Panel Study for 1997-2001. One question
was on a scale from 1 to 11, one from 0 to 3, and the rest were from 1
to 5.
The leaders of the main political parties at the time were Tony Blair for Labour (the Prime Minister at the time), William Hague for the Conservatives, and Charles Kennedy for the Liberal Democrats. Each surveyed individual assessed each party leader on a scale of 1 to 5 (5=best).
beps
data set from the file above. Use the
head
function to get a quick look at the first few rows of
the data to see what the variables are and how they are
represented.Blair
, Hague
and
Kennedy
. Don’t forget to call the xtabs
function to summarise the data before plotting.Blair
and Hague
? How would you interpret
these patterns?Kennedy
?In addition to assessing the party leaders, two further variables concerned Europe. The first asked the respondents to quantify their knowledge of the parties’ policies on European integration from low knowledge (0) to high knowledge (3), and the second measures the individuals attitudes to European integration from 1 to 11, where higher values are more Eurosceptic.
political.knowledge
variable. Reset your plot window to show a single plot, and plot the
data - how would you describe and explain what you find?Europe
variable representing
individual attitudes towards Europe. Do any features stand out?vote
variable. Plot the voting intentions - can you explain what you find?
(Bear in mind the survey was taken over 1997-2001!)vote
for
Labour
, and plot their attitudes to Europe, using party
colours as above. Add a main
plot label to indicate the
party.Conservative
and
Liberal Democrat
voters.Download data: occupation
The data set above were presented by Yamaguchi (1987), and looks at the categories of occupation of fathers and sons in the USA, UK and Japan. The occupations of the fathers and sons are grouped into categories:
UpNM
- Upper nonmanuals - are professionals, managers,
and officials;LoNM
- lower nonmanuals - are proprietors, sales
workers, and clerical workers;UpM
- upper manuals - are skilled workers;LoM
- lower manuals - are semi-skilled and unskilled
nonfarm workers;Farm
- farmers and farm laborers.The data already contains the counts for the various categories, so we’ll need to work with the data slightly differently.
Freq
) of the
combinations of the three categorial variables (Son
,
Father
, and Country
). The first row of the
data set corresponds to 1275 observations of
(UpNM,UpNM,US)
.Country
then there’s still some more summarising to be
done. To do this, we can still use the xtabs
function, but
we modify the syntax to reflect that the Freq
column
already contains count information. So, to obtain the counts per
country, we type xtabs(Freq~Country, data=occupation)
.Country
is
UK
and save to a new variable.A treemap can be used as an alternative to a pie chart for displaying relative sizes of groups as a proportion of a larger whole. Unlike the pie chart, which divides a circle into wedges, the treemap divides a rectangle into smaller rectangles.
The setup of a treemap is a bit more involved as we need to explicitly identify the data counts, and the labels, and provide them as a data frame. For example, to show the approval ratings for Tony Blair:
# you'll need to install `treemap` if you don't have it
# install.packages("treemap")
library(treemap)
# Create data
data <- data.frame(table(beps$Blair))
# treemap
treemap(data,
index="Var1", ## the name of the variable which contains the labels
vSize="Freq" ## the name of the variable which contains the counts
)
The treemap is most effective at showing data where there are multiple groups, e.g. the five approval ratings split up by Labour, Conservative and Liberal voters. Then each panel will be sub-divided into its subgroups. This requires a fair bit more work, so we’ll skip it for now.