If you work on a CIS computer, please simply open RStudio from the App Hub.
If you work on your own laptop, you will need recent versions of R and RStudio installed.
R is the actual programming engine. RStudio is simply an editor which is tailored to the use of R.
Both R and RStudio are free software and can be downloaded and installed from the links given in this sentence.
Note that R needs to be installed first, then RStudio.
The full material for this tutorial includes
this HTML document (this is a “knitted” R Notebook);
a raw version of this notebook, ASMLTutorial.Rmd, which you will use for your practical work;
a complete version of this notebook with solutions, also in HTML (to be provided at the end of this meeting);
a Handout (PDF) giving basics of R programming.
using R Notebooks;
workspace handling;
reading in data;
working with vectors, matrices, and data frames;
basic programming devices (such as if…then, for, while, apply, functions);
application to real data sets.
Please create an empty directory named ASMLTutorial somewhere on your computer. This will be our working directory.
Download the file ASMLTutorial.Rmd (use the right mouse button), and place it in the directory that you have just created. Then open it in RStudio by double-clicking on that file. If you had already opened this file without having created the directory, please close it again, create the directory, place the file in there, and open it as described.
We are now continuing our work using this .Rmd file. You can of course still use the html file alongside.
This is an R Notebook. It is based on R Markdown and can be used for parallel typesetting and coding. It is best used in RStudio.
Note: For the use of R Notebooks we will need the packages rmarkdown and knitr. In principle, RStudio should automatically install these if required. If this does not happen on your system for any reason (and you encounter any problems below), you can do this manually, via install.packages(c("rmarkdown", "knitr")).
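If you would like to check first which of the two packages are actually missing, a minimal sketch along the following lines may help. The helper name missing_packages is hypothetical (it is not part of R or RStudio); it relies only on the base R function requireNamespace.

```r
# Hypothetical helper: returns the names of those packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("rmarkdown", "knitr"))
to_install

# Install only what is missing (uncomment to run; needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

Note that the package names are passed as one character vector, built with c().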
There are (basically) three ways of executing (‘rendering’) code from an R Notebook.
x<-3
x
## [1] 3
D<- date()
D
## [1] "Wed Mar 11 13:49:54 2020"
DayofWeek<- substr(D, 1,x) # extracts the first x letters from date object
cat("Today's day of the week is:", DayofWeek)
## Today's day of the week is: Wed
Chunk-wise. You can execute the chunk above by clicking on the green triangle at the top right of the chunk. DO THIS.
Row-wise. Click on any row of the code and then CTRL-ENTER. DO THIS for any of the rows in the chunk above.
You can render (‘knit’) the entire document, which produces a nice and tidy summary of all text, code, and results.
TRY THIS NOW, that is, click on the ‘Knit’ button at the top of this window. You can choose any of PDF, HTML, Word, or Preview as output options. [Note: Preview does not actually execute any chunks; it just shows pre-existing output. My recommendation would be to set this option to Knit to HTML. This will produce a .html file in your working directory that you can open separately.]
You can, of course, also edit this document yourself. Specifically, you can also create your own chunk, by clicking on the Insert icon (towards the right of top menu bar of this window). DO THIS.
The syntax of R Markdown is largely self-explanatory; detailed explanations are available at https://rmarkdown.rstudio.com/authoring_basics.html.
There is no need to save or copy outputs of today’s work into LaTeX, MS Word, etc. This document itself will be able to reproduce all your work done today.
The best way to work through this tutorial is to go chunk-by-chunk.
Every R session uses a working directory. This is some directory on your hard drive (or USB key, etc.) which R will use in order to save images, data, and your workspace (see below). R will also assume by default that any data sets that you attempt to read in are stored in this directory. By default, R Notebooks will use the location of the .Rmd file as working directory.
Check the content of the workspace and the location of the current working directory via
ls()
## [1] "D" "DayofWeek" "x"
getwd()
## [1] "C:/Users/jeinb/OneDrive/Documents/DU/Institute for Data Science/Miscada/ASMLTutorial"
This should return the directory ASMLTutorial that you created above. If this is not right, then the easiest way of fixing this is to close this .Rmd file and start again.
Notes:
In principle, the working directory can be changed via setwd("pathname"). However, this will not work for R Notebooks, as it will only change the directory of the current chunk! More experienced users can follow these instructions to change the working directory for all R chunks.
You can, at any time, save the entire workspace for later use, by using the command save.image("filename"). Let’s do this. Render
save.image("asmltut.RData")
then close RStudio and open it again. Then load the saved workspace back via
load("asmltut.RData")
ls()
## [1] "D" "DayofWeek" "x"
and check whether everything is there! (In our case it should obviously just be x, D and DayofWeek.)
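If you want to see the save-and-restore cycle without closing RStudio, the following sketch runs it on a temporary file. The objects a and b are made up for this illustration, and save is used instead of save.image so that only these two objects are written.

```r
# Use a temporary file, so that nothing in your working directory is touched.
tmp <- tempfile(fileext = ".RData")

a <- 3
b <- "Wed"
save(a, b, file = tmp)   # save() stores selected objects; save.image() stores all of them

rm(a, b)                 # remove both objects from the workspace

restored <- load(tmp)    # load() returns (invisibly) the names of the restored objects
restored
a
b
```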
Important (but rather confusing): RStudio opens a new R session to knit your R Notebook file. That is, even if you have some other objects (for instance from previous work) in the global environment (see top right RStudio window), those objects will not be available when you knit your notebook. To see this, type for instance test<-1 in your R console. You will see that it shows up directly in the Global Environment. But if you create a chunk which refers to test and then knit the notebook, you get an error.
The first data set that we are going to investigate gives the energy use (kg of oil equivalent per capita) for 135 countries from 1960 to 2010.
Energy use is defined as the use of primary energy before transformation to other end-use fuels, which is equal to indigenous production plus imports and stock changes, minus exports and fuels supplied to ships and aircraft engaged in international transport.
Source: World Bank
You can read the data in via
energy.use <-read.csv("http://www.maths.dur.ac.uk/~dma0je/Data/energy.csv", header=TRUE)
Alternatively, you can download the data from the given web address, place them in your working directory, and then call energy.use <-read.csv("energy.csv", header=TRUE).
Check now whether things have gone right. Try
dim(energy.use)
## [1] 135 52
which should give you the dimension \(135 \times 52\). Also try
head(energy.use)
in order to see the first six rows.
The object energy.use is a data frame. You can check whether or not an object is a data frame by typing class(object) or is.data.frame(object). Try this for the object energy.use in the chunk below.
class(energy.use)
## [1] "data.frame"
is.data.frame(energy.use)
## [1] TRUE
It is easy to access individual rows, columns, or elements of a data frame. For instance,
energy.use[127,]
energy.use[,49]
## [1] 693.70 1088.75 605.54 1850.19 925.65 5887.67 3996.85
## [8] 1387.90 11551.42 163.29 2890.85 5366.42 343.50 570.95
## [15] 1483.16 1068.47 1238.99 7189.78 2641.20 358.42 390.89
## [22] 8168.64 1850.79 1484.02 664.57 289.32 356.51 1069.57
## [29] 495.86 2100.54 884.02 2854.25 4427.55 3597.77 804.18
## [36] 884.81 839.94 799.77 150.80 4198.49 289.97 6895.24
## [43] 4257.74 1299.69 767.12 4026.64 415.46 2875.07 620.35
## [50] 285.70 661.40 1984.58 2657.97 15707.75 528.91 848.57
## [57] 2603.95 1104.80 3456.56 3058.87 3000.63 1852.16 4019.07
## [64] 1268.90 4292.25 484.84 774.41 4585.54 9463.13 556.47
## [71] 2051.76 959.29 2889.12 2740.24 8789.71 1482.47 2733.47
## [78] 2119.55 1750.20 909.89 1182.10 459.93 418.39 318.53
## [85] 744.97 337.76 11321.17 4909.32 3966.37 620.91 722.19
## [92] 5703.57 5677.66 512.15 844.66 685.86 493.85 450.64
## [99] 2547.47 2362.76 19504.15 1805.74 4730.04 6202.50 224.75
## [106] 2141.28 5830.54 3306.64 3631.59 2783.77 3207.52 463.97
## [113] 362.95 5511.75 3405.85 977.91 579.72 442.82 1552.58
## [120] 390.13 11505.66 864.22 1369.86 3631.02 2953.00 11832.50
## [127] 3465.18 7758.94 952.79 1811.91 2319.43 655.12 323.85
## [134] 604.36 758.92
energy.use[127,49]
## [1] 3465.18
will give you the 127th row; 49th column; and the entry of the 127th row and the 49th column, respectively (this is the UK energy consumption in 2007). You can also access columns directly through their column names, such as
energy.use$X2007
## [1] 693.70 1088.75 605.54 1850.19 925.65 5887.67 3996.85
## [8] 1387.90 11551.42 163.29 2890.85 5366.42 343.50 570.95
## [15] 1483.16 1068.47 1238.99 7189.78 2641.20 358.42 390.89
## [22] 8168.64 1850.79 1484.02 664.57 289.32 356.51 1069.57
## [29] 495.86 2100.54 884.02 2854.25 4427.55 3597.77 804.18
## [36] 884.81 839.94 799.77 150.80 4198.49 289.97 6895.24
## [43] 4257.74 1299.69 767.12 4026.64 415.46 2875.07 620.35
## [50] 285.70 661.40 1984.58 2657.97 15707.75 528.91 848.57
## [57] 2603.95 1104.80 3456.56 3058.87 3000.63 1852.16 4019.07
## [64] 1268.90 4292.25 484.84 774.41 4585.54 9463.13 556.47
## [71] 2051.76 959.29 2889.12 2740.24 8789.71 1482.47 2733.47
## [78] 2119.55 1750.20 909.89 1182.10 459.93 418.39 318.53
## [85] 744.97 337.76 11321.17 4909.32 3966.37 620.91 722.19
## [92] 5703.57 5677.66 512.15 844.66 685.86 493.85 450.64
## [99] 2547.47 2362.76 19504.15 1805.74 4730.04 6202.50 224.75
## [106] 2141.28 5830.54 3306.64 3631.59 2783.77 3207.52 463.97
## [113] 362.95 5511.75 3405.85 977.91 579.72 442.82 1552.58
## [120] 390.13 11505.66 864.22 1369.86 3631.02 2953.00 11832.50
## [127] 3465.18 7758.94 952.79 1811.91 2319.43 655.12 323.85
## [134] 604.36 758.92
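The X in front of the year in names such as X2007 most likely comes from read.csv, which by default turns column headers into syntactically valid R names. The following sketch illustrates this with a tiny CSV supplied as text; the countries and numbers are invented and are not taken from energy.csv.

```r
# Toy CSV supplied as text; all values are invented for illustration.
csv_text <- "Country,2001,2007
A,100.5,120.0
B,300.0,250.5"

toy <- read.csv(text = csv_text, header = TRUE)
names(toy)               # headers starting with a digit receive the prefix "X"

toy$X2007                # a column, by name
toy[, 3]                 # the same column, by position
toy[2, "X2007"]          # a single entry: row 2, column X2007

# check.names = FALSE keeps the headers unchanged; such columns then
# need quotes or backticks, for instance toy_raw[["2007"]].
toy_raw <- read.csv(text = csv_text, header = TRUE, check.names = FALSE)
names(toy_raw)
```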
Data frames are very important as they are the standard form in which data are expected by many R functions, such as lm, glm, etc.
Let us now simplify the data frame a little bit, so that it is easier to use for the applied work. We reduce our interest to the energy consumption in the years 2001 and 2007. We do this via
energy <- energy.use[,c("X2001", "X2007")]
Also, we would like to give the rows and columns of the new data frame meaningful names. Please type
rownames(energy)<- energy.use[, 1]
colnames(energy)<- c("use01", "use07")
in order to specify row and column names, respectively. Then type energy to look at your final data frame.
This data frame allows you to access information quickly. For instance,
energy["United Kingdom",]
gives you the UK values of energy consumption. DO THIS for a couple of countries.
energy["Spain",]
energy["China",]
One may be interested in looking at these data in a form in which they are ordered by their energy consumption. This can be done using
order(energy$use07)
## [1] 39 10 105 50 26 41 84 133 86 13 27 20 113 120 21 47 83
## [18] 118 98 82 112 66 97 29 94 55 70 14 117 134 3 49 90 132
## [35] 51 25 96 1 91 85 135 45 67 38 35 37 95 56 122 31 36
## [52] 80 5 129 72 116 16 28 2 58 81 17 64 44 123 8 76 15
## [69] 24 119 79 102 130 4 23 62 52 71 30 78 106 131 100 99 57
## [86] 19 53 77 74 110 32 48 73 11 125 61 60 111 108 115 59 127
## [103] 34 124 109 89 7 63 46 40 43 65 33 68 103 88 12 114 93
## [120] 92 107 6 104 42 18 128 22 75 69 87 121 9 126 54 101
which gives you a vector of index numbers. The first number tells you the index (here: 39) of the country with the smallest per-capita energy consumption (here: Eritrea), and typing energy[order(energy$use07),] gives you the full ordered list.
In the chunk below, save this ordered data frame into a new data frame senergy.
senergy <- energy[order(energy$use07),]
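If the output of order looks puzzling, it may help to try it on a very small example first. The vector below is invented for illustration.

```r
use <- c(A = 30, B = 10, C = 20)     # invented values

ord <- order(use)                    # positions of the smallest, second smallest, ... value
ord
use[ord]                             # the values in increasing order
use[order(use, decreasing = TRUE)]   # largest value first
```

The same indexing idea carries over to data frames, where the result of order is placed in the row position, as in energy[order(energy$use07),].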
Next, we wish to identify the nations with extremely large energy consumption, say, more than 10000 kg of oil equivalent per capita. (Intuitively, which countries do you think these will be?) Calling
energy$use07 > 10000
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [122] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE
will give you a vector of logical values, with a TRUE for each country for which this condition is met. The command
sum(energy$use07 > 10000)
## [1] 6
will tell you how many these are, and
which(energy$use07 > 10000)
## [1] 9 54 87 101 121 126
will give you the index numbers of these countries. From this, we would get the data rows corresponding to these countries via
energy[which(energy$use07 > 10000),]
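The steps above can be retraced on a small invented data frame. The sketch also shows that a logical vector can be used directly as a row index, so that which is optional in this situation.

```r
toy <- data.frame(use01 = c(500, 12000, 800),
                  use07 = c(600, 11000, 15000),
                  row.names = c("A", "B", "C"))   # invented values

big <- toy$use07 > 10000     # logical vector, one entry per row
sum(big)                     # TRUE counts as 1, so this counts the matching rows
rownames(toy)[big]           # names of the matching rows

toy[big, ]                   # rows selected by the logical vector ...
toy[which(big), ]            # ... and by the index numbers: the same rows
```

The two forms differ only when the condition contains missing values: which drops them, whereas a logical index containing NA produces a row of NAs.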
We would like to compare the energy use in 2001 and 2007. Do the same as above but now use the condition energy$use01 > energy$use07 instead. Observe and understand the information that you gain at each step.
energy$use01> energy$use07
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [12] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [34] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## [45] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [67] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [89] TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [100] TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [111] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
## [133] FALSE FALSE TRUE
sum(energy$use01> energy$use07)
## [1] 34
which(energy$use01> energy$use07)
## [1] 8 12 21 31 35 39 43 46 49 52 59 61 67 72 73 87 89
## [18] 92 95 96 98 100 101 105 108 113 114 115 116 127 128 130 131 135
energy[which(energy$use01> energy$use07),]
A very useful tool to carry out repeated operations is the for command (see Handout!).
Task: Implement a loop which, for all 135 countries, writes a text like
for (i in 1:135){
cat("In 2007, the energy use in ", rownames(energy)[i], " was equivalent to", energy[i,2], "kg oil per capita.", "\n")
}
## In 2007, the energy use in Albania was equivalent to 693.7 kg oil per capita.
## In 2007, the energy use in Algeria was equivalent to 1088.75 kg oil per capita.
## In 2007, the energy use in Angola was equivalent to 605.54 kg oil per capita.
## In 2007, the energy use in Argentina was equivalent to 1850.19 kg oil per capita.
## In 2007, the energy use in Armenia was equivalent to 925.65 kg oil per capita.
## In 2007, the energy use in Australia was equivalent to 5887.67 kg oil per capita.
## In 2007, the energy use in Austria was equivalent to 3996.85 kg oil per capita.
## In 2007, the energy use in Azerbaijan was equivalent to 1387.9 kg oil per capita.
## In 2007, the energy use in Bahrain was equivalent to 11551.42 kg oil per capita.
## In 2007, the energy use in Bangladesh was equivalent to 163.29 kg oil per capita.
## In 2007, the energy use in Belarus was equivalent to 2890.85 kg oil per capita.
## In 2007, the energy use in Belgium was equivalent to 5366.42 kg oil per capita.
## In 2007, the energy use in Benin was equivalent to 343.5 kg oil per capita.
## In 2007, the energy use in Bolivia was equivalent to 570.95 kg oil per capita.
## In 2007, the energy use in Bosnia and Herzegovina was equivalent to 1483.16 kg oil per capita.
## In 2007, the energy use in Botswana was equivalent to 1068.47 kg oil per capita.
## In 2007, the energy use in Brazil was equivalent to 1238.99 kg oil per capita.
## In 2007, the energy use in Brunei Darussalam was equivalent to 7189.78 kg oil per capita.
## In 2007, the energy use in Bulgaria was equivalent to 2641.2 kg oil per capita.
## In 2007, the energy use in Cambodia was equivalent to 358.42 kg oil per capita.
## In 2007, the energy use in Cameroon was equivalent to 390.89 kg oil per capita.
## In 2007, the energy use in Canada was equivalent to 8168.64 kg oil per capita.
## In 2007, the energy use in Chile was equivalent to 1850.79 kg oil per capita.
## In 2007, the energy use in China was equivalent to 1484.02 kg oil per capita.
## In 2007, the energy use in Colombia was equivalent to 664.57 kg oil per capita.
## In 2007, the energy use in Congo, Dem. Rep. of was equivalent to 289.32 kg oil per capita.
## In 2007, the energy use in Congo, Rep. was equivalent to 356.51 kg oil per capita.
## In 2007, the energy use in Costa Rica was equivalent to 1069.57 kg oil per capita.
## In 2007, the energy use in Cote d'Ivoire was equivalent to 495.86 kg oil per capita.
## In 2007, the energy use in Croatia was equivalent to 2100.54 kg oil per capita.
## In 2007, the energy use in Cuba was equivalent to 884.02 kg oil per capita.
## In 2007, the energy use in Cyprus was equivalent to 2854.25 kg oil per capita.
## In 2007, the energy use in Czech Republic was equivalent to 4427.55 kg oil per capita.
## In 2007, the energy use in Denmark was equivalent to 3597.77 kg oil per capita.
## In 2007, the energy use in Dominican Republic was equivalent to 804.18 kg oil per capita.
## In 2007, the energy use in Ecuador was equivalent to 884.81 kg oil per capita.
## In 2007, the energy use in Egypt, Arab Rep. was equivalent to 839.94 kg oil per capita.
## In 2007, the energy use in El Salvador was equivalent to 799.77 kg oil per capita.
## In 2007, the energy use in Eritrea was equivalent to 150.8 kg oil per capita.
## In 2007, the energy use in Estonia was equivalent to 4198.49 kg oil per capita.
## In 2007, the energy use in Ethiopia was equivalent to 289.97 kg oil per capita.
## In 2007, the energy use in Finland was equivalent to 6895.24 kg oil per capita.
## In 2007, the energy use in France was equivalent to 4257.74 kg oil per capita.
## In 2007, the energy use in Gabon was equivalent to 1299.69 kg oil per capita.
## In 2007, the energy use in Georgia was equivalent to 767.12 kg oil per capita.
## In 2007, the energy use in Germany was equivalent to 4026.64 kg oil per capita.
## In 2007, the energy use in Ghana was equivalent to 415.46 kg oil per capita.
## In 2007, the energy use in Greece was equivalent to 2875.07 kg oil per capita.
## In 2007, the energy use in Guatemala was equivalent to 620.35 kg oil per capita.
## In 2007, the energy use in Haiti was equivalent to 285.7 kg oil per capita.
## In 2007, the energy use in Honduras was equivalent to 661.4 kg oil per capita.
## In 2007, the energy use in Hong Kong SAR, China was equivalent to 1984.58 kg oil per capita.
## In 2007, the energy use in Hungary was equivalent to 2657.97 kg oil per capita.
## In 2007, the energy use in Iceland was equivalent to 15707.75 kg oil per capita.
## In 2007, the energy use in India was equivalent to 528.91 kg oil per capita.
## In 2007, the energy use in Indonesia was equivalent to 848.57 kg oil per capita.
## In 2007, the energy use in Iran, Islamic Rep. of was equivalent to 2603.95 kg oil per capita.
## In 2007, the energy use in Iraq was equivalent to 1104.8 kg oil per capita.
## In 2007, the energy use in Ireland was equivalent to 3456.56 kg oil per capita.
## In 2007, the energy use in Israel was equivalent to 3058.87 kg oil per capita.
## In 2007, the energy use in Italy was equivalent to 3000.63 kg oil per capita.
## In 2007, the energy use in Jamaica was equivalent to 1852.16 kg oil per capita.
## In 2007, the energy use in Japan was equivalent to 4019.07 kg oil per capita.
## In 2007, the energy use in Jordan was equivalent to 1268.9 kg oil per capita.
## In 2007, the energy use in Kazakhstan was equivalent to 4292.25 kg oil per capita.
## In 2007, the energy use in Kenya was equivalent to 484.84 kg oil per capita.
## In 2007, the energy use in Korea, D.P.R. of was equivalent to 774.41 kg oil per capita.
## In 2007, the energy use in Korea, Rep. of was equivalent to 4585.54 kg oil per capita.
## In 2007, the energy use in Kuwait was equivalent to 9463.13 kg oil per capita.
## In 2007, the energy use in Kyrgyz Republic was equivalent to 556.47 kg oil per capita.
## In 2007, the energy use in Latvia was equivalent to 2051.76 kg oil per capita.
## In 2007, the energy use in Lebanon was equivalent to 959.29 kg oil per capita.
## In 2007, the energy use in Libya was equivalent to 2889.12 kg oil per capita.
## In 2007, the energy use in Lithuania was equivalent to 2740.24 kg oil per capita.
## In 2007, the energy use in Luxembourg was equivalent to 8789.71 kg oil per capita.
## In 2007, the energy use in Macedonia, FYR was equivalent to 1482.47 kg oil per capita.
## In 2007, the energy use in Malaysia was equivalent to 2733.47 kg oil per capita.
## In 2007, the energy use in Malta was equivalent to 2119.55 kg oil per capita.
## In 2007, the energy use in Mexico was equivalent to 1750.2 kg oil per capita.
## In 2007, the energy use in Moldova was equivalent to 909.89 kg oil per capita.
## In 2007, the energy use in Mongolia was equivalent to 1182.1 kg oil per capita.
## In 2007, the energy use in Morocco was equivalent to 459.93 kg oil per capita.
## In 2007, the energy use in Mozambique was equivalent to 418.39 kg oil per capita.
## In 2007, the energy use in Myanmar was equivalent to 318.53 kg oil per capita.
## In 2007, the energy use in Namibia was equivalent to 744.97 kg oil per capita.
## In 2007, the energy use in Nepal was equivalent to 337.76 kg oil per capita.
## In 2007, the energy use in Netherlands Antilles was equivalent to 11321.17 kg oil per capita.
## In 2007, the energy use in Netherlands, The was equivalent to 4909.32 kg oil per capita.
## In 2007, the energy use in New Zealand was equivalent to 3966.37 kg oil per capita.
## In 2007, the energy use in Nicaragua was equivalent to 620.91 kg oil per capita.
## In 2007, the energy use in Nigeria was equivalent to 722.19 kg oil per capita.
## In 2007, the energy use in Norway was equivalent to 5703.57 kg oil per capita.
## In 2007, the energy use in Oman was equivalent to 5677.66 kg oil per capita.
## In 2007, the energy use in Pakistan was equivalent to 512.15 kg oil per capita.
## In 2007, the energy use in Panama was equivalent to 844.66 kg oil per capita.
## In 2007, the energy use in Paraguay was equivalent to 685.86 kg oil per capita.
## In 2007, the energy use in Peru was equivalent to 493.85 kg oil per capita.
## In 2007, the energy use in Philippines was equivalent to 450.64 kg oil per capita.
## In 2007, the energy use in Poland was equivalent to 2547.47 kg oil per capita.
## In 2007, the energy use in Portugal was equivalent to 2362.76 kg oil per capita.
## In 2007, the energy use in Qatar was equivalent to 19504.15 kg oil per capita.
## In 2007, the energy use in Romania was equivalent to 1805.74 kg oil per capita.
## In 2007, the energy use in Russian Federation was equivalent to 4730.04 kg oil per capita.
## In 2007, the energy use in Saudi Arabia was equivalent to 6202.5 kg oil per capita.
## In 2007, the energy use in Senegal was equivalent to 224.75 kg oil per capita.
## In 2007, the energy use in Serbia was equivalent to 2141.28 kg oil per capita.
## In 2007, the energy use in Singapore was equivalent to 5830.54 kg oil per capita.
## In 2007, the energy use in Slovak Republic was equivalent to 3306.64 kg oil per capita.
## In 2007, the energy use in Slovenia was equivalent to 3631.59 kg oil per capita.
## In 2007, the energy use in South Africa was equivalent to 2783.77 kg oil per capita.
## In 2007, the energy use in Spain was equivalent to 3207.52 kg oil per capita.
## In 2007, the energy use in Sri Lanka was equivalent to 463.97 kg oil per capita.
## In 2007, the energy use in Sudan was equivalent to 362.95 kg oil per capita.
## In 2007, the energy use in Sweden was equivalent to 5511.75 kg oil per capita.
## In 2007, the energy use in Switzerland was equivalent to 3405.85 kg oil per capita.
## In 2007, the energy use in Syrian Arab Republic was equivalent to 977.91 kg oil per capita.
## In 2007, the energy use in Tajikistan was equivalent to 579.72 kg oil per capita.
## In 2007, the energy use in Tanzania was equivalent to 442.82 kg oil per capita.
## In 2007, the energy use in Thailand was equivalent to 1552.58 kg oil per capita.
## In 2007, the energy use in Togo was equivalent to 390.13 kg oil per capita.
## In 2007, the energy use in Trinidad and Tobago was equivalent to 11505.66 kg oil per capita.
## In 2007, the energy use in Tunisia was equivalent to 864.22 kg oil per capita.
## In 2007, the energy use in Turkey was equivalent to 1369.86 kg oil per capita.
## In 2007, the energy use in Turkmenistan was equivalent to 3631.02 kg oil per capita.
## In 2007, the energy use in Ukraine was equivalent to 2953 kg oil per capita.
## In 2007, the energy use in United Arab Emirates was equivalent to 11832.5 kg oil per capita.
## In 2007, the energy use in United Kingdom was equivalent to 3465.18 kg oil per capita.
## In 2007, the energy use in United States was equivalent to 7758.94 kg oil per capita.
## In 2007, the energy use in Uruguay was equivalent to 952.79 kg oil per capita.
## In 2007, the energy use in Uzbekistan was equivalent to 1811.91 kg oil per capita.
## In 2007, the energy use in Venezuela, R.B. de was equivalent to 2319.43 kg oil per capita.
## In 2007, the energy use in Vietnam was equivalent to 655.12 kg oil per capita.
## In 2007, the energy use in Yemen, Rep. of was equivalent to 323.85 kg oil per capita.
## In 2007, the energy use in Zambia was equivalent to 604.36 kg oil per capita.
## In 2007, the energy use in Zimbabwe was equivalent to 758.92 kg oil per capita.
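As a variation on the loop above, the following sketch avoids hard-coding the number of countries and also shows a version without a loop. It uses a two-row toy data frame whose values are copied from the first two lines of the output above.

```r
toy <- data.frame(use07 = c(693.70, 1088.75),
                  row.names = c("Albania", "Algeria"))

# seq_len(nrow(toy)) adapts to the number of rows, whatever it is.
for (i in seq_len(nrow(toy))) {
  cat("In 2007, the energy use in", rownames(toy)[i],
      "was equivalent to", toy[i, "use07"], "kg oil per capita.\n")
}

# Vectorised alternative: paste() builds all sentences in one go.
sentences <- paste("In 2007, the energy use in", rownames(toy),
                   "was equivalent to", toy$use07, "kg oil per capita.")
writeLines(sentences)
```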
Another command for repeated operations is while. It does not have a fixed number of loops, but keeps going as long as a given condition holds. For instance, consider the ordered frame senergy created above. Assume we are interested in the following question: If we take exactly one person from each of the countries with the smallest energy use, i.e. one person from Eritrea, one person from Bangladesh, etc., then how many persons are needed in order to achieve the same use of energy as a single person in Qatar?
To answer this, create objects i and sum07 and assign them the initial value 0. Then use the while command (see Handout) with condition sum07< senergy["Qatar",2] and action i <- i+1; sum07 <- sum07+ senergy[i,2]. Make it clear to yourself what each row does. Also, interpret the result.
energy["Qatar",]
i <-0
sum07 <-0
while(sum07< senergy["Qatar",2] ){
i <- i+1
sum07<- sum07+ senergy[i,2]
}
i
## [1] 42
sum07
## [1] 20265.34
# So one individual from each of the 41 least-consuming countries together use less energy than one single individual in Qatar!
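The same question can also be answered without a loop, using a running total. The sketch below compares both approaches on invented numbers; the vector use_sorted and the value target are hypothetical stand-ins for the sorted 2007 column and the Qatar value.

```r
use_sorted <- c(150, 160, 220, 290, 330)   # invented values, sorted increasingly
target <- 600                              # invented target value

# Loop version, structured like the chunk above
i <- 0
total <- 0
while (total < target) {
  i <- i + 1
  total <- total + use_sorted[i]
}

# Vectorised version: first position at which the running total reaches the target
running <- cumsum(use_sorted)
i_vec <- which(running >= target)[1]

c(loop = i, vectorised = i_vec)
```

Both versions assume that the target can be reached at all; if the target exceeded the sum of all values, the loop would run past the end of the vector and the vectorised version would return NA.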
Use apply to compute a vector which contains, for each country, the larger of the two energy consumption values given for 2001 and 2007. Consult the Handout and the corresponding help file (via help(apply) or ?apply) if you are unsure how to do this.
apply(energy,1,max)
## Albania Algeria Angola
## 693.70 1088.75 605.54
## Argentina Armenia Australia
## 1850.19 925.65 5887.67
## Austria Azerbaijan Bahrain
## 3996.85 1404.86 11551.42
## Bangladesh Belarus Belgium
## 163.29 2890.85 5672.21
## Benin Bolivia Bosnia and Herzegovina
## 343.50 570.95 1483.16
## Botswana Brazil Brunei Darussalam
## 1068.47 1238.99 7189.78
## Bulgaria Cambodia Cameroon
## 2641.20 358.42 393.04
## Canada Chile China
## 8168.64 1850.79 1484.02
## Colombia Congo, Dem. Rep. of Congo, Rep.
## 664.57 289.32 356.51
## Costa Rica Cote d'Ivoire Croatia
## 1069.57 495.86 2100.54
## Cuba Cyprus Czech Republic
## 1000.48 2854.25 4427.55
## Denmark Dominican Republic Ecuador
## 3597.77 861.84 884.81
## Egypt, Arab Rep. El Salvador Eritrea
## 839.94 799.77 194.64
## Estonia Ethiopia Finland
## 4198.49 289.97 6895.24
## France Gabon Georgia
## 4413.42 1299.69 767.12
## Germany Ghana Greece
## 4219.15 415.46 2875.07
## Guatemala Haiti Honduras
## 631.31 285.70 661.40
## Hong Kong SAR, China Hungary Iceland
## 2053.23 2657.97 15707.75
## India Indonesia Iran, Islamic Rep. of
## 528.91 848.57 2603.95
## Iraq Ireland Israel
## 1104.80 3737.54 3058.87
## Italy Jamaica Japan
## 3006.39 1852.16 4019.07
## Jordan Kazakhstan Kenya
## 1268.90 4292.25 484.84
## Korea, D.P.R. of Korea, Rep. of Kuwait
## 889.59 4585.54 9463.13
## Kyrgyz Republic Latvia Lebanon
## 556.47 2051.76 1384.04
## Libya Lithuania Luxembourg
## 3113.33 2740.24 8789.71
## Macedonia, FYR Malaysia Malta
## 1482.47 2733.47 2119.55
## Mexico Moldova Mongolia
## 1750.20 909.89 1182.10
## Morocco Mozambique Myanmar
## 459.93 418.39 318.53
## Namibia Nepal Netherlands Antilles
## 744.97 337.76 11431.36
## Netherlands, The New Zealand Nicaragua
## 4909.32 4366.96 620.91
## Nigeria Norway Oman
## 722.19 5772.44 5677.66
## Pakistan Panama Paraguay
## 512.15 921.83 717.88
## Peru Philippines Poland
## 493.85 496.47 2547.47
## Portugal Qatar Romania
## 2410.88 19794.22 1805.74
## Russian Federation Saudi Arabia Senegal
## 4730.04 6202.50 255.00
## Serbia Singapore Slovak Republic
## 2141.28 5830.54 3456.65
## Slovenia South Africa Spain
## 3631.59 2783.77 3207.52
## Sri Lanka Sudan Sweden
## 463.97 392.13 5682.03
## Switzerland Syrian Arab Republic Tajikistan
## 3597.74 996.48 579.72
## Tanzania Thailand Togo
## 442.82 1552.58 390.13
## Trinidad and Tobago Tunisia Turkey
## 11505.66 864.22 1369.86
## Turkmenistan Ukraine United Arab Emirates
## 3631.02 2953.00 11832.50
## United Kingdom United States Uruguay
## 3804.08 7855.12 952.79
## Uzbekistan Venezuela, R.B. de Vietnam
## 2030.76 2336.16 655.12
## Yemen, Rep. of Zambia Zimbabwe
## 323.85 604.36 776.57
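To see what the second argument of apply does, it can help to try both settings on a small invented data frame. The sketch also shows pmax, a base R function that computes the element-wise maximum of several vectors and gives the same values as the row-wise apply call.

```r
toy <- data.frame(use01 = c(500, 12000, 800),
                  use07 = c(600, 11000, 15000),
                  row.names = c("A", "B", "C"))   # invented values

row_max <- apply(toy, 1, max)            # second argument 1: one maximum per row
col_max <- apply(toy, 2, max)            # second argument 2: one maximum per column
par_max <- pmax(toy$use01, toy$use07)    # element-wise maximum, without names

row_max
col_max
all.equal(unname(row_max), par_max)
```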
Use hist and boxplot to create histograms and boxplots of the variables use01 and use07. Comment on the distributional shape.
boxplot(energy$use01, energy$use07)
par(mfrow=c(2,1))
hist(energy$use01)
hist(energy$use07)
Next, add logarithmic versions of these variables, say luse01 and luse07, to the data frame via
energy$luse01<- log(energy$use01)
and for use07 analogously. Repeat the previous question using the transformed variables. What can we say about the distribution of these transformed variables, compared to the original ones?
energy$luse07 <- log(energy$use07)
boxplot(energy$luse01, energy$luse07)
par(mfrow=c(2,1))
hist(energy$luse01)
hist(energy$luse07)
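The effect of the logarithm on a right-skewed variable can also be seen numerically. The following sketch uses artificial data that are evenly spaced on the log scale (they are not the energy data) and compares mean and median before and after the transformation.

```r
# Artificial data: evenly spaced on the log scale, hence right-skewed on the original scale.
x <- exp(seq(5, 10, length.out = 135))

c(mean = mean(x), median = median(x))            # mean far above the median: right skew
c(mean = mean(log(x)), median = median(log(x)))  # practically equal: symmetric shape

# Histograms side by side, with titles
par(mfrow = c(1, 2))
hist(x, main = "Original scale")
hist(log(x), main = "Log scale")
par(mfrow = c(1, 1))   # reset the plotting layout
```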
Next, we consider a data set featuring \(n=82\) observations of galaxy velocities. Load the galaxies data, read the associated help file, and create a histogram using the option breaks=18 in function hist.
data(galaxies, package="MASS")
help("galaxies", package = "MASS")   # MASS is not attached, so plain ?galaxies finds no documentation
hist(galaxies, breaks=18)
For both data sets, the dominating feature is the presence of multiple modes or ‘clusters’. Identifying such clusters, and finding the corresponding cluster centers, is a relevant problem in Statistics and Machine Learning. A simple method to do this is the k-means algorithm. See for instance this resource for a quick introduction to this algorithm.
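To make the idea concrete, here is a minimal sketch of the basic iteration (often called Lloyd's algorithm) for one-dimensional data: assign each point to its nearest center, recompute each center as the mean of its points, and repeat until nothing changes. The function name simple_kmeans and its arguments are hypothetical and chosen for illustration only; R's own implementation uses a more refined variant by default, so this is not a description of its internals.

```r
# Illustrative 1-D k-means (Lloyd's algorithm); not R's built-in implementation.
simple_kmeans <- function(x, centers, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    # Step 1: distance of every point to every center, then pick the nearest
    d <- abs(outer(x, centers, "-"))
    cluster <- apply(d, 1, which.min)
    # Step 2: move each center to the mean of the points assigned to it
    new_centers <- vapply(seq_along(centers), function(k) {
      if (any(cluster == k)) mean(x[cluster == k]) else centers[k]
    }, numeric(1))
    # Stop once the centers no longer change
    if (isTRUE(all.equal(new_centers, centers))) break
    centers <- new_centers
  }
  list(centers = centers, cluster = cluster)
}

x <- c(1, 2, 3, 101, 102, 103)               # invented data with two obvious groups
fit <- simple_kmeans(x, centers = c(1, 2))   # deliberately poor starting centers
fit$centers
fit$cluster
```

Even with both starting centers placed in the same group, the iteration separates the two groups here; in general the result can depend on the starting values.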
In R, this algorithm is implemented in the function kmeans. The algorithm requires the number of clusters to be specified in advance, through the argument centers. Study the help file of kmeans and then apply this function to the luse01 and galaxies data. You will need only the first two arguments of kmeans. For the choice of the number of clusters, you can use visual inspection as a guide.
?kmeans
## starting httpd help server ... done
kmeans(energy$luse01, centers=2)
## K-means clustering with 2 clusters of sizes 65, 70
##
## Cluster means:
## [,1]
## 1 8.200233
## 2 6.351430
##
## Clustering vector:
## [1] 2 2 2 1 2 1 1 2 1 2 1 1 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 1 2 1 1 1 2
## [36] 2 2 2 2 1 2 1 1 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2 1 1 1 1 1 2 1 2 2 1 1 2
## [71] 1 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 1 2 2 1 1 2 2 2 2 2 1 1 1 1 1 1 2
## [106] 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 2 2 1 1 1 1 1 2 1 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 20.86167 18.08677
## (between_SS / total_SS = 74.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
kmeans(galaxies, centers=5)
## K-means clustering with 5 clusters of sizes 7, 3, 29, 17, 26
##
## Cluster means:
## [,1]
## 1 9710.143
## 2 33044.333
## 3 23614.276
## 4 18828.765
## 5 20611.654
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5
## [36] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [71] 3 3 3 3 3 3 3 3 3 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 1249607 2548691 45056732 18801667 12136904
## (between_SS / total_SS = 95.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Have a look at the produced output and try to understand and interpret it in the light of the graphical representations of the data presented earlier.
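When interpreting the output, it is useful to know that kmeans starts from randomly chosen centers, so the cluster numbering (and occasionally the solution itself) can change from run to run. The sketch below, on invented data, fixes the random seed, tries several random starts through the argument nstart, and extracts some of the components listed under ‘Available components’.

```r
x <- c(1, 2, 3, 101, 102, 103)   # invented data with two obvious groups

set.seed(1)                      # makes the random starting centers reproducible
fit <- kmeans(x, centers = 2, nstart = 10)   # nstart: number of random starts to try

sort(as.numeric(fit$centers))    # the two cluster centers
fit$size                         # number of observations in each cluster
fit$cluster                      # cluster label of each observation (labels are arbitrary)
fit$betweenss / fit$totss        # the ratio reported as between_SS / total_SS
```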
This was an example of the application of a very simple clustering technique, in the one-dimensional (univariate) case.
In one of the lectures of ASML2, we will pick up from here and consider a more elaborate clustering method (mixture models).
Finally, note that clustering is an ‘unsupervised’ learning technique, since the algorithm needs to make decisions without having seen the true allocation of samples to clusters/classes, not even for training samples. Most of the material dealt with in ASML will be set in the world of supervised learning, where training samples with true and known class labels (or output values) are available.
If you would like to do a bit more to improve your R skills, we recommend the following resources:
ASML1: Ian H. Jermyn, i.h.jermyn@durham.ac.uk
ASML2: Jochen Einbeck, jochen.einbeck@durham.ac.uk
ASML3: Louis Aslett, louis.aslett@durham.ac.uk