#
# PRACTICAL 1-1: Introduction to R
# ---------------------------------
#
#
# This practical will introduce you to working with `R` as a statistical programming and analysis
# language, as well as going through the basics of how to use the `RStudio` editor, which is a
# convenient and helpful interface to `R`.
#
# If you've studied Statistics I, then some of this material will be familiar and just revision.
# Feel free to skip over sections that you already know.
#
# New R techniques:
#
# * Basic R skills with arithmetic, functions, etc.
# * Manipulating and creating vectors: `c`, `seq`
# * Calculating data summaries: `mean`, `sd`, `var`, `min`, `max`
# * Plotting a scatterplot with `plot`, a histogram with `hist`, and a boxplot with `boxplot`
# * Customising plots with labels, titles, colour, etc.
#
#
# 1. Getting Started
# ==================================================================================
#
# _R_ is an open source programming language for statistical computing and graphics that is widely
# used among statisticians for data analysis. As _R_ is a programming language and not a program
# itself (like Excel), we will need some programming skills, as even the simplest data analysis
# requires writing and evaluating small fragments of code. To make life easier, we will be using
# _RStudio_ as an interface to write, edit, and evaluate our _R_ code.
#
# 1.1 RStudio
# ----------------------------------------------------------------------------------
# Both _R_ and RStudio are available on the University network. You will find RStudio on the AppHub
# on your Desktop.
#
# Open the AppHub, search for "RStudio", and launch it. It may take a little while to load fully the
# first time, so be patient.
#
# When the program starts, a new window will open. Before we explain how RStudio works, you should
# download the script file for the practical. If you're reading this, then well done!
# You'll now see that the default RStudio environment is divided into four panes, with the default
# arrangement illustrated below. Going anticlockwise, the four panes are: Code; Console;
# Plots, Help, and Files; and Workspace and History.
#
# CODE EDITOR -- the main editor for your _R_ code. This should contain the practical script file
# that you have just downloaded and opened. This pane is hidden at first, but it is displayed when
# you load or start a new _R_ script file. R code can be entered here, and the editor provides
# support such as auto-complete using [Tab], colour highlighting, and additional buttons and menu
# items to help edit and evaluate your code. In particular, a single line of code (or any selected
# block of code) can be evaluated in one go by pressing [Ctrl]+[Enter], and the whole file can be
# evaluated with [Ctrl]+[Shift]+[S].
#
# R CONSOLE -- this pane is where your code from the Editor is evaluated by _R_. You can also use
# the console to execute quick calculations that you don't need to save.
# Commands entered in the *Console* tab are executed immediately by _R_, and the results are
# displayed on the following line. In this way, _R_ can be used as a simple calculator by typing
# directly into the *Console* window. However, for more complex calculations with many steps it is
# preferable to write the code in a script file using the *Code* pane first, and then evaluate it
# in the *Console*.
# Note: when in the *Console* pane, you can use the [up arrow] and [down arrow] keys to navigate
# through previous commands (e.g. to correct mistakes).
#
# PLOTS, HELP, and FILES -- this pane has multiple roles, indicated by the tabs along its top. The
# *Plots* tab will show the results of any plots you produce in _R_. The *Help* tab is where
# RStudio will display any help files. The *Files* tab provides a simple file viewer to quickly
# navigate between and open files.
#
# WORKSPACE and HISTORY -- this pane has two functions, also indicated by tabs.
# The *Workspace* (or Environment) tab lists all the variables you currently have available in
# this session of R, along with their types and values. The *History* tab shows a list of all the
# _R_ commands you have evaluated in the console.
#
# For more detailed help on the RStudio programming environment, see the RStudio cheatsheet at
# https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf
#
# 1.2 First Steps
# ----------------------------------------------------------------------------------
# When we just want to do small and quick calculations, we can type R commands directly into the
# console window. These commands will be evaluated immediately and the answer returned on the
# next line.
#
# Let's try some simple arithmetic:
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ At the `>` prompt in the console you can type numerical expressions as you would into a
# ^ calculator, press [Enter], and `R` will print the answer.
# ^
# ^ _R_ can be used as a calculator to perform the usual simple arithmetic operations. The
# ^ operators are:
# ^
# ^   +     addition
# ^   -     subtraction
# ^   *     multiplication
# ^   /     division
# ^   ^     raise to the power (alternatively `**` also works)
# ^   %%    modulus, e.g. 5 %% 3 is 2
# ^   %/%   integer division, e.g. 5 %/% 3 is 1
# ^
# ^ Many standard mathematical functions are also available:
# ^
# ^   abs(x)                  - the absolute value of x
# ^   sqrt(x)                 - the square root of x
# ^   log(x)                  - the natural logarithm of x (use `log10` for base-10)
# ^   exp(x)                  - the exponential of x, i.e. e^x
# ^   sin(x), cos(x), tan(x)  - the sine, cosine, and tangent of x
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
# Experiment using the R Console to evaluate simple arithmetic expressions.
# For example, find the sum and difference of 45.23 and 3.59, and the square root of 7.
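#
# As a minimal sketch of the operators and functions listed in the TECHNIQUE box, the lines below
# can be run one at a time in the console. The numbers are arbitrary illustrations and are not
# taken from the exercise.

```r
17 + 4       # addition: 21
17 - 4       # subtraction: 13
17 * 4       # multiplication: 68
17 / 4       # division: 4.25
2^10         # raise to the power: 1024
17 %% 4      # modulus (remainder after division): 1
17 %/% 4     # integer division: 4
abs(-3.5)    # absolute value: 3.5
sqrt(16)     # square root: 4
log10(100)   # base-10 logarithm: 2
exp(0)       # e^0: 1
```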
#
# When you're comfortable with how simple commands work, proceed to the next question.
#
# When we want to save the code for our calculations, or the calculation is long and requires many
# steps, it's better to write the code in a script file in the Editor pane. Then, when the code is
# ready, we can evaluate it either line-by-line, or all in one go.
#
# Exercise:
# ~~~~~~~~~~~~~
# * Use the gap below to write the `R` command to find the sum from i=1 to 5 of i^i.

# * There are many ways to evaluate the code. Rather than re-typing in the console, or
#   copy/pasting the code, `RStudio` gives us some shortcuts.
# * First, position the text cursor on the line with the code and:
#     * Click the `Run` button at the top of the Editor window to execute the current line
#     * Move the cursor back to the line of code and press `[Ctrl]+[Enter]`
#
# In both cases, the line of code is copied to the console and evaluated, and the cursor advances
# to the next line of executable code. Pressing [Ctrl]+[Enter] repeatedly will step through your
# code one line at a time, which is useful for finding bugs. Alternatively, to run multiple lines
# at once, simply highlight the lines you want to run and click `Run` or press [Ctrl]+[Enter]. We
# can also save and evaluate an entire script file by pressing [Ctrl]+[Shift]+[S]. This is useful
# when you've finished a long calculation and don't want to step through it line-by-line.
#
#
# 2. Variables and Vectors
# ==================================================================================
# During an R session we work with and create variables. Variables can be scalars, vectors,
# matrices, functions, or lists. In order to create a new object we use the assignment operator
# `<-`, although one can often use `=` instead. For example, to create the object `a` with a value
# of 2 we can type:
a <- 2
# or
a = 2
# R has many variables and functions available by default, and many, many more that can be
# downloaded using special libraries.
# For each of these standard objects R provides online help, which can be displayed in the Help
# window. To access this help, type `?` followed immediately by the name of the object in question.
#
# Exercise:
# ~~~~~~~~~~~~~
# * R stores the value of pi (3.141593) in a constant called `pi`.
# * Display the value of pi by typing and evaluating `pi` in the console.
# * Use `?` to bring up the help for `pi` (you don't need to read it).
#
# Vectors are the most basic type of variable in R, and all numerical variables are created as
# vectors - even scalars are vectors of length 1. Formally, `R` stores a numerical vector as an
# ordered list of numbers, and `R` contains many linear algebra tools that allow us to perform
# basic vector arithmetic (one of the many reasons why `R` is preferred to Excel).
#
# The object you have just created, `a`, is a vector of length 1. When you type `a` and press
# `[Enter]` you will see the following output (without the `#`):
# [1] 2
# The [1] is the index of the vector element that starts the current line of output. In this
# example it is unnecessary, as we only have a vector of length one. However, if we consider a
# bigger vector, such as the integers from 1 to 50:
#  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
# [47] 47 48 49 50
# we can see that the vector covers multiple lines: the second line begins with element 24 of the
# vector, and the final line begins at element 47.
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ The most basic way to create a vector is to use the `c` function to '`c`ombine' several values
# ^ of the same type into a vector. For example,
# ^
# ^     x <- c(1,2,3)
# ^
# ^ _R_ provides some simple functions for quickly creating numerical vectors.
# ^
# ^ We can use the colon `:` operator to create integer sequences between two values and return
# ^ the result as a vector:
# ^
# ^     y <- 1:10
# ^
# ^ Vectors can also be combined together with the `c` function:
# ^
# ^     z <- c(x,y)
# ^
# ^ The `seq` function also generates a vector containing a sequence between its two arguments
# ^ `from` and `to`, but is more sophisticated than `:`. We can dictate the length of the sequence
# ^ by supplying the optional `length` argument (short for `length.out`), or the step size in the
# ^ sequence by passing a value to the `by` argument. If we supply neither `length` nor `by`, then
# ^ `seq` gives an integer sequence like `:`.
# ^
# ^     y <- seq(1,9)           ## same as 1:9
# ^
# ^     seq(1,10,length=25)     ## sequence of given length
# ^
# ^     seq(15,45,by=3)         ## sequence of given step
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
# Using the techniques above:
#
# * Create the vector (2, 4, 6) in `R` and call it `v1`.
# * Create the same vector using the object `a` that you defined before, and without typing the
#   numbers 4 or 6. _Hint_: multiplication also works for vectors.
# * Create a vector containing all of the integers from 1 to 100.
# * Create a vector containing the sequence of odd numbers between 0 and 100.
# * Create a vector containing your forename, middle names, and surname as strings.
#   _Hint_: A string is simply some text surrounded by quotation marks, "text" or 'text'.
#
# Assigning names to the objects you create is very important so that they can be re-used. To edit
# or repeat previous commands in the console, we can simply press the [up arrow] and [down arrow]
# keys to navigate through previous commands (e.g. to correct mistakes). However, it is often much
# easier to use the Editor, rather than the console, to debug any problems.
#
# 3. The `durham` library
# ==================================================================================
#
#
# Most real data will have to be imported into `R` from a suitable data file, or from a library or
# package. In these practicals, we will work with data sets stored in the `durham` library. To
# access our datasets, we need to do some one-off setup. Run the following line of code:
source("T:/MATHS/R/Rprofile.R")
# Note: You will only need to run this code once, and you will not need to do this if you are
# running R on your own computer.
#
# Now, to load the `durham` library, we use the `library` function and specify the library we want:
library(durham)
# There are many datasets in this library - to see the names of them all, type:
data(package = 'durham')
# For the rest of this practical, we’ll be working with the dataset called `hospital`. To load a
# particular data set from a package that we have already loaded with `library`, we use the `data`
# function:
data(hospital)
# You should now view the data (by typing `hospital`) and have a look at the online help using `?`
# as before to learn about this data set.
#
# 4. Data frames
# ==================================================================================
#
# A _data frame_ is a two-dimensional table of data (like a matrix) where each column contains
# values of one variable, and each row contains the values of every variable for one observation.
# Each column is labelled with a variable name. The values within a column must all be of the same
# type, but different columns can hold different types of data, according to the variable each one
# represents. The data set `hospital` that we have just loaded is a data frame.
#
# The rows and columns of data frames can be extracted as vectors, allowing us to easily perform
# calculations of summary statistics, for example. In the `hospital` data, we see that the data
# contains 3 columns, each with a different name.
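#
# To make the structure of a data frame concrete, here is a minimal sketch that builds a small one
# by hand with `data.frame`. The object `wards` and its columns `ward`, `capacity`, and `staff`
# are hypothetical and are not part of the `durham` library; they only illustrate that each
# column holds one variable of a single type, while different columns can have different types.

```r
# Hypothetical toy data: three observations (rows) of three variables (columns)
wards <- data.frame(
  ward     = c("A", "B", "C"),   # a text column
  capacity = c(12, 30, 18),      # a numeric column
  staff    = c(5, 11, 7)         # another numeric column
)

wards          # print the whole table
nrow(wards)    # number of observations (rows)
ncol(wards)    # number of variables (columns)
str(wards)     # compact summary showing the type of each column
```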
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ We can extract individual columns from a data frame by using the variable name and the
# ^ dollar-sign `$`. For example, to extract the `beds` column from `hospital` we would type:
# ^
# ^     hospital$beds
# ^
# ^ We can extract a vector containing all of the column names for a given data frame using the
# ^ `names` function:
# ^
# ^     names(hospital)
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
#
# * Use the `names` function to extract the names of the variables in the hospital data.
# * Create new vectors called `beds` and `discs` containing the data for the number of beds and
#   the number of discharges in the hospital data, respectively.
# * Compute the mean, standard deviation and range of the discharges data.
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ When a vector represents numerical data, there are a number of standard functions that will be
# ^ useful for any statistical calculations:
# ^
# ^ `sum`            - sums all values in the vector
# ^ `mean`           - computes the sample mean, i.e. 1/n \sum_{i=1}^n x_i
# ^ `median`         - computes the sample median value
# ^ `min` and `max`  - compute the sample minimum and maximum
# ^ `range`          - returns the sample minimum and maximum together as a vector of length 2;
# ^                    the difference between them can be found with `diff(range(x))`
# ^ `var` and `sd`   - compute the _sample_ variance and standard deviation, i.e.
# ^                    s^2 = 1/(n-1) \sum_{i=1}^n (x_i-xbar)^2
# ^ `quantile`       - computes the min, max, median, and lower and upper quartiles. Other
# ^                    quantiles can be computed using the `probs` argument
# ^ `summary`        - calculates the min, max, mean, median, and quartiles.
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# 5. Simple plots
# ==================================================================================
#
# One of `R`’s greatest strengths is the facility it provides for producing attractive graphics.
# In the following tasks we will use some of the key plotting tools to study the `hospital` data.
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ The `plot` function produces a scatterplot of its two arguments. Suppose we have saved our x
# ^ coordinates in a vector `a`, and our y coordinates in a vector `b`. Then, to draw a
# ^ scatterplot of (x,y), we type
# ^
# ^     plot(x=a, y=b)
# ^
# ^ If the argument labels `x` and `y` are not supplied, _R_ will assume the first argument is `x`
# ^ and the second is `y`. If only one vector of data is supplied, this will be taken as the y
# ^ value and will be plotted against the integers `1:length(y)`, i.e. in the sequence in which
# ^ the values appear in the data.
# ^
# ^ All of the standard plot functions can be customised by passing additional arguments to the
# ^ function. For instance, we can add a plot title and axis labels by supplying optional
# ^ arguments:
# ^
# ^ `main` - provides a title to display at the top of the plot
# ^ `xlab` - provides a label for the horizontal axis
# ^ `ylab` - provides a label for the vertical axis
# ^
# ^ For example,
# ^
# ^     plot(x=a, y=b, xlab='A', ylab='B', main='Plot of B vs A')
# ^
# ^ *Note*: Once a plot has been drawn, it is not possible to erase any features from it; we can
# ^ only add extra lines or points to it. So, if you make a mistake drawing your plot then you'll
# ^ need to start over with a fresh one by calling `plot` again.
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
#
# * Produce a scatterplot of discharges (vertically) against beds (horizontally) using `plot`.
#   The scatterplot will appear in the 'Plots' sub-window.
# * Redraw the plot, labelling the axes `Beds` and `Discharges`.
# * What do you notice about the relationship?
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ A histogram (https://en.wikipedia.org/wiki/Histogram) consists of parallel vertical bars that
# ^ graphically show the frequency distribution of a quantitative variable. The area of each bar
# ^ is proportional to the frequency of items found in each class. To plot a histogram, we use the
# ^ `hist` function and apply it to a single vector of data:
# ^
# ^     hist(x)
# ^
# ^ As with `plot`, we can use `main` and `xlab` to set the plot title and horizontal axis label.
# ^
# ^ `hist` also takes a number of arguments specific to the plotting of histograms:
# ^
# ^ `breaks` - allows us to control the number of bars in the histogram. If `breaks` is set to a
# ^            single number, this suggests the number of bars in the histogram; `R` interprets
# ^            this number as a suggestion only. If `breaks` is set to a vector, the values will
# ^            be used as the endpoints of the bars of the histogram.
# ^
# ^ `freq`   - if `TRUE` the histogram shows the simple frequencies or counts within each bar; if
# ^            `FALSE` then the histogram shows probability densities rather than counts.
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
#
# * Produce a histogram of `discharges` from the `hospital` data.
# * Produce a histogram of the `beds` data. What do you notice about the two variables?
# * Increase the number of bars in the discharges histogram.
#   Does this change your perception of the distribution of the discharges data? How about if you
#   decrease the number of bars?
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data
# ^ set. A boxplot is a bit like a histogram seen from above. The full name is 'box and whisker'
# ^ plot. The 'box' shows the middle 50% of the observations, from the first to the third
# ^ quartile. Inside the box, the line shows the median (which is the second quartile). The
# ^ 'whiskers' show the range of the data, although by default `R` excludes outliers (points more
# ^ than 1.5 times the interquartile range beyond the box) from the whiskers and draws them as
# ^ individual points.
# ^
# ^ To draw a boxplot of a single variable we simply pass the data directly to the `boxplot`
# ^ function:
# ^
# ^     boxplot(x)
# ^
# ^ To draw multiple boxplots within the same plot, we can pass multiple vectors separated by
# ^ commas as arguments (e.g. `boxplot(x,y,z)`). We can even pass an entire data frame to the
# ^ `boxplot` function, and it will draw boxplots of all the columns within a single plot.
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
#
# * Draw a boxplot of the discharges data. Compare this to the histogram of discharges.
# * Draw a boxplot of the beds and discharges data on the same y-axis using the `boxplot` function
#   only once.
#
# 6. Customising plots
# ==================================================================================
#
# When performing a statistical analysis it is important that you make your graphics as
# informative as possible. This will aid you, as the statistician, in spotting relationships and
# performing the analysis, and will aid those to whom the analysis is going to be communicated.
# Often this involves adding simple things like colour or tweaking some of the default settings
# of the plotting region. In other cases we might plot things together to facilitate easy
# comparison. We’ll now explore some simple ways of improving the plots we constructed earlier.
#
# Exercise:
# ~~~~~~~~~~~~~
#
# * Read the linked help pages below and experiment with customising your plots.
# * Read the page on using colour in plots and experiment with redrawing your scatterplot using
#   colour for the points.
#   http://www.maths.dur.ac.uk/stats/people/jac/sc2-practicals/r_8_advplots.html#col
# * Now try changing the plot symbols to something that you think looks best.
#   http://www.maths.dur.ac.uk/stats/people/jac/sc2-practicals/r_8_advplots.html#pch
# * Read up about using the `par` function to combine multiple plots.
#   http://www.maths.dur.ac.uk/stats/people/jac/sc2-practicals/r_7_plots.html#par-mfrow
# * Divide the plotting region into two panes (one on top of the other).
# * Now redraw the histogram of discharges above the histogram of beds from the `hospital` data.
#   Does this make comparison of the distributions easier?
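#
# A minimal sketch of these kinds of customisation, using simulated data rather than the
# `hospital` data. The vectors `u` and `v`, and the particular colour and plot symbol chosen, are
# assumptions made purely for illustration; the `col`, `pch`, and `par(mfrow = ...)` settings are
# standard base `R` graphics options.

```r
set.seed(1)                            # make the simulated data reproducible
u <- rnorm(100, mean = 50, sd = 10)    # hypothetical first variable
v <- u + rnorm(100, mean = 0, sd = 5)  # hypothetical second variable, related to u

# Scatterplot with coloured points (col) and a filled-circle plot symbol (pch = 16)
plot(x = u, y = v, col = "blue", pch = 16,
     xlab = "u", ylab = "v", main = "Plot of v vs u")

# Split the plotting region into 2 rows and 1 column, then draw one histogram in each pane.
# `par` returns the previous settings, which we keep so that they can be restored afterwards.
old_par <- par(mfrow = c(2, 1))
hist(u, main = "Histogram of u", xlab = "u")
hist(v, main = "Histogram of v", xlab = "v")
par(old_par)                           # restore the original single-pane layout
```

# Saving and restoring the old `par` settings means that later plots are drawn one per window
# again, rather than continuing to fill the two-pane layout.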