#
# PRACTICAL 1-2: Sampling Distributions
# ---------------------------------
#
#
# In this session we perform simple simulation exercises in order to investigate the properties
# of the sample mean introduced in the lectures.
# 
# Today we will learn R techniques beyond those seen in the first year (if you were a Statistics I
# student), including the use of loops and writing your own functions. 
  

# You will need the following skills from previous practicals:
#   *   Basic R skills with arithmetic, functions, etc
#   *   Manipulating and creating vectors: `c`, `seq`, 
#   *   Calculating data summaries: `mean`, `sd`, `var`, `min`, `max`
#   *   Plotting a scatterplot with `plot`, and a histogram with `hist`
#
# New R techniques:
#   *   Sampling with and without replacement from a vector using `sample`
#   *   Creating new functions with `function`
#   *   Creating traditional `for` loops to iterate calculations
#   *   Using `sapply` to repeat calculations over a vector of values
#   *   Using `par(mfrow=c(a,b))` to setup a grid of plots
#   *   Adding straight lines to plots with `abline`

#
#
# 1. Sampling without replacement
# ==================================================================================
#
# We'll be using the `hospital` data set from the `durham` library in our sampling
# experiments. So, before we begin you'll need to load the data:


#
# Exercise:
# ~~~~~~~~~~~~~
# * Refer to the worksheet from last time and either:
#    * Load the `durham` library and the `hospital` data set.
#    * If you have problems loading the library, follow the link on the webpage to create
#      the data set without the library
# * Create a new vector called `disch` (or similar) containing the data for the number of 
#   discharges in the `hospital` data.


# Recall that our general approach to sampling is to sample without replacement, so we don't
# repeatedly observe the same individual from the population. We can use `R`'s `sample` function
# to take random samples from a specified population vector, and we can do so either with or 
# without replacement.

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ The `sample` function allows us to take a sample of a specified size from a vector of values
# ^ which is treated as the population. 
# ^ 
# ^ If we have a population vector `pop` and we want to take a sample of size `n` without replacement,
# ^ then we use the following command:
# ^   
# ^ `sample(pop, n)`
# ^ 
# ^ If we want to sample _with_ replacement, then we can add the additional optional argument
# ^ `replace=TRUE`.
# ^ 
# ^ Finally, the default is for all of the elements of the population to have equal probability. This
# ^ can be changed to specified probability weights using the `prob` argument, but we won't be looking
# ^ into that any further. See the help file (by running `?sample`) for more details on the `sample`
# ^ function.
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


#
# Exercise:
# ~~~~~~~~~~~~~
# * Draw a sample (without replacement) of size n=4 from the discharges data.
# * Draw another sample of size n=4  - do you get the same sample values?
# * Let's experiment with changing the sample size. For sample sizes n_1=4, n_2=16, and n_3=64
#   generate a random sample, and compute its sample `mean`. Do this in three lines, one for each
#   sample size.
# * Compare the three means - what would you expect to see?


# We will be performing a repeated sampling experiment where we repeat the sampling many times
# to generate different samples each time. Then we can use the many samples to produce many 
# different sample means, which we can explore to check their behaviour against what we would
# expect from the theory seen in lectures.

#
#
# 2. Functions and Loops
# ==================================================================================
#
#
# To simplify the repeated sampling calculation, we will use a loop to repeat the calculations
# and we will write our own function to take care of the calculations we wish to perform during 
# each iteration.


#
# 2.1 Functions
# ----------------------------------------------------------------------------------
# 
# As with other programming languages (such as Python), we can combine multiple commands in `R` 
# into our own custom functions that perform more complicated tasks or calculations. Throughout
# the year, we will be writing our own functions to solve specific problems and today we will learn
# the basic syntax of how to create a function.


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ When creating a new function, it needs to have a name, probably at least one argument 
# ^ (although it doesn’t have to), and a body of code that does something. At the end it usually
# ^ should (although doesn’t have to) return a value or object out of the function. 
# ^ 
# ^ The general syntax for writing your own function is
# ^ 
# ^    name.of.function <- function(arg1, arg2, arg3=2) {
# ^       # function code to do some useful stuff
# ^       return(something) # return value 
# ^    }
# ^ 
# ^  + `name.of.function`: is the function’s name. This can be any valid variable name, but you 
# ^     should avoid using names that are used elsewhere in R, such as `mean`, `function`, 
# ^     `plot`, etc.
# ^  + `arg1`, `arg2`, `arg3`: these are the arguments of the function. You can write a 
# ^     function with any number of arguments, or none at all. These can be any _R_ object: numbers,
# ^     strings, arrays, data frames, or even other functions. Essentially, this is the list of 
# ^     everything that is needed for the `name.of.function` function to run.
# ^     Some arguments have default values specified, such as `arg3` in our example, which is set to
# ^     `2` unless otherwise specified. Arguments without a default must have a value supplied for 
# ^     the function to run. You do not need to provide a value for those arguments with a default 
# ^     as they are considered as optional, and when omitted the function will simply use the default
# ^     value in its definition.
# ^  + Function body: The function code between the within the `{}` brackets is run every time the
# ^    function is called. Note that unlike Python where the code _inside_ the function is _indented_,
# ^    with `R` the code inside the function must be enclosed in curly braces `{}`. This code might
# ^    be very long or very short. Ideally functions are short and do just one thing – problems are 
# ^    rarely too small to benefit from some abstraction. Sometimes a large function is unavoidable,
# ^    but usually these can be in turn constructed from a number of small functions. 
# ^  + Return value: The last line of the code is the value that will be returned by the function. 
# ^    It is not necessary that a function return anything, for example a function that makes a plot
# ^    might not return anything (and so either state `return()` or omitting the `return` statement
# ^    entirely), whereas a function that does a mathematical operation might return a number, or a 
# ^    vector.
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


# For example, we can write a function to compute the sum of squares of two numbers as

sum.of.squares <- function(x,y) {
  return(x^2 + y^2)
}

# and we can then evaluate

sum.of.squares(3,4)


#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Create a function called `MyFun` that takes a single argument `x` and returns the 
#   value of 1 + x + 0.1 x^2.
# * Use your function to evaluate 1+x+0.1x^2 when x=0.
# * Create a vector of the integers from 0 to 100, and evaluate 1+x+0.1x^2 for x=0,...,100. 
#   You should do this in one line without resorting to any loops or iterative methods, or 
#   temporary variables.
#   _Hint:_ See the previous practical for how to create vectors of integer sequences, and how
#           to do arithmetic with vectors.
# * Create another function called `MyFunPlotter`, which takes a single argument `x`, and which
#   draws a scatter`plot` of the points `(x, MyFun(x))`. Evalate your function for x=0,...,100.
#   _Hint:_ See the previous practical for how to draw scatterplots, and use your function from 
#           the previous part of the question.


#
# 2.2 Simple Loops
# ----------------------------------------------------------------------------------
# 
# 
# A "loop"" is a way to repeat a sequence of instructions under certain conditions. They allow
# you to automate parts of your code that are in need of repetition. We will look at two ways 
# of creating loops in our code: 
# 
# * Traditional loops which execute for a prescribed number of times, as controlled by a counter
#   or an index, incremented at each iteration cycle are represented as `for` loops in _R_. You will
#   probably have seen these before with Python.
# * Loops that take a function and apply that function to every element of a vector (or other array)
#   are represented by the various `apply` functions in _R_.


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ In _R_ a "for"  loop takes the following general form:
# ^ 
# ^    for (variable in sequence) { 
# ^      ## code to repeat goes here
# ^    }
# ^ 
# ^ where `variable` is a name given to the iteration variable and which takes each possible value
# ^ in the vector `sequence` at each pass through the loop.
# ^ 
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# Here is a quick trivial example, printing the square root of the integers one to ten:

for (x in 1:10) {
  print(sqrt(x))
}

#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Sometimes big loops aren't necessary! Write a single line of code that computes the square root 
#   of the integers 1 to 10 without using `for`. 
# * Use a `for` loop to:
#    * Print the values of the integers $1$ to $5$.
#    * Print the squares of the integers from $10$ to $3$.
#    * Print your first name 10 times.


# While `for` loops are helpful when we need to explictly repeat a block of code, sometimes we
# just want to apply the same function (or calculations) to all of the elements of a vector. We
# can use `for` to do this, but _R_ provides a function which does exactly that - `sapply`. 


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ `sapply` applies a specified function to every element of a vector and returns a vector formed
# ^ from the results.
# ^ 
# ^ sapply(X, FUN)
# ^ 
# ^ applies the function `FUN` to every element of the vector `X` it then returns a vector containing
# ^ the values of `FUN(X[1])`, `FUN(X[2])`, and so on. 
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# So, to replicate the square-root example above using `sapply`, we would write

sapply(1:10, FUN=sqrt)

# or alternatively

sapply(1:10, FUN=function(i){sqrt(i)})

# which applies the square root function to each of the integers 1 to 10. Unlike a loop, `sapply`
# automatically returns its results as a vector without us having to write code for that (our 
# `for` loop above only printed the values). 

# There are other variations of `apply` which work with matrices and other data structures, and we 
# will see those later.

# If we combine this technique of looping with the ability to write our own functions, then we have
# a very flexible way of re-writing a standard loop in a vectorised way. In general, using the 
# `sapply` function is to be preferred to a `for` loop particularly when we want to keep the results
# of the calculations from each iteration. However, `for` loops are still useful and more natural in
# certain cases (where we do not want the output values, or where each iteration has a dependency on
# the calculations at the previous step).


#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Use `sapply` to re-write your for-loops. You should write your answers in one line of code for
#   each part:
#    * Return the values of the integers 1 to 5.
#    * Return the squares of the integers from 10 to 3.
#    * Return a vector of your first name 10 times.


#
#
# 3. The Sampling Experiment
# ==================================================================================
#

#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Make a vector of the means of 30 different samples of size 4 from the hospital discharges
#   vector using a combination of `sapply` and the functions `sample` and `mean`.
# * Create a new function with argument `n` that returns a vector of the means of 1000 samples,
#   each of size `n`, from the discharges data. 
#   _Hint:_ This is a one line function that places a version of the command used in the previous
#   question within the function.
# * Using your new function, create three vectors containing the means for 1000 samples of sizes
#   n_1=4, n_2=16 and n_3=64 from the discharges data (giving each a name).


# We now have samples of size 1000 from each of three sampling distributions of Xbar, for
# when n=4, n=16, n=64. As our samples of Xbar values are quite large in size, we might
# expect that each of these distributions is quite similar to the distribution of the underlying 
# theoretical population of sample means.

#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Find the mean and standard deviation for each collection of sample means. Do the results
#   agree with what you would expect?
# * Draw a `boxplot` comparing each of the three samples on the same scale. (_Hint_: see the 
#   previous practical for how to draw a boxplot.)


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ _R_ makes it easy to combine multiple plots into one overall graph, using either the `par` or
# ^ `layout` functions.
# ^ 
# ^ With the `par` function, we specify the argument `mfrow=c(nr, nc)` to split the plot window 
# ^ into a grid of nr x nc plots that are filled in by row. For example, to divide the plot window 
# ^ into a 2x2 grid we call
# ^ 
# ^ par(mfrow=c(2,2))
# ^ 
# ^ This has no immediate effect, but the next four plots drawn (using `plot`, `hist`, `boxplot`, etc)
# ^ will be drawn in the 2x2 grid.
# ^ 
# ^ To revert to a single plot layout, we ask for a 1x1 grid via `par(mfrow=c(1,1))`.
# ^ 
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Use the `par` function with the `mfrow` argument to configure to plot window for a column of
#   three vertical plots.
#   _Note:_ this function will have no effect until you start drawing new plots.
# * Plot the samples of $\bar{X}$ as separate histograms, one on top of the other. Make sure the
#   horizontal axes are the same. 
#   _Hint:_ use the `xlim` argument to specify the horizontal axis limits, and `range` will determine
#   the min and max of a vector.


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ It is often useful to add simple straight lines to lines to plots, which can be achieved
# ^ using the `abline` function. `abline` can be used in three different ways:
# ^   
# ^ + Draw a horizontal line: pass a value to the `h` argument, `abline(h=3)` draws a horizontal
# ^   line at y=3
# ^ 
# ^ + Draw a vertical line: pass a value to the `v` argument, `abline(v=5)` draws a vertical 
# ^   line at x=5
# ^ 
# ^ + Draw a line with given intercept and slope: pass value to the `a` and `b` arguments
# ^   representing the intercept and slope respectively; `abline(a=1,b=2)` draws the line y=1+2x
# ^   
# ^ `abline` can be customised using any of the usual colour and line modifications using colour
# ^ (`col`), and line types and widths (`lty`, and `lwd`) - refer to the Advanced Plots help page
# ^ http://www.maths.dur.ac.uk/stats/people/jac/sc2-practicals/r_8_advplots.htm
# ^   
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Redraw your histograms, this time adding a vertical line showing the population mean.


# According to theory seen in lectures, the sample mean is an unbiased estimator of the 
# population mean, mu, and has variance denoted sigma_Xbar. For Normal data or large 
# sample sizes, the sampling distribution is either exactly or approximately Normal with the above
# mean and variance.


#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * What do you observe about the three distributions? Does the picture correspond to the theory?
# * What do you notice about the mean and standard deviation of your samples relative to the 
#   population for the three different sample sizes.


#
#
# 3. The variance of the sample mean
# ==================================================================================
#

# We can use `R` to investigate the properties of the distribution of the sample mean more
# closely. The theory tells us that the variance of the sample mean, Xbar, for a sample
# of size n from a _finite_ population of size N is given by:
# 
#                                  (N − n)   sigma^2
#       Var[Xbar] = sigma^2_Xbar = -------   -------  .
#                                  (N - 1)      n
#
#  where sigma^2 is the population variance of the original data, X. An unbiased estimator 
# of this standard error is the estimated standard error, given by
#
#                  (     n )  s^2
#       S^2_Xbar = ( 1 - - )  ---   .
#                  (     N )   n
#
# where $s^2$ is the sample variance.

#
# Exercise:
# ~~~~~~~~~~~~~
# 
# * Calculate the true variance of the sample mean, sigma^2_Xbar for the samples of 
#   size 4, 16 and 64.
# * Write a new function with argument `n` that returns a vector of 1000 realisations 
#   of S^2_Xbar for samples of `n` from the discharges data. 
# * Evaluate your function for n=4, n=16, and n=64 and store the result to three variables.
# * Display each of these three samples as histograms, one on top of the other in the same
#   plot. Make sure the horizontal axes are the same.
# * Draw a vertical line on each one showing the location of the true variance of the 
#   sample mean in each case.
# * Is the result of this experiment consistent with the theory? Why?