#
# PRACTICAL 1-3: Inference for the Sample Mean
# ---------------------------------
#
#
# In this practical, we will investigate the behaviour and distributions of summary statistics
# calculated from samples of normally-distributed data. To do this, we will use R to simulate large
# numbers of Normal random samples, and from each of which we will compute the sample mean and
# variance. By repeating this experiment for a large number of such samples, we will accumulate a
# large number of sample means and variances which we will investigate and compare to the results
# we would expect to find from the theory in lectures.
  
# You will need the following skills from previous practicals:
#   *   Basic R skills with arithmetic, functions, etc
#   *   Manipulating and creating vectors: `c`, `seq`, 
#   *   Calculating data summaries: `mean`, `sd`, `var`, `min`, `max`
#   *   Plotting a scatterplot with `plot`, and a histogram with `hist`
# New R techniques:
#   *   Random number generation with `rnorm`, `runif`, etc
#   *   Evaluation of standard density functions via `dnorm`, `dunif`, etc
#   *   Creating a matrix with `matrix`
#   *   Applying a function to every row/column of a matrix with `apply`
#   *   Drawing a curve on an existing plot using `lines`

#
#
# 1. Normal Sampling Theory
# ==================================================================================
#
# -----------------------------------------------------------------------------------------
# 1.1 The Theory
# -----------------------------------------------------------------------------------------
# First, let's recap (or introduce) a little theory. If you haven't seen this yet in lectures, you
# will soon (Lecture 4). Consider a sample of n Normally-distributed data points: X_1, ... ,X_n 
# i.i.d random variables from N(μ, σ^2/n). We can say:
#
# * The sample mean Xbar has distribution:
#
#          n
#  Xbar =  ∑ X_i ∼ N(μ,σ^2/n).
#         i=1
#
# * The sample variance S^2 is independent of Xbar, and is such that
#
#  (n−1) 
#  ----- S^2 ∼ χ^2_{n−1}.
#   σ^2
#
# * If we standardise Xbar using the sample sd, S, instead of the population sd, σ, we don't get a
# Normal distribution -  we get the following t distribution instead: 
#
#     (Xbar − μ)  
# T = ----------  ∼ t_{n−1}
#        S/√n


# ========================================================================================
#
# 2. Generating the samples
#
#
# In order to illustrate this theory, we require a large number of samples from normal distributions that will act as our data in the following calculations. To generate the random numbers from a normal distribution, we will use the `rnorm` function.
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ We can use the `rnorm` function to generate random numbers from a normal distribution. To 
# ^ generate `n` random normal variates from a Normal distribution with mean `m` and standard 
# ^ deviation `s`, we use the command:
# ^  
# ^  rnorm(n, mean=m, sd=s)
# ^
# ^ The `mean` and `sd` arguments are optional, and default to 0 and 1 respectively if missing.
# ^
# ^ For more details, see the section on random number generation in "R Help 5: Basic Stats".
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                 
#
# Exercise 2.1:
# ~~~~~~~~~~~~~
#  * Use the `rnorm` function to try generating:
#
#    a. 5 random standard Normal numbers
#
#    b. 20 random numbers from N(5, 5^2)
#
#  * Check the mean and standard deviation of your sample using the `mean` and `sd` functions - are the values what you expected?


# Exercise 2.2:
# ~~~~~~~~~~~~~
#
#  * Now we're ready to construct our data set. Using `rnorm`, generate a vector of 500 random
#    standard normal values, and save it to a variable `zs`. Check the mean and standard 
#    deviation are consistent with the population distribution.


# Exercise 2.3:
# ~~~~~~~~~~~~~
#
# Let’s change the mean and variance of our sample to make things (fractionally) more interesting.
# Lets suppose we want our data to come from a normal population with mean 17 and standard 
# deviation 5.
#
#  * Recall that if Z ~ N(0,1), then (a+bZ) ~ N(a,b^2) - use this result to apply the appropriate
#    transformation to `zs` to correspond with the new population mean and variance, and save the
#    result to a new variable `xs`.
#
#  * Check the mean and standard deviation of xs to ensure your transformation worked as intended.
#
#  * Hint: you can apply standard arithmetic to vectors in R.


# Next, we want to treat our 500 X values as if they were 100 different random samples, each of
# size 5. To do this, we will transform our vector of length 500 into a `matrix` with 100 rows 
# and 5 columns. This way, each of the rows in the matrix corresponds to a different random 
# sample of size 5.

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ A matrix is created using the `matrix` function, which takes three arguments: the values of 
# ^ the matrix (`data`), the number of rows (`nrow`), and the number of columns (`ncol`).
# ^ 
# ^ To produce a 3 x 3 matrix containing the integers 1 to 9, we would write:
# ^ 
# ^   matrix(1:9, nrow = 3, ncol=3)
# ^ 
# ^ For more details, see "R Help 6: Matrices".
# ^ 
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# Exercise 2.4:
# ~~~~~~~~~~~~~
#
#
# * Use the `matrix` function to create a  100 x 5 matrix from the values in `xs`. 
# * Save the result as `xMatrix`.


# ========================================================================================
#
# 3. Investigating Xbar
#
#
# To investigate the behaviour of Xbar, we need some data. From the previous step, we have 
# effectively generated 100 Normal samples each of size 5, each represented by a row in 
# `xMatrix`. What we want to do now to calculate the sample mean for each of these samples 
# (i.e every row). This will give us a collection of 100 observed sample mean values, xbar, 
# drawn from the distribution of Xbar which we can investigate more closely.

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ We can use the `apply` function to evaluate the same function for either every row or every
# ^ column of a given matrix (or data frame). This is particularly useful for computing summary
# ^ statistics (means, standard deviations, etc) for all variables in a data set.
# ^ 
# ^ The `apply` function takes three arguments: 
# ^  *  `X` - the matrix or data frame to use for calculations
# ^  *  `MARGIN` - which indicates whether to evaluate over the rows (`MARGIN=1`) or 
# ^                columns (`MARGIN=2`) of the matrix
# ^  *  `FUN` - the name of the function we wish to evaluate.
# ^ 
# ^ For example, to calculate the minimum value in each column of a matrix `A`, we could write
# ^ 
# ^   apply(A, MARGIN=2, FUN=min)
# ^ 
# ^ For more details, read the entry in "R Help 9: Programming" on `apply`.
# ^                                          
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


#
# Exercise 3.1:
# ~~~~~~~~~~~~~
#
#  * Use the `apply` function to compute the sample `mean` for every row of `xMatrix`. 
#    Save the results as `xbar`.
#
#  * Calculate the mean and variance of `xbar` - do they agree with the values that we would 
#    expect from the theoretical distribution of Xbar given in section 1.1 above?


# Now we'll use the `hist` function to plot a histogram of the `xbar` values from our sample, 
# and we'll overlay the Normal density curve we'd expect from the theory on top of the 
# histogram using `lines`.

#
# Exercise 3.2:
# ~~~~~~~~~~~~~
#  * If you're unfamiliar with plotting histograms with `hist`, quickly read up  on the `hist`
#    function in "R Help 7: Plots".
#
#  * Use `hist` to draw a histogram of `xbar` setting `breaks=15` to ensure we show enough detail
#    in the bars in the plot, and set `freq=FALSE` to plot the vertical axis as frequency density
#    instead of frequency.


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ 
# ^ TECHNIQUE: 
# ^ 
# ^ R provides a range of functions to support calculations with standard probability distributions.
# ^ To save creating our own functions to evaluate the pdfs and cdfs, we can use these standard 
# ^ functions instead. 
# ^ 
# ^ For every distribution there are four functions. The functions for each distribution begin
# ^ with a particular letter to indicate the functionality (see table below) followed by the name
# ^ of the distribution:
# ^ 
# ^ | Letter | e.g.    | Function                                                          |
# ^ |--------|---------|-------------------------------------------------------------------|
# ^ | "d"    | `dnorm` | evaluates the probability density (or mass) function, f(x)        |
# ^ | "p"    | `pnorm` | evaluates the cumulative density function, F(x)=P[X <= x], hence  |
# ^ |        |         | finds the probability the specified random variable is less than  |
# ^ |        |         | the given argument.                                               |
# ^ | "q"    | `qnorm` | evaluates the inverse cumulative density function (quantiles),    |
# ^ |        |         | F^{-1}(q) i.e. given a probability q it returns the value x such  |
# ^ |        |         | that P[X <= x] = q. Used to obtain critical values associated     |
# ^ |        |         | with particular probabilities q.                                  |
# ^ | "r"    | `rnorm` | generates random numbers                                          |
# ^   
# ^   We've already seen one example with `rnorm` to generate random Normal values. The appropriate
# ^ functions for Normal, t and chi^2 distributions are given below, along with the optional
# ^ l parameter arguments.
# ^ 
# ^     + Normal distribution: `dnorm`, `pnorm`, `qnorm`, `rnorm`.
# ^                            Parameters: `mean` (mu) and `sd` (sigma).
# ^     + t distribution: `dt`, `pt`, `qt`, `rt`. 
# ^                       Parameter: `df`.
# ^     + chi^2 distribution: `dchisq`, `pchisq`, `qchisq`, `rchisq`. 
# ^                       Parameter: `df`.
# ^ 
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


#
# Exercise 3.3:
# ~~~~~~~~~~~~~
#
# To produce the Normal density curve, we need to evaluate the pdf of N(μ, σ^2/n) over the 
# plotting range of the histogram:
#
# * First, construct a sequence of values (using `seq`) which starts at the minimum of `xbar` 
#   and goes up to the maximum value of `xbar` and has length `100`. 
#   Save this vector as `seqx`.
#   Hint: You'll need the `seq`, `min`, and `max` functions - see "R Help 4 Vectors" for help
#         if needed.
# * We can use `dnorm` to evaluate the normal density function at the values `seqx` and save
#   these values as `ndens`. Try this now.
#   Make sure you supply the correct value of μ to the `mean` argument, and σ/√n to the `sd`
#   argument, otherwise your pdf won't match the histogram. 


# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE: 
# ^ 
# ^ We can use R's line drawing function `lines` to add an (approximate) curve to a plot. 
# ^ To draw a function f(x) on an existing plot, we evaluate the function f at a sequence of x 
# ^ points and then use `lines` to draw connected line segments between each (x,f(x)) pair to 
# ^ form the curve.
# ^ 
# ^ For example, if we save the x values in a vector `xs` and the f(x) values in a vector `fs`
# ^ then the command to add the curve to the plot is:
# ^ 
# ^   lines(x=xs, y=fs)
# ^ 
# ^ Like other plotting functions, `lines` can be modified by adding argumets to specify 
# ^ different line types, widths, and colours. See the "R Help 8: Advanced plots" page for 
# ^ detail on how to customise your curve.
# ^ 
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


#
# Exercise 3.4:
# ~~~~~~~~~~~~~
#  * Use the `lines` function to add a solid blue curve between the points `x=seqx` and 
#    `y=ndens` to your histogram. 
#
#  * _Note:_ if your density curve goes off the top of your plot, then try redrawing the 
#    histogram but set the limits of the vertical axis by passing the `ylim=c(a,b)` argument 
#    to the `hist` function where `a` and `b` are your desired upper and lower limits of the
#    y axis. Use a lower limit of `0` and choose the upper limit to be large enough to see the
#    entire curve.
#
#  * How does the distribution of our sample of xbar values compare (in terms of shape, location,
#    and spread) with the theoretical distribution of Xbar? Does it agree with the theory?
#


# ========================================================================================
#
# 4. Investigating S^2 and T
#
#
# Now lets look move on to investigate the sample variances and standardised values. If our data
# are normal, then a multiple of the sample variance (n−1) S^2 / σ^2 should follow a χ2 
# distribution with n−1 degrees of freedom. Let’s compute the sample variances for our 100 
# samples and investigate this using the techniques above.
#
#
#
# Exercise 4.1:
# ~~~~~~~~~~~~~~
#
# * Apply the `var` function to each of the samples in `xMatrix` to create a vector of 100
#   sample variances, s^2. 
#   Call these `svar`.
# * The theory says that the random quantity (n−1)S^2/σ^2 has a χ2_(n-1) distribution. Re-scale
#   the sample variances by multiplying `svar` by (n−1)/σ^2, and call this `svarScl` (for 'scaled
#   sample variance').
# * Calculate the mean and variance of `svarScl` - do they agree with the mean and variance
#   of the appropriate χ2_(n-1) distribution? 


#
# Exercise 4.2:
# ~~~~~~~~~~~~~~
#
# * Go back to your code from Exercises 3.2-3.4 to plot the histogram of $\bar{x}$ with a 
#   normal density curve on top. 
#   Adjust the code to instead plot a histogram of s^2 and overlay the density function for
#   a χ2_n−1 distribution using the `dchisq` function.
# * Do the samples appear consistent with the theoretical distribution?


#
# Exercise 4.3:
# ~~~~~~~~~~~~~~
#
# * According to the theory, Xbar and S^2 should be independent. Use `plot` to produce a 
#   scatterplot of xbar against s^2.
#   Do they appear to be independent? What features of the plot support this conclusion?


# The final theoretical result concerns the distribution of the sample means when standardised 
# by the sample variance, 
#
# T = (Xbar−μ)/(s/√n).
#
# We know that if we use the population variance, σ^2, in this calculation instead of s^2, then
# the standardised values are N(0,1). However, as this expression uses the sample variance the
# distribution is no longer normal, and is t-distributed instead to account for the extra 
# uncetainty we have introduced by using an estimate of σ^2.


#
# Exercise 4.4:
# ~~~~~~~~~~~~~~
# 
# * Using your values of `xbar` and `svar`, create a new vector `tvals` with elements equal to
#   (xbar−μ)/(s/√n) for each sample of size 5.
# * Using the techniques above, investigate the behaviour and distribution of `tvals` (inspect
#   the mean and variance, construct a histogram, overlay the appropriate density curve).


# ========================================================================================
#
# 5. Probability Integral Transform
#
# All of the calculations we performed in the preceding section required us to be able to
# generate random numbers from a particular distribuion, in this case the Normal. We will 
# see (or have seen) in Lecture 2 that we can generate random numbers from (almost) any 
# distribution by simply applying an appropriate transformation to simple standard uniform 
# random numbers.
#
# The theory we require is known as the probability integral transform, and goes as follows. 
# Let X be a continuous random quantity with cdf F_X(x), and let U be a standard uniform 
# random quantity, U ∼ U[0,1]. Then 
#
#  1. the random quantity Y=F_X(X) has a standard uniform distribution U[0,1];
#
#  2. the random quantity W=F^{−1}_X(U) has cdf equal to F_X, and thus W has the same 
#     distribution as X.
#
# It is the second part of this theorem that is of relevance to generating random numbers from
# specific distributions. So long as we can find F^{−1}_X for the distribution we want, we can
# generate uniform random numbers, u, and apply the transformation F^{−1}_X(u) to produce random
# numbers from another distribution which have the distribution of X. This process is known as
# inverse sampling:
#  
# For example, let’s consider the exponential distribution with ‘rate’ λ>0 which has c.d.f.
#
#  F(y) = 1 − exp(−λy)
#
# for y ≥ 0 and 0 elsewhere. 
#
# If we set U=F(y), then we can find the inverse CDF as (check this result!)
#
#  Y = −log(1−U)/λ.
#
# Then by the theorem above, if U ~ U[0,1] the random quantity Y must have an exponential 
# distribution  with rate λ.
#
# Let’s investigate this further by verifing the result numerically:

#
# Exercise 5.1:
# ~~~~~~~~~~~~~
# 
# * Generate a sample of 1000 random numbers from the uniform distribution on [0,1] as the
#   vector `us`. You'll need the `runif` function.
#
# * Use the result above to transform the uniform sample into an exponential distribution with
#   rate λ (pick your own value of λ>0). Call these `ys`.
#
# * Plot a histogram of `ys`` and overlay the density function for the exponential distribution
#   using your value of λ - you’ll need to use the `dexp`` function to evaluate the exponential
#   density.
#


#
# Exercise 5.2:
# ~~~~~~~~~~~~~
# 
# * R has defined the inverse CDF functions for most standard distributions, these are the 
#   ‘quantile’ functions that  begin with `q` for each standarad distribution, e.g. `qnorm`,
#    `qchisq`, etc. 
#
# * Generate a random sample of size 1000 from a Normal distribution with zero mean and unit
#   variance using only the two functions `runif` and `qnorm`. Check the distribution of your 
#   random numbers in the same way as above.


# =========================================================================================
#
# 6. Further exercises
#
# If you finish the exercises above, have a look at the online worksheet for additional problems
# on these topics.
#
#