In this practical, we will use R to fit linear regression models by the method of ordinary least squares (OLS). We will make use of R's lm function to simplify the process of model fitting and parameter estimation, and use information from its output to construct confidence intervals. Finally, we will assess the quality of the regression and use residual diagnostic plots to check the validity of the regression assumptions.
This practical makes use of the following R functions:

- summary statistics with mean, sd, var, and sum
- loading packages and data with library and data
- scatterplots with plot and straight lines with abline
- customising plots with colour (col), line type (lty), etc.
- fitting linear models with lm
- extracting coefficients (coef), residuals (resid), and fitted values (fitted)
- model summaries with summary
Regression analysis is a statistical area concerned with the estimation of the relationship between two or more variables. In the simple linear regression model we have a dependent or response variable \(Y\), and an independent or predictor variable \(X\). Given \(n\) pairs of observations \(\{y_i, x_i\}\), \(i = 1, \dots, n\), we can write the simple linear regression model as \[y_i = {\beta_0}+ \beta_1 x_i + \varepsilon_i, ~\ \ i=1,\dots,n\] where \({\beta_0}\) and \(\beta_1\) are the unknown regression parameters (i.e. the intercept and slope of the trend line), and the \(\varepsilon_i\) are the unobserved random regression errors around the trend line.
In linear regression, we make the following assumptions about the errors \(\varepsilon_i\): they have mean zero and common variance \(\sigma^2\), they are independent of one another, and (for the inference later in this practical) they are Normally distributed.
We fit the regression model by least squares, which seeks the values of \({\beta_0}\) and \(\beta_1\) which minimise the sum of the squared errors between our observed \(y_i\) and the regression line \({\beta_0}+\beta_1 x_i\), i.e. \[ Q(\beta_0,\beta_1) = \sum_{i=1}^n (y_i-{\beta_0}-\beta_1 x_i)^2. \] The minimising values are the OLS estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\): \[ \begin{align*} \widehat{\beta}_0 &=\bar{y}- \widehat{\beta}_1\bar{x}, & \widehat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}. \end{align*} \]
The faithful data set

The dataset we will be using in this practical concerns the behaviour of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. For a natural phenomenon, the Old Faithful geyser is highly predictable and has erupted every 44 to 125 minutes since 2000. There are two variables in the data set:

- eruptions – the duration of the eruption, in minutes
- waiting – the waiting time until the next eruption, in minutes
Before we start fitting any models, the first step should always be to perform some exploratory analysis of the data, to gain insight into the behaviour of the variables and to suggest any interesting relationships or hypotheses to explore further.
Load the faithful data set with data(faithful). Create a vector w that contains the waiting times, and a second vector d that contains the durations of eruptions, then produce a scatterplot of the waiting times against the eruption durations.

data(faithful)
w <- faithful$waiting
d <- faithful$eruptions
plot(y = w, x = d)
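As an optional extra (our own suggestion rather than part of the exercise), the plot arguments mentioned above can be used to add axis labels and colour; the label text and colour here are arbitrary choices:

plot(x = d, y = w,
     xlab = "Eruption duration (minutes)",   # axis labels are our own wording
     ylab = "Waiting time (minutes)",
     col = "blue")                           # point colour is an arbitrary choice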
The model we’re interested in fitting to these data is a simple linear relationship between the waiting times, \(w\), and the eruption durations, \(d\): \[ w = {\beta_0}+ \beta_1 d + \varepsilon.\]
Looking at the scatterplot, try guessing some values for the intercept \({\beta_0}\) and slope \(\beta_1\) that fit the data, and use the abline function to draw the corresponding lines on your plot. Save your best guess as guess.b0 and guess.b1.

As mentioned above, the OLS method minimises the sum of the squared residuals, which are the differences between the vector of observations, \(w_i\), and the fitted values, \(\widehat{w}_i\). Before we move on, let’s compute this sum-of-squares of errors value using your best guess at the coefficients (one way to do this is sketched after the list).

- Create a vector guess.what containing the fitted values \(\widehat{w}_i=\widehat{\beta}_0+\widehat{\beta}_1 d_i\) for your regression line.
- Create a vector guess.resid containing the residuals \(e_i=w_i -\widehat{w}_i\).
- Compute the error sum of squares \(Q(\)guess.b0, guess.b1\()\), and save the value as guess.Q - we’ll compare it to the least-squares fit later.
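For illustration, here is a minimal sketch of these calculations; the numerical guesses below are placeholders rather than suggested answers, and the vectors w and d are those created earlier:

guess.b0 <- 30   # placeholder guess for the intercept - replace with your own
guess.b1 <- 10   # placeholder guess for the slope - replace with your own
guess.what  <- guess.b0 + guess.b1 * d   # fitted values under the guessed line
guess.resid <- w - guess.what            # residuals e_i = w_i - fitted value
guess.Q     <- sum(guess.resid^2)        # error sum of squares Q(guess.b0, guess.b1)
guess.Q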
We have access to all of the data and we know the formulae for the least squares estimates of the coefficients, so we can calculate the best-fitting values directly (one possible implementation is sketched after the list).

- Use mean to find \(\bar{w}\) and \(\bar{d}\), and save them as wbar and dbar.
- Compute the sums of squares \(S_{dd}=\sum_{i=1}^n(d_i-\bar{d})^2\) and \(S_{dw}=\sum_{i=1}^n(d_i-\bar{d})(w_i-\bar{w})\), and save them as sdd and sdw. Hint: the sum function will be useful.
- Use these to compute the least-squares estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), and save them as beta0hat and beta1hat.
- Use the abline function to add the least-squares regression as an additional line to your plot. Use colour (the col argument) to distinguish it from your previous guesses.
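A minimal sketch of these calculations, assuming the vectors w and d from earlier (the line colour is an arbitrary choice):

wbar <- mean(w)                        # mean waiting time
dbar <- mean(d)                        # mean eruption duration
sdd  <- sum((d - dbar)^2)              # S_dd
sdw  <- sum((d - dbar) * (w - wbar))   # S_dw
beta1hat <- sdw / sdd                  # slope estimate S_dw / S_dd
beta0hat <- wbar - beta1hat * dbar     # intercept estimate
abline(a = beta0hat, b = beta1hat, col = "red")   # add the least-squares line to the plot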
Fitting linear models with lm

It would be rather tiresome to have to do these calculations explicitly for every regression problem, so R provides the lm function to compute the least-squares linear regression for us. It returns an object which summarises all aspects of the regression fit.
In R, we can fit linear models by the method of least squares using the function lm
. Suppose our response variable is \(Y\), and we have a predictor variable \(X\), with observations stored in R vectors y
and x
respectively. To fit \(y\) as a linear function of \(x\), we use the following lm
command:
model <- lm(y ~ x)
Alternatively, if we have a data frame called dataset
with variables in columns a
and b,
then we could fit the linear regression of a
on b
without having to extract the columns into vectors by specifying the data
argument:
model <- lm(a ~ b, data=dataset)
The ~
(tilde) symbol in this expression should be read as “is modelled as”. Note that R always automatically includes a constant (intercept) term, so we only need to specify the \(X\) and \(Y\) variables.
We can inspect the fitted regression object returned from lm
to get a simple summary of the estimated model parameters. To do this, simply evaluate the variable which contains the information returned from lm
:
model
For more information, see Linear regression.
- Use lm to fit the waiting times w as a linear function of the eruption durations d, saving the fitted model as model (as sketched below).
- Inspect model to find the fitted least-squares coefficients. Check the values match those you computed in the previous exercise.
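A minimal sketch of this step, assuming the vectors w and d created earlier:

model <- lm(w ~ d)   # regress waiting time on eruption duration
model                # printing the object displays the fitted coefficients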
R also has a number of functions that, when applied to the results of a linear regression, will return key quantities such as residuals and fitted values.
- coef(model) and coefficients(model) – return the estimated model coefficients as a vector \((\widehat{\beta}_0,\widehat{\beta}_1)\)
- fitted(model) and fitted.values(model) – return the vector of fitted values, \(\widehat{y}_i=\widehat{\beta}_0+\widehat{\beta}_1 x_i\)
- resid(model) and residuals(model) – return the vector of residual values, \(e_i=y_i-\widehat{y}_i\)

For more information, see the R help page.
- Use the coef function to extract the coefficients of the fitted linear regression model as a vector. Check the values agree with those obtained by direct calculation.
- Extract the residuals from model, and use them to compute the sum of squares of the residuals as lsq.Q (see the sketch below).
- Compare lsq.Q to guess.Q - was your best guess close to the minimum of the least-squares error function?
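One possible approach, assuming model and guess.Q from earlier:

coef(model)                    # least-squares coefficients (compare with beta0hat and beta1hat)
lsq.Q <- sum(resid(model)^2)   # residual sum of squares for the least-squares fit
lsq.Q
guess.Q                        # guess.Q can never be smaller than lsq.Q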
We can easily extract and inspect the coefficients and residuals of the linear model, but to obtain a more complete statistical summary of the model we use the summary
function:
summary(model)
There is a lot of information in this output, but the key quantities to focus on are:
- Each row of the coefficient table corresponds to one model coefficient: the first row is the intercept, labelled (Intercept), and subsequent rows will be named after the other predictor variables in the model.
- The Estimate column gives the least squares estimates of the coefficients.
- The Std. Error column gives the corresponding standard error for each coefficient.
- The t value column contains the \(t\) test statistics.
- The Pr(>|t|) column is then the \(p\)-value associated with each test, where a low p-value indicates that we have evidence to reject the null hypothesis.

We can also use the summary object to access the individual elements of the regression summary output. If we save the result of a call to the summary function on an lm object as summ:
- summ$residuals – extracts the regression residuals, as resid above
- summ$coefficients – returns the \(p \times 4\) coefficient summary table, with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value
- summ$sigma – extracts the regression standard error
- summ$r.squared – returns the regression \(R^2\)

Apply the summary function to model to produce the regression summary for our example. Make sure you are able to locate:

- the regression standard error \(s_e\) – extract it and save it as se. (We’ll need this later!)
- the t value column – what do the values in this column correspond to? What significance test (and what hypotheses) do these statistics correspond to?

Given a fitted regression model, we can perform various hypothesis tests and construct confidence intervals to perform the usual kinds of statistical inference. To do this, we require the Normality assumption to hold for our inference to be valid.
Theory tells us that the standard errors for the parameter estimates are defined as \[ s_{\widehat{\beta}_0} = s_e \sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}, ~~~~~~~~~~~~~~~~~~~~ s_{\widehat{\beta}_1} = \frac{s_e}{\sqrt{S_{xx}}}, \] where \(s_{\widehat{\beta}_0}\) and \(s_{\widehat{\beta}_1}\) are estimates of \(\sigma_{\widehat{\beta}_0}\) and \(\sigma_{\widehat{\beta}_1}\), the standard deviations of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\).
Given these standard errors and the assumption of normality, we can say for each coefficient \(\beta_i\), \(i=0,1\), that: \[ \frac{\widehat{\beta}_i - \beta_i}{s_{\widehat{\beta}_i}} \sim t_{n-2}, \] and hence a \(100(1-\alpha)\%\) confidence interval for \(\beta_i\) can be written as \[ \widehat{\beta}_i \pm t^*_{n-2,\alpha/2} ~ s_{\widehat{\beta}_i}. \]
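As an illustration of this interval formula, here is a minimal sketch that builds confidence intervals for both coefficients from the regression summary; it assumes the fitted model object model from earlier, and the 95% level (\(\alpha = 0.05\)) is just an example:

summ    <- summary(model)
est     <- summ$coefficients[, "Estimate"]     # beta0hat and beta1hat
se.coef <- summ$coefficients[, "Std. Error"]   # their standard errors
tcrit   <- qt(1 - 0.05/2, df = df.residual(model))   # critical value t*_{n-2, alpha/2}
cbind(lower = est - tcrit * se.coef, upper = est + tcrit * se.coef)
confint(model)   # compare with R's built-in confidence intervals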
Now apply this to the faithful regression:

- Use the formula above, together with the regression standard error se you saved earlier, to compute the standard error of the slope estimate and save it as se.beta1.
- Compute the \(t\) test statistic from beta1hat and se.beta1 for a test under the null hypothesis \(H_0: \beta_1=0\). Does it agree with the summary output from R?
- Use the pt function to find the probability of observing a test statistic as extreme as this.
- Given the t values in the regression summary, what do you conclude about the importance of the regression coefficients?
- Construct a confidence interval for \(\beta_1\), using the qt function to find the critical value of the \(t\)-distribution.

Suppose now that we are interested in some new point \(X=x^*\), and at this point we want to predict: (a) the location of the regression line, and (b) the value of \(Y\).
In the first problem, we are interested in the value \(\mu^*=\beta_0+\beta_1 x^*\), i.e. the location of the regression line, at some new point \(X=x^*\).
In the second problem, we are interested in the value of \(Y^*=\beta_0+\beta_1 x^*+\varepsilon\), i.e. the actual observation value, at some new \(X=x^*\).
The difference between the two is that the first simply concerns the location of the regression line, whereas the inference for the new observation \(Y^*\) concerns both the position of the regression line and the regression error about that point.
Theory tells us that a \(100(1-\alpha)\%\) confidence interval for \(\mu^*\), the location of the regression line at \(X=x^*\), is
\[ (\widehat{\beta}_0 + \widehat{\beta}_1 x^*) \pm t^*_{n-2,\alpha/2} \times {s_e} \sqrt{\frac{1}{n} + \frac{(x^*-\bar{x})^2}{S_{xx}}} \]
Similarly, a \(100(1-\alpha)\%\) prediction interval for the location of the observation \(Y^*\) at \(X=x^*\) is
\[ (\widehat{\beta}_0 + \widehat{\beta}_1 x^*) \pm t^*_{n-2,\alpha/2} \times {s_e} \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar{x})^2}{S_{xx}}} \]
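In R, both intervals can be obtained with the predict function. The sketch below assumes the fitted model object model from earlier; the new eruption duration of 3 minutes is an arbitrary illustrative value:

newpoint <- data.frame(d = 3)   # new eruption duration at which to predict
predict(model, newdata = newpoint, interval = "confidence", level = 0.95)   # interval for the regression line
predict(model, newdata = newpoint, interval = "prediction", level = 0.95)   # interval for a new observation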
If you finish the exercises above, have a look at some of these additional problems.