If you finish the rest of Practical 3, investigate some of the additional problems below.
Residual Analysis
The residuals from our linear regression are the values \(e_i=\widehat{\varepsilon}_i = y_i - \widehat{y}_i\) for \(i = 1,\dots,n\), where \(\widehat{y}_i=\widehat{\beta}_0+\widehat{\beta}_1 x_i\). Analysis of the residuals allows us to perform diagnostic checks of the goodness of fit of our regression under our modelling assumptions.
Under the assumption of linearity only, given the OLS estimates, the observed values of \(X\) are uncorrelated with the residuals. In the case of simple linear regression, this implies that the residuals are also uncorrelated with any linear function of \(X\), in particular the fitted values \(\widehat{y}_i=\widehat{\beta}_0+\widehat{\beta}_1 x_i\). Our diagnostics are therefore based on scatterplots of \(e\) against \(\widehat{y}\), which in the case of simple linear regression look the same as plots of \(e\) against \(x\), apart from the scale of the horizontal axis.
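As a brief sketch of why this holds (a standard consequence of the OLS normal equations, assuming the model includes an intercept), the estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are chosen so that
\[
\sum_{i=1}^n e_i = 0 \qquad\text{and}\qquad \sum_{i=1}^n x_i e_i = 0 .
\]
Writing \(\bar{x}\) for the sample mean of the \(x_i\), it follows that
\[
\sum_{i=1}^n (x_i-\bar{x})\,e_i = 0
\qquad\text{and}\qquad
\sum_{i=1}^n \widehat{y}_i\, e_i = \widehat{\beta}_0\sum_{i=1}^n e_i + \widehat{\beta}_1\sum_{i=1}^n x_i e_i = 0 ,
\]
so the sample correlation of the residuals with both \(x\) and \(\widehat{y}\) is zero.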
- Plot the residuals against the fitted values, and the residuals against the eruption durations side-by-side.
- What do you see? Why should these plots be similar?
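One possible way to draw these plots is sketched below. It assumes the data are R's built-in faithful data frame (columns eruptions and waiting) and that waiting time is the response; the object names fit, e and yhat are illustrative choices, not names fixed by the practical.

```r
# Assumption: the data are R's built-in `faithful` data frame,
# with waiting time regressed on eruption duration.
fit  <- lm(waiting ~ eruptions, data = faithful)
e    <- resid(fit)    # residuals e_i = y_i - yhat_i
yhat <- fitted(fit)   # fitted values yhat_i

# Two panels side by side; restore the old settings afterwards
op <- par(mfrow = c(1, 2))
plot(yhat, e, xlab = "Fitted waiting time", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
plot(faithful$eruptions, e, xlab = "Eruption duration", ylab = "Residual",
     main = "Residuals vs duration")
abline(h = 0, lty = 2)
par(op)

# Both sample correlations should be zero up to rounding error
cor(faithful$eruptions, e)
cor(yhat, e)
```

Because the fitted values are a linear function of the durations, the two panels should show the same point pattern with a rescaled horizontal axis.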
To assess the SLR assumptions, we inspect the residual plot for particular features:
- Even and random scatter of the residuals about \(0\) is the expected behaviour when our SLR assumptions are satisfied.
- Residuals show evidence of a trend or pattern: the presence of a clear pattern or trend in the residuals suggests that \({\mathbb{E}\left[{\varepsilon_i}\right]}\neq 0\) or that \({\mathbb{C}\text{ov}\left[{\varepsilon_i,\varepsilon_j}\right]}\neq 0\) for some \(i\neq j\). There is clearly structure in the data that is not explained by the regression model, and so a simple linear model is not adequate for explaining the behaviour of \(Y\).
- Spread of the residuals is not constant: if the spread of the residuals changes substantially as \(x\) (or \(\widehat{y}\)) changes, then our assumption of constant variance is not upheld.
- A small number of residuals are very far from the others and from \(0\): observations with very large residuals are known as outliers. Sometimes these can be explained by particularly high measurement error, or by the effect of another variable which should be included in the model. Their presence could signal problems with the data, a linear regression being inadequate, or a violation of Normality.
- Use your residual plot to assess whether the SLR assumptions are valid for these data.
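To get a feel for what these features look like, the following sketch simulates three artificial data sets and plots the residuals for each. Everything here is an illustrative assumption (the seed, sample size, coefficients and noise levels are made up and have nothing to do with the eruption data): one data set satisfies the SLR assumptions, one has a curved mean, and one has a spread that grows with \(x\).

```r
# Purely simulated data: all numbers below are arbitrary illustrative choices
set.seed(1)
n <- 200
x <- runif(n, 0, 10)

y_ok    <- 1 + 2 * x + rnorm(n, sd = 1)                     # assumptions hold
y_curve <- 1 + 2 * x + 0.5 * (x - 5)^2 + rnorm(n, sd = 1)   # curved mean
y_fan   <- 1 + 2 * x + rnorm(n, sd = 0.3 * x)               # spread grows with x

fits <- list("Even scatter"        = lm(y_ok ~ x),
             "Trend in residuals"  = lm(y_curve ~ x),
             "Non-constant spread" = lm(y_fan ~ x))

# One residual plot per simulated data set
op <- par(mfrow = c(1, 3))
for (name in names(fits)) {
  plot(fitted(fits[[name]]), resid(fits[[name]]),
       xlab = "Fitted values", ylab = "Residuals", main = name)
  abline(h = 0, lty = 2)
}
par(op)
```

Comparing your own residual plot with these three panels may help in deciding which, if any, of the features is present.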
In order to state confidence intervals for the coefficients and for predictions made using this model, in addition to the assumptions checked above, we also require that the regression errors are normally distributed. We can check the normality assumption using quantile plots (see Practical 6 and qqnorm for more on quantile plots).
- Produce a side-by-side plot of the histogram of the residuals and a normal quantile plot of the residuals.
- Do the residuals appear to be normally distributed?
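A minimal sketch of such a plot is given below, again assuming the built-in faithful data frame and a fit of waiting on eruptions. The number of histogram bins and the overlaid normal density are illustrative choices.

```r
# Assumption: same model as before, fitted to R's built-in `faithful` data
fit <- lm(waiting ~ eruptions, data = faithful)
e   <- resid(fit)

op <- par(mfrow = c(1, 2))

# Histogram on the density scale, with a N(0, sd(e)^2) curve for reference
hist(e, breaks = 20, freq = FALSE,
     main = "Histogram of residuals", xlab = "Residual")
curve(dnorm(x, mean = 0, sd = sd(e)), add = TRUE, lty = 2)

# Normal quantile plot; points near the reference line support normality
qqnorm(e)
qqline(e)

par(op)
```

Systematic curvature away from the reference line in the quantile plot, especially in the tails, would count as evidence against normality.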
Looking more deeply
We will now take a closer look at the data itself and discuss whether our modelling approach is appropriate.
- Replot the waiting times (waiting, on the \(y\)-axis) against the eruption durations (eruptions, on the \(x\)-axis), or refer back to the plot from Section 1. Can you see anything out of the ordinary that might suggest caution when treating this data set as one group?
- Separate the data into two or more clusters by partitioning on the eruption duration. Subset both the eruptions and waiting vectors according to your criterion, and redraw the scatterplot, indicating the groups by colouring the points. Was it reasonable to assume that all pairs of observations were sampled from a common distribution?
- What is really going on?
- Fit a linear regression model using the OLS method to each subset. Add the regression line for each model to your plot. What do you see? Compare these new models to the previous one.
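The steps above could be sketched as follows. This assumes the built-in faithful data frame and, purely for illustration, splits the data into two groups at an eruption duration of 3 minutes; the cutoff, the colours and the object names are hypothetical choices, and you should pick your own criterion from the scatterplot.

```r
# Assumption: data are R's built-in `faithful`; the cutoff of 3 minutes
# is an illustrative choice, not one prescribed by the practical.
cutoff <- 3
short  <- faithful$eruptions < cutoff   # TRUE for the short-eruption group

# Scatterplot with the two groups in different colours
plot(faithful$eruptions, faithful$waiting,
     col = ifelse(short, "blue", "red"), pch = 16,
     xlab = "Eruption duration", ylab = "Waiting time")

# OLS fits: all data, and each subset separately
fit_all   <- lm(waiting ~ eruptions, data = faithful)
fit_short <- lm(waiting ~ eruptions, data = faithful[short, ])
fit_long  <- lm(waiting ~ eruptions, data = faithful[!short, ])

# abline() draws each fitted line across the whole plotting region
abline(fit_all, lty = 2)
abline(fit_short, col = "blue")
abline(fit_long, col = "red")
legend("topleft", legend = c("All data", "Short", "Long"),
       col = c("black", "blue", "red"), lty = c(2, 1, 1))

# Intercepts and slopes side by side for comparison
coefs <- rbind(all = coef(fit_all), short = coef(fit_short), long = coef(fit_long))
coefs
```

Note that each line is only supported by the data in its own group, so the parts of the coloured lines that extend beyond their group are extrapolations.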