In this workshop we will start with the analysis of the Boston dataset included in the package MASS. This dataset consists of 506 samples. The response variable is the median value of owner-occupied homes in Boston (medv, recorded in $1000s). The dataset has 13 associated predictor variables. We load the library MASS (installing the package is not needed, as MASS is distributed with R), check the help document for Boston, check for missing entries and use the commands names() and head() to get a better picture of what this dataset looks like.
library(MASS)
help(Boston)
sum(is.na(Boston))
## [1] 0
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
We will analyse medv with respect to the predictor lstat (the percentage of the population of lower socio-economic status). A convenient command that makes all of the columns of Boston directly accessible by name is attach(). (Keep in mind that attach() masks any existing objects with the same names; detach(Boston) reverses it.)
attach(Boston)
head(cbind(medv, lstat), 10L)
## medv lstat
## [1,] 24.0 4.98
## [2,] 21.6 9.14
## [3,] 34.7 4.03
## [4,] 33.4 2.94
## [5,] 36.2 5.33
## [6,] 28.7 5.21
## [7,] 22.9 12.43
## [8,] 27.1 19.15
## [9,] 16.5 29.93
## [10,] 18.9 17.10
For convenience we will just name the response y and the predictor x. We will also pre-define the x-axis and y-axis labels that we will use repeatedly in figures.
y = medv
x = lstat
y.lab = 'Median Property Value'
x.lab = 'Lower Status (%)'
First, let's plot the two variables.
plot(lstat, medv, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, main = "", bty = 'l')
The plot suggests a non-linear relationship between medv and lstat.
Let us start by fitting a degree-2 polynomial to the data using the command lm() and summarising the results with summary().
poly2 = lm(y ~ x + I(x^2))
summary(poly2)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## x -2.332821 0.123803 -18.84 <2e-16 ***
## I(x^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
Important note: above we see that we need to enclose x^2 within the wrapper function I(). This is because operators like ^, + and * have a special meaning inside model formulas, so whenever we want their arithmetic meaning in a term we must protect that term with I(); otherwise lm() silently interprets the expression as formula syntax and we get wrong results! Besides that, we see the usual output from summary().
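To see why I() is needed, here is a quick sketch comparing the two versions (the first silently reduces to a straight line):
# Without I(), ^ is formula notation: x^2 means "x crossed with itself",
# which simplifies to just x, so the quadratic term silently disappears.
coef(lm(y ~ x + x^2))    # intercept and slope for x only
coef(lm(y ~ x + I(x^2))) # intercept, x and x^2, as intended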
The above syntax is not very convenient because we need to add further polynomial terms manually. For instance, for a degree-4 polynomial we would need to use lm(y ~ x + I(x^2) + I(x^3) + I(x^4)). Fortunately, the function poly() makes things easier for us.
poly2 = lm(y ~ poly(x, 2, raw = TRUE))
summary(poly2)
##
## Call:
## lm(formula = y ~ poly(x, 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## poly(x, 2, raw = TRUE)1 -2.332821 0.123803 -18.84 <2e-16 ***
## poly(x, 2, raw = TRUE)2 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The argument raw = TRUE above is used in order to get the same coefficients as previously. If we do not use this argument the coefficients will differ, because by default poly() calculates them on an orthogonal polynomial basis. However, this is not important: as mentioned in lectures, with polynomial regression we are almost never interested in the regression coefficients themselves. In terms of the fitted curve, poly(x, 2, raw = TRUE) and poly(x, 2) give the same result!
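As a quick sanity check, we can verify that the two parameterisations produce an identical fitted curve:
# The orthogonal and raw parameterisations give the same fitted values,
# even though their coefficients differ.
all.equal(fitted(lm(y ~ poly(x, 2))),
          fitted(lm(y ~ poly(x, 2, raw = TRUE))))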
Now, let's see how we can produce a plot similar to the ones presented in the lectures. A first important step is to create an object, which we name sort.x, containing the values of the predictor x sorted in ascending order. Without sort.x the plotting commands below would connect the fitted values in whatever order the data happen to appear and produce a tangled line! We then use predict() with sort.x as input in order to proceed to the next steps.
sort.x = sort(x)
sort.x[1:10] # the first 10 sorted values of x
## [1] 1.73 1.92 1.98 2.47 2.87 2.88 2.94 2.96 2.97 2.98
pred2 = predict(poly2, newdata = list(x = sort.x), se = TRUE)
names(pred2)
## [1] "fit" "se.fit" "df" "residual.scale"
The object pred2 contains fit, the fitted values, and se.fit, the standard errors that we need in order to construct the approximate 95% confidence intervals (fit ± 2 standard errors, since 2 ≈ 1.96, the relevant normal quantile). With this information we can construct the confidence intervals using cbind(). Let's see what the first 10 fitted values and the first 10 confidence intervals look like.
se.bands2 = cbind(pred2$fit + 2 * pred2$se.fit, pred2$fit - 2 * pred2$se.fit)
pred2$fit[1:10] # the first 10 fitted values of the curve
## 1 2 3 4 5 6 7 8
## 38.95656 38.54352 38.41374 37.36561 36.52550 36.50468 36.37992 36.33840
## 9 10
## 36.31765 36.29691
se.bands2[1:10,] # the first 10 confidence intervals of the curve
## [,1] [,2]
## 1 40.33069 37.58243
## 2 39.88036 37.20668
## 3 39.73895 37.08853
## 4 38.59845 36.13278
## 5 37.68647 35.36453
## 6 37.66390 35.34546
## 7 37.52865 35.23118
## 8 37.48365 35.19314
## 9 37.46117 35.17413
## 10 37.43870 35.15513
Now that we have all that is needed we can finally produce the plot! One final detail: below we use lines() for pred2$fit because this is a vector, but for se.bands2, which is a matrix, we have to use matlines().
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-2 polynomial", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "red")
matlines(sort.x, se.bands2, lwd = 1.4, col = "red", lty = 3)
Having seen this procedure in detail once, plotting the fitted curves for polynomials of different degrees is straightforward. The code below produces a plot of the degree-2 up to degree-5 polynomial fits.
poly3 = lm(y ~ poly(x, 3))
poly4 = lm(y ~ poly(x, 4))
poly5 = lm(y ~ poly(x, 5))
pred3 = predict(poly3, newdata = list(x = sort.x), se = TRUE)
pred4 = predict(poly4, newdata = list(x = sort.x), se = TRUE)
pred5 = predict(poly5, newdata = list(x = sort.x), se = TRUE)
se.bands3 = cbind(pred3$fit + 2*pred3$se.fit, pred3$fit-2*pred3$se.fit)
se.bands4 = cbind(pred4$fit + 2*pred4$se.fit, pred4$fit-2*pred4$se.fit)
se.bands5 = cbind(pred5$fit + 2*pred5$se.fit, pred5$fit-2*pred5$se.fit)
par(mfrow = c(2,2))
# Degree-2
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-2 polynomial", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "red")
matlines(sort.x, se.bands2, lwd = 2, col = "red", lty = 3)
# Degree-3
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-3 polynomial", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "darkviolet")
matlines(sort.x, se.bands3, lwd = 2, col = "darkviolet", lty = 3)
# Degree-4
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-4 polynomial", bty = 'l')
lines(sort.x, pred4$fit, lwd = 2, col = "blue")
matlines(sort.x, se.bands4, lwd = 2, col = "blue", lty = 3)
# Degree-5
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "Degree-5 polynomial", bty = 'l')
lines(sort.x, pred5$fit, lwd = 2, col = "black")
matlines(sort.x, se.bands5, lwd = 2, col = "black", lty = 3)
For these data it is not clear which curve fits best. All four curves look reasonable given the data we have available. If we want to base our decision on a formal procedure, rather than on a subjective judgement, one option is to use the classical statistical methodology of analysis of variance (ANOVA). Specifically, we will perform sequential comparisons based on the F-test, comparing first the linear model vs. the quadratic model (degree-2 polynomial), then the quadratic model vs. the cubic model (degree-3 polynomial), and so on. For this we also have to fit the simple linear model and the degree-6 polynomial. We can perform this analysis in R using the command anova() as displayed below.
poly1 = lm(y ~ x)
poly6 = lm(y ~ poly(x, 6))
anova(poly1, poly2, poly3, poly4, poly5, poly6)
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ poly(x, 2, raw = TRUE)
## Model 3: y ~ poly(x, 3)
## Model 4: y ~ poly(x, 4)
## Model 5: y ~ poly(x, 5)
## Model 6: y ~ poly(x, 6)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 151.8623 < 2.2e-16 ***
## 3 502 14616 1 731.8 26.9390 3.061e-07 ***
## 4 501 13968 1 647.8 23.8477 1.406e-06 ***
## 5 500 13597 1 370.7 13.6453 0.0002452 ***
## 6 499 13555 1 42.4 1.5596 0.2123125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis tested at each step is that the decrease in RSS achieved by the extra polynomial term is not significant. The hypothesis is rejected if the p-value (column Pr(>F)) is smaller than a given significance level (say 0.05). If the hypothesis is rejected we move on to the next comparison; we stop at the first comparison that is not rejected. Based on this reasoning and the results above we would select the degree-5 polynomial.
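A side note (a quick cross-check, not required for the analysis): because poly() constructs orthogonal polynomial terms, the squared t-statistics of the quadratic up to degree-6 terms in the summary of poly6 coincide with the five F-statistics in the table above.
# Squaring the t-statistics of the degree-6 fit: the entries for the
# quadratic up to the degree-6 term reproduce the F column above.
coef(summary(poly6))[, "t value"]^2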
Let's see how to analyse median property value from a different perspective. Specifically, suppose we want to model the probability that the median property value exceeds a specific threshold or cut-off point. We can get useful insights about the distribution of our response via summary(). For example, we see that 25% of the properties have a value above $25,000 (the third quartile of medv is 25). So, let's use 25 as the threshold; the plot below shows the data points and the threshold at 25.
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "", bty = 'l')
abline(h = 25, lty = 2)
Now let's fit a degree-5 polynomial logistic model. For the logistic model we cannot use lm() because this command fits only linear models. Instead, we have to use glm() (which fits a variety of models). For logistic regression we have to specify family = binomial. Also, below we use the wrapper function I() to define on the fly a binary response based on the chosen threshold.
poly.bin = glm(I(y>25) ~ poly(x, 5), family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Above we get a warning message, but we can ignore it (it is error messages that we can't ignore). Fitting the curve with logistic models is slightly more complicated, because predict() returns the fitted values on the log-odds (logit) scale, so we have to transform them back to probabilities: we do that in the part exp(pred$fit)/(1 + exp(pred$fit)) below. We also have to do the same for the confidence bands.
pred = predict(poly.bin, newdata = list(x = sort.x), se = TRUE)
probs = exp(pred$fit)/(1 + exp(pred$fit))
se.bands.logit = cbind(pred$fit + 2*pred$se.fit, pred$fit - 2*pred$se.fit)
se.bands = exp(se.bands.logit)/(1 + exp(se.bands.logit))
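Equivalently, instead of computing exp(.)/(1 + exp(.)) by hand, we can use plogis(), the logistic distribution function built into R, which performs exactly this transformation:
# plogis(q) equals exp(q)/(1 + exp(q)), so this reproduces probs above.
probs.alt = plogis(pred$fit)
all.equal(probs, probs.alt)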
We can now produce a plot of these probabilities vs. lower status using the following piece of code.
plot(x, I(y>25), bty = "n", type = 'n', xlab = x.lab,
ylab = 'Pr(Median Property Value > 25)', cex.lab = 1.1,
main = 'Probability plot from degree-5 polynomial')
points(x, I((y>25)), cex = .5, pch = "|", col = "darkorange")
lines(sort.x, probs, lwd = 2, col = "black")
matlines(sort.x, se.bands, lwd = 1.2, col = "black", lty = 3)
We see a sharp decrease in the probability as the percentage of lower socio-economic status increases. We also notice how the upper confidence band “explodes” towards the right of the plot, where there are no samples from the category medv > 25.
For step functions we can make use of the command cut(), which automatically assigns samples to a given number of equal-width intervals. We can check how this works by executing the following line of code.
table(cut(x, 2))
##
## (1.69,19.9] (19.9,38]
## 430 76
What we see is that cut(x, 2) automatically created the intervals \((1.69, 19.9]\) and \((19.9, 38]\) for us. The command table() tells us that 430 samples of x fall within the first interval and 76 samples fall within the second.
So, we can use cut() within lm() to easily fit regression models based on step functions. Below we consider four models that split the range of x into 2, 3, 4 and 5 intervals.
step2 = lm(y ~ cut(x, 2))
step3 = lm(y ~ cut(x, 3))
step4 = lm(y ~ cut(x, 4))
step5 = lm(y ~ cut(x, 5))
The analysis is then essentially the same as before. The following code produces the 2-by-2 plot of the fitted step functions of the four models.
pred2 = predict(step2, newdata = list(x = sort(x)), se = TRUE)
pred3 = predict(step3, newdata = list(x = sort(x)), se = TRUE)
pred4 = predict(step4, newdata = list(x = sort(x)), se = TRUE)
pred5 = predict(step5, newdata = list(x = sort(x)), se = TRUE)
se.bands2 = cbind(pred2$fit + 2*pred2$se.fit, pred2$fit-2*pred2$se.fit)
se.bands3 = cbind(pred3$fit + 2*pred3$se.fit, pred3$fit-2*pred3$se.fit)
se.bands4 = cbind(pred4$fit + 2*pred4$se.fit, pred4$fit-2*pred4$se.fit)
se.bands5 = cbind(pred5$fit + 2*pred5$se.fit, pred5$fit-2*pred5$se.fit)
par(mfrow = c(2,2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "2 cutoffs", bty = 'l')
lines(sort(x), pred2$fit, lwd = 2, col = "red")
matlines(sort(x), se.bands2, lwd = 1.4, col = "red", lty = 3)
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "3 cutoffs", bty = 'l')
lines(sort(x), pred3$fit, lwd = 2, col = "darkviolet")
matlines(sort(x), se.bands3, lwd = 1.4, col = "darkviolet", lty = 3)
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "4 cutoffs", bty = 'l')
lines(sort(x), pred4$fit, lwd = 2, col = "blue")
matlines(sort(x), se.bands4, lwd = 1.4, col = "blue", lty = 3)
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "5 cutoffs", bty = 'l')
lines(sort(x), pred5$fit, lwd = 2, col = "black")
matlines(sort(x), se.bands5, lwd = 1.4, col = "black", lty = 3)
Note that we do not necessarily need to rely on the automatic selection of intervals used by cut(). We can define the intervals ourselves if we want to. For instance, if we want cutpoints at 10, 20 and 30 we can do the following.
range(x)
## [1] 1.73 37.97
breaks4 = c(min(x), 10, 20, 30, max(x))
table(cut(x, breaks = breaks4))
##
## (1.73,10] (10,20] (20,30] (30,38]
## 218 213 62 12
Then our model would be:
step.new4 = lm(y ~ cut(x, breaks = breaks4))
summary(step.new4)
##
## Call:
## lm(formula = y ~ cut(x, breaks = breaks4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4803 -4.6239 -0.4239 2.8968 20.6197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.3803 0.4415 66.540 <2e-16 ***
## cut(x, breaks = breaks4)(10,20] -10.4563 0.6281 -16.648 <2e-16 ***
## cut(x, breaks = breaks4)(20,30] -16.6770 0.9383 -17.773 <2e-16 ***
## cut(x, breaks = breaks4)(30,38] -18.6886 1.9331 -9.668 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.519 on 501 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4925, Adjusted R-squared: 0.4895
## F-statistic: 162.1 on 3 and 501 DF, p-value: < 2.2e-16
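Notice the line "(1 observation deleted due to missingness)" in the output above. By default the intervals created by cut() are open on the left, so the single observation at min(x) = 1.73 falls outside \((1.73, 10]\) and is coded as NA. If we want to keep it we can add the argument include.lowest = TRUE:
# include.lowest = TRUE closes the first interval on the left,
# so the observation exactly at min(x) is no longer dropped.
step.new4 = lm(y ~ cut(x, breaks = breaks4, include.lowest = TRUE))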
The relationship between the predictor indus and medv also looks interesting. The predictor indus is the proportion of non-retail business acres per town. First plot the variables to see what the relationship between indus and medv looks like, as in the starting point sketched below. Then analyse the data. Feel free to choose between polynomials and step functions (of course, you can experiment with both if you have time). If you use step functions you can also define the intervals manually. Use ANOVA to decide which model fits the data best. For the logistic analysis you may choose any threshold that seems appropriate.
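As a starting point, the plotting command mirrors the one we used for lstat (the x-axis label below is just a suggestion):
# Scatterplot of medv against indus (Boston is still attached).
plot(indus, medv, cex.lab = 1.1, col = "darkgrey",
     xlab = "Non-retail Business Acres (%)", ylab = y.lab,
     main = "", bty = 'l')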