Tutorial: Boston Housing Data - Social Status Analysis

In this workshop we will continue analysing the Boston dataset included in MASS. This dataset consists of 506 samples. The response variable is median of owner-occupied homes in Boston (medv). The dataset has 13 associated predictor variables.

Initial steps

We load library MASS again. As previously, we will analyse medv with respect to the predictor lstat (percent of lower status of population).

library(MASS)
attach(Boston)
y = medv
x = lstat
y.lab = 'Median Property Value'
x.lab = 'Lower Status (%)'
plot(lstat, medv, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab
     , main = "", bty = 'l')

Regression Splines

For this analysis we will require package splines, therefore we need to use install.packages('splines') in RStudio once. We then load the library

library(splines)

Initially lets fits regression splines by specifying knots. From the previous plot it is not clear where exactly we should place knots, so we will make use of the command summary in order to find the 25th, 50th and 75th percentiles of x, which will be the positions where we will place the knots.

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.73    6.95   11.36   12.65   16.95   37.97
cuts = summary(x)[c(2, 3, 5)] 
cuts
## 1st Qu.  Median 3rd Qu. 
##   6.950  11.360  16.955

Also, before we fit the splines it is again important to sort the variable x.

sort.x = sort(x)

For start lets fit a linear spline using our selected placement of knots. For this we can use command lm() and inside it we use the command bs() in which we specify degree = 1 for a linear spline and knots = cuts for the placement of the knots at the three percentiles. We also calculate the corresponding fitted values and confidence intervals exactly in the same way we did in the previous workshop.

spline1 = lm(y ~ bs(x, degree = 1, knots = cuts))
pred1 = predict(spline1, newdata = list(x = sort.x), se = TRUE)
se.bands1 = cbind(pred1$fit + 2 * pred1$se.fit, pred1$fit - 2 * pred1$se.fit)

Having done that we can now produce the corresponding plot.

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline", bty = 'l')
lines(sort.x, pred1$fit, lwd = 2, col = "red")
matlines(sort.x, se.bands1, lwd = 2, col = "red", lty = 3)

Using ?bs we see that instead of using the argument knots we can use the argument df, which are the degrees of freedom. We know from lectures that splines have \((d+1) + K\) degrees of freedom where \(d\) is the degree of the polynomial. So in this case we have 2+1+3 = 5 degrees of freedom. Selecting df = 5 in bs() will automatically use 3 knots placed at the 25th, 50th and 75th percentiles. Below we check whether the plot based on df=5 is indeed the same as the previous plot and as we can see it is.

spline1df = lm(y ~ bs(x, degree = 1, df = 5))
pred1df = predict(spline1df, newdata = list(x = sort.x), se = TRUE)
se.bands1df = cbind(pred1df$fit + 2 * pred1df$se.fit, pred1df$fit - 2 * pred1df$se.fit)

par(mfrow = c(1, 2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline (with knots)", bty = 'l')
lines(sort.x, pred1$fit, lwd = 2, col = "red")
matlines(sort.x, se.bands1, lwd = 2, col = "red", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline (with df)", bty = 'l')
lines(sort.x, pred1df$fit, lwd = 2, col = "darkred")
matlines(sort.x, se.bands1df, lwd = 2, col = "red", lty = 3)

Having seen how this works we can also fit a degree-2 (quadratic) and degree-3 (cubic) spline to the data, all we have to do is change degree = 1 to degree = 2 and degree = 3 respectively. Also we increase the respective degrees of freedom from df = 5 to df = 6 and df = 7 in order to keep the same number (and position) of knots in the quadratic and cubic spline models.

spline2 = lm(y ~ bs(x, degree = 2, df = 6))
pred2 = predict(spline2, newdata = list(x = sort.x), se = TRUE)
se.bands2 = cbind(pred2$fit + 2 * pred2$se.fit, pred2$fit - 2 * pred2$se.fit)

spline3 = lm(y ~ bs(x, degree = 3, df = 7))
pred3 = predict(spline3, newdata = list(x = sort.x), se = TRUE)
se.bands3 = cbind(pred3$fit + 2 * pred3$se.fit, pred3$fit - 2 * pred3$se.fit)

par(mfrow = c(1,3))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline", bty = 'l')
lines(sort.x, pred1$fit, lwd = 2, col = "darkred")
matlines(sort.x, se.bands1, lwd = 2, col = "darkred", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Quadratic Spline", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "darkgreen")
matlines(sort.x, se.bands2, lwd = 2, col = "darkgreen", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Cubic Spline", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "darkblue")
matlines(sort.x, se.bands3, lwd = 2, col = "darkblue", lty = 3)

Natural Splines

Another option we have seen in lectures is to fit natural splines to the data. For this we can use the command ns() in RStudio. As with the command bs(), previously, we have again the option to either specify the knots manually (via the argument knots) or to simply pre-define the degrees of freedom (via the argument df). Below we use the latter option to fit four natural splines with 1, 2, 3 and 4 degrees of freedom. As we see using 1 degree of freedom actually results in just a linear model.

spline.ns1 = lm(y ~ ns(x, df = 1))
pred.ns1 = predict(spline.ns1, newdata = list(x = sort.x), se = TRUE)
se.bands.ns1 = cbind(pred.ns1$fit + 2 * pred.ns1$se.fit, pred.ns1$fit - 2 * pred.ns1$se.fit)

spline.ns2 = lm(y ~ ns(x, df = 2))
pred.ns2 = predict(spline.ns2, newdata = list(x = sort.x), se = TRUE)
se.bands.ns2 = cbind(pred.ns2$fit + 2 * pred.ns2$se.fit, pred.ns2$fit - 2 * pred.ns2$se.fit)

spline.ns3 = lm(y ~ ns(x, df = 3))
pred.ns3 = predict(spline.ns3, newdata = list(x = sort.x), se = TRUE)
se.bands.ns3 = cbind(pred.ns3$fit + 2 * pred.ns3$se.fit, pred.ns3$fit - 2 * pred.ns3$se.fit)

spline.ns4 = lm(y ~ ns(x, df = 4))
pred.ns4 = predict(spline.ns4, newdata = list(x = sort.x), se = TRUE)
se.bands.ns4 = cbind(pred.ns4$fit + 2 * pred.ns4$se.fit, pred.ns4$fit - 2 * pred.ns4$se.fit)

par(mfrow = c(1, 4))

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (1 df)", bty = 'l')
lines(sort.x, pred.ns1$fit, lwd = 2, col = "darkred")
matlines(sort.x, se.bands.ns1, lwd = 2, col = "darkred", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (2 df)", bty = 'l')
lines(sort.x, pred.ns2$fit, lwd = 2, col = "darkgreen")
matlines(sort.x, se.bands.ns2, lwd = 2, col = "darkgreen", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (3 df)", bty = 'l')
lines(sort.x, pred.ns3$fit, lwd = 2, col = "darkblue")
matlines(sort.x, se.bands.ns3, lwd = 2, col = "darkblue", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (4 df)", bty = 'l')
lines(sort.x, pred.ns4$fit, lwd = 2, col = "brown")
matlines(sort.x, se.bands.ns4, lwd = 2, col = "brown", lty = 3)

Below we plot the cubic spline next to the natural cubic slpine for comparison. As seen the natural cubic spline is generally smoother and it is also closer to linear on the right boundary of the predictor space, where it has, additionally, narrower confidence intervals in comparison to the cubic spline.

par(mfrow = c(1,2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Cubic Spline", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "blue")
matlines(sort.x, se.bands3, lwd = 2, col = "blue", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (3 df)", bty = 'l')
lines(sort.x, pred.ns3$fit, lwd = 2, col = "darkblue")
matlines(sort.x, se.bands.ns3, lwd = 2, col = "darkblue", lty = 3)

Smoothing Splines

For fitting smoothing splines we use the command smooth.splines() instead of lm(). Under smoothing splines there are no knots to specify; only parameter \(\lambda\). This can be specified via cross-validation by specifying cv = TRUE inside smooth.splines(). Alternatively, we can specify the effective degrees of freedom which correspond to some value of \(\lambda\). Below we use cross-validation as well as a smoothing spline with 3 effective degrees of freedom (via the argument df = 3). In this case we see that tuning \(\lambda\) through cross-validation results in a curve which is slightly wiggly on the right boundary of the predictor space.

smooth1 = smooth.spline(x, y, df = 3)
smooth2 = smooth.spline(x, y, cv = TRUE)
## Warning in smooth.spline(x, y, cv = TRUE): cross-validation with non-unique 'x'
## values seems doubtful
par(mfrow = c(1,3))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (3 df)", bty = 'l')
lines(sort.x, pred.ns3$fit, lwd = 2, col = "darkblue")

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Smoothing Spline (3 df)", bty = 'l')
lines(smooth1, lwd = 2, col = "brown")

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Smoothing Spline (CV)", bty = 'l')
lines(smooth2, lwd = 2, col = "darkorange")

GAMs

In this final part we will fit a generalised additive model (GAM) utilising more than one predictor from the Boston dataset. We first use the command names() in order to check once again the available predictor variables.

names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Lets say that we want to use predictors lstat, indus and chas for the analysis. For GAMs we will make use of the library gam in RStudio, so the first thing that we have to do is to install this package by executing install.packages(gam) once. Then we load the library.

library(gam)
## Loading required package: foreach
## Loaded gam 1.20

The main function is gam(). Inside this function we can use any combination of non-linear and linear modelling of the various predictors. For, example below we use a cubic spline with 3 degrees of freedom for lstat, a smoothing spline with 3 degrees of freedom for indus and a simple linear model for variable chas. We then plot the contributions of each predictor using the command plot().

gam = gam(medv ~ bs(lstat, degree = 3, df = 5) + s(indus, df = 5) + chas, data = Boston)
par(mfrow = c(1,3))
plot(gam,  se = TRUE, col = "blue")

Note that simly using chas inside gam() just fitted a linear model for this variable. However, one thing that we observe is that variable is binary as it only takes the values of 0 and 1. This we can see from the x-axis of the chas plot on the right above. So, it would be preferable to use a step function for this variable. In order to do this we have to change the variable chas to a factor. We first create a second object called Boston1 (in order not to change the initial dataset Boston) and then we use the command factor() to change variable chas. Then we fit again the same model. As we can see below now gam() fitted a step function for variable chas which is more appropriate.

Boston1 = Boston
Boston1$chas = factor(Boston1$chas)

gam1 = gam(medv ~ bs(lstat, degree = 3, df = 5) + s(indus, df = 5) + chas, data = Boston1)
par(mfrow = c(1,3))
plot(gam1,  se = TRUE, col = "blue")

Practice: Boston Housing Data - Non-retail Business Concentration

Task 1

Perform an analysis similar to the previous analysis using again medv as response variable and indus as a single predictor.

Task 2

Choose a set of predictor variables to use in order to fit a GAM model. Feel free to experiment with different modelling options.