---
title: "Decision trees for regression problems"
author: "James Gammerman"
date: "26 April 2017"
output:
  html_document:
    keep_md: yes
---
# Predicting house prices in Boston, USA
```{r include = FALSE}
#install.packages("tree", repos = "http://cran.us.r-project.org")
library(tree)  # tree-fitting and pruning functions
#install.packages("ISLR", repos = "http://cran.us.r-project.org")
library(ISLR)  # for the Carseats dataset
#install.packages("MASS", repos = "http://cran.us.r-project.org")
library(MASS)  # for the Boston dataset
data(Boston)
```
* The Boston dataset (shipped with R as part of the MASS package) describes housing values and other information about Boston suburbs - 506 neighbourhoods in total. In particular, it records the median house value *(medv)* for each neighbourhood.
* Let's try to predict median house value using the 13 remaining attributes, such as average number of rooms per house *(rm)*, average age of houses *(age)*, and percent of households with low socioeconomic status *(lstat)*.
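* As a quick sanity check, we can confirm the dataset's size and the scale of the response before modelling:
```{r}
dim(Boston)           # 506 neighbourhoods, 14 variables (13 predictors + medv)
summary(Boston$medv)  # medv is recorded in thousands of dollars
```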
***
# Fitting the model
* First we create the training set and fit the tree to the training data:
```{r}
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
```
* Notice that the output of summary() indicates that only three of the variables have been used in constructing the tree.
* In the context of a regression tree, the deviance is simply the sum of squared errors for the tree (see lec slide 46), written out explicitly below.
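Explicitly, for a tree with terminal regions $R_1, \dots, R_M$, the deviance reported by summary() is

$$\sum_{m=1}^{M} \sum_{i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2,$$

where $\hat{y}_{R_m}$ is the mean response of the training observations falling in region $R_m$. We now plot the tree.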
***
# The resulting decision tree
```{r, fig.width = 4, fig.height = 4}
plot(tree.boston)
text(tree.boston, pretty = 0)
```
* The tree indicates that lower values of *lstat* correspond to more expensive houses. It predicts a median house price of $46,380 for larger homes in suburbs in which residents have high socioeconomic status (*rm* >= 7.437 and *lstat* < 9.715).
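* For a text view of the same splits, we can simply print the fitted object, which lists each node's split rule, size, deviance, and predicted *medv*:
```{r}
tree.boston  # one row per node; terminal nodes are marked with *
```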
***
# Could cost complexity pruning improve our prediction?
* Now we use the cv.tree() function to see whether pruning the tree will improve performance.
```{r, fig.width = 4, fig.height = 4}
cv.boston <- cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = 'b')
```
* In this case, the most complex tree is selected by cross-validation.
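* We can read the CV-preferred tree size off directly:
```{r}
cv.boston$size[which.min(cv.boston$dev)]  # tree size with the lowest cross-validated deviance
```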
```{r, echo = FALSE, eval = FALSE}
# However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function:
prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)
```
***
# Predictions
* Therefore we use the unpruned tree to make predictions on the test set:
```{r, fig.width = 4, fig.height = 4}
yhat <- predict(tree.boston, newdata = Boston[-train, ])  # predictions for Boston medv
boston.test <- Boston[-train, "medv"]                     # true medv values for the test set
plot(yhat, boston.test)
abline(0, 1)  # abline(a, b) draws a line with intercept a and slope b
mean((yhat - boston.test)^2)  # test set MSE
```
* This is the test set MSE: we get a value of 25.05. Its square root is about 5.005, indicating that this model yields test predictions that are within around $5,005 of the true median home value for the suburb.
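* Since *medv* is measured in thousands of dollars, the square root of the MSE converts directly to dollars:
```{r}
sqrt(mean((yhat - boston.test)^2)) * 1000  # test RMSE in dollars, roughly $5,005
```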
***