---
title: "Decision trees for regression problems"
author: "James Gammerman"
date: "26 April 2017"
output:
  html_document:
    keep_md: yes
---
# Predicting house prices in Boston, USA
```{r include = FALSE}
#install.packages("tree", repos = "http://cran.us.r-project.org")
library(tree)  # tree-fitting and pruning functions
#install.packages("ISLR", repos = "http://cran.us.r-project.org")
library(ISLR)  # for the Carseats dataset
#install.packages("MASS", repos = "http://cran.us.r-project.org")
library(MASS)  # for the Boston dataset
data(Boston)
```
* The Boston dataset (shipped with R as part of the MASS package) describes housing values and other information about Boston suburbs - 506 neighbourhoods in total. In particular, it records the median house value *(medv)* for each neighbourhood.
* Let's try to predict median house value using the 13 remaining attributes, such as average number of rooms per house *(rm)*, average age of houses *(age)*, and percent of households with low socioeconomic status *(lstat)*.
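* As a quick sanity check, we can confirm the dataset's size and the scale of the response before modelling:
```{r}
dim(Boston)           # 506 neighbourhoods, 14 variables (13 predictors + medv)
summary(Boston$medv)  # medv is recorded in thousands of dollars
```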
***
# Fitting the model
* First we create the training set and fit the tree to the training data:
```{r}
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
```
* Notice that the output of summary() indicates that only three of the variables have been used in constructing the tree.
* In the context of a regression tree, the deviance is simply the sum of squared errors for the tree (see lec slide 46), written out explicitly below.
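Explicitly, for a tree with terminal regions $R_1, \dots, R_M$, the deviance reported by summary() is

$$\sum_{m=1}^{M} \sum_{i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2,$$

where $\hat{y}_{R_m}$ is the mean response of the training observations falling in region $R_m$. We now plot the tree.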
***
# The resulting decision tree
```{r, fig.width = 4, fig.height = 4}
plot(tree.boston)
text(tree.boston, pretty = 0)
```
* The tree indicates that lower values of *lstat* correspond to more expensive houses. It predicts a median house price of $46,380 for larger homes in suburbs in which residents have high socioeconomic status (*rm* >= 7.437 and *lstat* < 9.715).
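* For a text view of the same splits, we can simply print the fitted object, which lists each node's split rule, size, deviance, and predicted *medv*:
```{r}
tree.boston  # one row per node; terminal nodes are marked with *
```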
***
# Could cost complexity pruning improve our prediction?
* Now we use the cv.tree() function to see whether pruning the tree will improve performance.
```{r, fig.width = 4, fig.height = 4}
cv.boston <- cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = 'b')
```
* In this case, the most complex tree is selected by cross-validation.
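* We can read the CV-preferred tree size off directly:
```{r}
cv.boston$size[which.min(cv.boston$dev)]  # tree size with the lowest cross-validated deviance
```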
```{r, echo = FALSE, eval = FALSE}
# However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function:
prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)
```
***
# Predictions
* Therefore we use the unpruned tree to make predictions on the test set:
```{r, fig.width = 4, fig.height = 4}
yhat <- predict(tree.boston, newdata = Boston[-train, ])  # predictions for Boston medv
boston.test <- Boston[-train, "medv"]                     # true medv values for the test set
plot(yhat, boston.test)
abline(0, 1)  # abline(a, b) draws a line with intercept a and slope b
mean((yhat - boston.test)^2)  # test set MSE
```
* This is the test set MSE: we get a value of 25.05. Its square root is about 5.005, indicating that this model yields test predictions that are within around $5,005 of the true median home value for the suburb.
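* Since *medv* is measured in thousands of dollars, the square root of the MSE converts directly to dollars:
```{r}
sqrt(mean((yhat - boston.test)^2)) * 1000  # test RMSE in dollars, roughly $5,005
```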
***