# Resampling
```{r resampling_opts, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
- **NOTE**: This chapter is currently being re-written and will likely change considerably in the near future. It is currently lacking in a number of ways, most notably in its narrative.
In this chapter we introduce **resampling** methods, in particular **cross-validation**. We will highlight the need for cross-validation by comparing it to our previous approach, which was to simply set aside some "test" data that we used for evaluating a model fit using "training" data. We will now refer to these held-out samples as the **validation** data and this approach as the **validation set approach**. Along the way we'll redefine the notion of a "test" dataset.
To illustrate the use of resampling techniques, we'll consider a regression setup with a single predictor $x$, and a regression function $f(x) = x^3$. Adding an additional noise parameter, we define the entire data generating process as
$$
Y \sim N(\mu = x^3, \sigma^2 = 0.25 ^ 2)
$$
We write an `R` function that generates datasets according to this process.
```{r}
gen_sim_data = function(sample_size) {
  x = runif(n = sample_size, min = -1, max = 1)
  y = rnorm(n = sample_size, mean = x ^ 3, sd = 0.25)
  data.frame(x, y)
}
```
We first simulate a single dataset, which we also split into a *train* and *validation* set. Here, the validation set is 20% of the data.
```{r}
set.seed(42)
sim_data = gen_sim_data(sample_size = 200)
sim_idx = sample(1:nrow(sim_data), 160)
sim_trn = sim_data[sim_idx, ]
sim_val = sim_data[-sim_idx, ]
```
We plot this training data, as well as the true regression function.
```{r}
plot(y ~ x, data = sim_trn, col = "dodgerblue", pch = 20)
grid()
curve(x ^ 3, add = TRUE, col = "black", lwd = 2)
```
We also write a helper function to calculate root mean squared error (RMSE).
```{r}
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
```
Recall that we needed this validation set because the training error was far too optimistic for highly flexible models. This would lead us to always use the most flexible model.
```{r}
fit = lm(y ~ poly(x, 10), data = sim_trn)
calc_rmse(actual = sim_trn$y, predicted = predict(fit, sim_trn))
calc_rmse(actual = sim_val$y, predicted = predict(fit, sim_val))
```
## Validation-Set Approach
- TODO: consider fitting polynomial models of degree k = 1:10 to data from this data generating process
- TODO: here, we can consider k, the polynomial degree, as a tuning parameter
- TODO: perform simulation study to evaluate how well validation set approach works
```{r}
num_sims = 100
num_degrees = 10
val_rmse = matrix(0, ncol = num_degrees, nrow = num_sims)
```
- TODO: each simulation we will...
```{r}
set.seed(42)
for (i in 1:num_sims) {
  # simulate data
  sim_data = gen_sim_data(sample_size = 200)
  # set aside validation set
  sim_idx = sample(1:nrow(sim_data), 160)
  sim_trn = sim_data[sim_idx, ]
  sim_val = sim_data[-sim_idx, ]
  # fit models and store RMSEs
  for (j in 1:num_degrees) {
    # fit model
    fit = glm(y ~ poly(x, degree = j), data = sim_trn)
    # calculate error
    val_rmse[i, j] = calc_rmse(actual = sim_val$y, predicted = predict(fit, sim_val))
  }
}
```
```{r echo = FALSE, fig.height = 5, fig.width = 10}
par(mfrow = c(1, 2))
matplot(t(val_rmse)[, 1:10], pch = 20, type = "b", ylim = c(0.17, 0.35), xlab = "Polynomial Degree", ylab = "RMSE", main = "RMSE vs Degree")
barcol = c("grey", "grey", "dodgerblue", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
barplot(table(factor(apply(val_rmse, 1, which.min), levels = 1:10)),
        ylab = "Times Chosen", xlab = "Polynomial Degree", col = barcol, main = "Model Chosen vs Degree")
```
- TODO: issues are hard to "see" but have to do with variability
## Cross-Validation
Instead of using a single test-train split, we will look to use $K$-fold cross-validation.
$$
\text{RMSE-CV}_{K} = \sum_{k = 1}^{K} \frac{n_k}{n} \text{RMSE}_k
$$
$$
\text{RMSE}_k = \sqrt{\frac{1}{n_k} \sum_{i \in C_k} \left( y_i - \hat{f}^{-k}(x_i) \right)^2 }
$$
- $n_k$ is the number of observations in fold $k$
- $C_k$ are the observations in fold $k$
- $\hat{f}^{-k}()$ is the trained model using the training data without fold $k$
If $n_k$ is the same in each fold, then
$$
\text{RMSE-CV}_{K} = \frac{1}{K}\sum_{k = 1}^{K} \text{RMSE}_k
$$
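To make the weighting concrete, here is a small sketch using made-up fold sizes and per-fold RMSEs (the values are purely illustrative); with equally sized folds, the weighted combination reduces to the simple mean.

```{r}
# purely illustrative per-fold results for K = 5 folds of equal size
n_k    = c(40, 40, 40, 40, 40)
rmse_k = c(0.26, 0.24, 0.27, 0.25, 0.23)
# weighted combination from the definition above
sum((n_k / sum(n_k)) * rmse_k)
# with equal fold sizes, this is just the mean of the per-fold RMSEs
mean(rmse_k)
```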
- TODO: create and add graphic that shows the splitting process
- TODO: Can be used with any metric, MSE, RMSE, class-err, class-acc
There are many ways to perform cross-validation in `R`, depending on the statistical learning method of interest. Some methods, for example `glm()` through `boot::cv.glm()` and `knn()` through `knn.cv()`, have cross-validation capabilities built in. We'll use `glm()` for illustration. First we need to convince ourselves that `glm()` can be used to perform the same tasks as `lm()`.
```{r}
glm_fit = glm(y ~ poly(x, 3), data = sim_trn)
coef(glm_fit)
lm_fit = lm(y ~ poly(x, 3), data = sim_trn)
coef(lm_fit)
```
By default, `cv.glm()` will report leave-one-out cross-validation (LOOCV).
```{r}
sqrt(boot::cv.glm(sim_trn, glm_fit)$delta)
```
We are actually given two values. The first is exactly the LOOCV-MSE. The second is a minor correction that we will not worry about. We take a square root to obtain LOOCV-RMSE.
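For linear models, there is also a well-known shortcut that computes the LOOCV error from a single fit using the leverage values (see the External Links section below). As a quick sketch, the following computes the LOOCV-MSE directly from `glm_fit`; it should match the first value above, and the intermediate name `loocv_mse` is introduced here only for illustration.

```{r}
# LOOCV-MSE from a single fit: residuals scaled by (1 - leverage), squared, averaged
loocv_mse = mean((resid(glm_fit) / (1 - hatvalues(glm_fit))) ^ 2)
# square root to match the LOOCV-RMSE reported above
sqrt(loocv_mse)
```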
In practice, we often prefer 5 or 10-fold cross-validation for a number of reasons, most importantly computational efficiency.
```{r}
sqrt(boot::cv.glm(sim_trn, glm_fit, K = 5)$delta)
```
We repeat the above simulation study, this time performing 5-fold cross-validation. With a total sample size of $n = 200$, each validation fold has 40 observations, as did the single validation set in the previous simulations.
```{r}
cv_rmse = matrix(0, ncol = num_degrees, nrow = num_sims)
```
```{r}
set.seed(42)
for (i in 1:num_sims) {
  # simulate data, use all data for training
  sim_trn = gen_sim_data(sample_size = 200)
  # fit models and store RMSEs
  for (j in 1:num_degrees) {
    # fit model
    fit = glm(y ~ poly(x, degree = j), data = sim_trn)
    # calculate error
    cv_rmse[i, j] = sqrt(boot::cv.glm(sim_trn, fit, K = 5)$delta[1])
  }
}
```
```{r, echo = FALSE, fig.height = 5, fig.width = 10}
par(mfrow = c(1, 2))
max_correct = max(max(table(apply(val_rmse, 1, which.min))), max(table(apply(cv_rmse, 1, which.min))))
barcol = c("grey", "grey", "dodgerblue", "grey", "grey", "grey", "grey", "grey", "grey", "grey")
barplot(table(factor(apply(val_rmse, 1, which.min), levels = 1:10)), ylim = c(0, max_correct),
        ylab = "Times Chosen", xlab = "Polynomial Degree", col = barcol, main = "Single Validation Set")
barplot(table(factor(apply(cv_rmse, 1, which.min), levels = 1:10)), ylim = c(0, max_correct),
        ylab = "Times Chosen", xlab = "Polynomial Degree", col = barcol, main = "5-Fold Cross-Validation")
```
```{r, echo = FALSE}
results = data.frame(
  degree = 1:10,
  colMeans(val_rmse),
  apply(val_rmse, 2, sd),
  colMeans(cv_rmse),
  apply(cv_rmse, 2, sd)
)
colnames(results) = c(
  "Polynomial Degree",
  "Mean, Val",
  "SD, Val",
  "Mean, CV",
  "SD, CV"
)
knitr::kable(results, digits = 3)
```
```{r, echo = FALSE, fig.height = 5, fig.width = 10}
par(mfrow = c(1, 2))
matplot(t(val_rmse)[, 1:10], pch = 20, type = "b", ylim = c(0.17, 0.35), xlab = "Polynomial Degree", ylab = "RMSE", main = "Single Validation Set")
matplot(t(cv_rmse)[, 1:10], pch = 20, type = "b", ylim = c(0.17, 0.35), xlab = "Polynomial Degree", ylab = "RMSE", main = "5-Fold Cross-Validation")
```
- TODO: differences: less variance, better selections
## Test Data
The following example, inspired by *The Elements of Statistical Learning*, will illustrate the need for a dedicated test set which is **never** used in model training. We do this, if for no other reason, because it gives us a quick sanity check that we have cross-validated correctly. To be specific, we will always test-train split the data, then perform cross-validation **within the training data**.
Essentially, this example will show how **not** to cross-validate. It will also give an example of cross-validation in a classification setting.
```{r}
calc_err = function(actual, predicted) {
  mean(actual != predicted)
}
```
Consider a binary response $Y$ that takes the values $0$ and $1$ with equal probability.
$$
Y \sim \text{bern}(p = 0.5)
$$
Also consider $p = 10,000$ independent predictor variables, $X_j$, each with a standard normal distribution.
$$
X_j \sim N(\mu = 0, \sigma^2 = 1)
$$
We simulate $n = 200$ observations from this data generating process. Notice that the way we've defined this process, none of the $X_j$ are related to $Y$.
```{r}
set.seed(42)
n = 200
p = 10000
x = replicate(p, rnorm(n))
y = c(rbinom(n = n, size = 1, prob = 0.5))
full_data = data.frame(y, x)
```
Before attempting to perform cross-validation, we test-train split the data, using half of the available data for each. (In practice, with this little data, it would be hard to justify a separate test dataset, but here we do so to illustrate another point.)
```{r}
trn_idx = sample(1:nrow(full_data), trunc(nrow(full_data) * 0.5))
trn_data = full_data[trn_idx, ]
tst_data = full_data[-trn_idx, ]
```
Now we would like to train a logistic regression model to predict $Y$ using the available predictor data. However, here we have $p > n$, which prevents us from fitting logistic regression. To overcome this issue, we will first attempt to find a subset of relevant predictors. To do so, we'll simply find the predictors that are most correlated with the response.
```{r}
# find correlation between y and each predictor variable
correlations = apply(trn_data[, -1], 2, cor, y = trn_data$y)
```
```{r, echo = FALSE}
hist(correlations, col = "grey", border = "dodgerblue")
```
While many of these correlations are small, with many very close to zero, some are as large as 0.40. Since our training data has 100 observations, we'll select the 25 predictors with the largest (absolute) correlations.
```{r}
selected = order(abs(correlations), decreasing = TRUE)[1:25]
correlations[selected]
```
We subset the training and test sets to contain only the response as well as these 25 predictors.
```{r}
trn_screen = trn_data[c(1, selected)]
tst_screen = tst_data[c(1, selected)]
```
Then we finally fit an additive logistic regression using this subset of predictors. We perform 10-fold cross-validation to obtain an estimate of the classification error.
```{r}
add_log_mod = glm(y ~ ., data = trn_screen, family = "binomial")
boot::cv.glm(trn_screen, add_log_mod, K = 10)$delta[1]
```
The 10-fold cross-validation suggests an error estimate of almost 30%. (Without a supplied cost function, `cv.glm()` reports average squared error rather than classification error here, but either way the estimate looks far better than random guessing.)
```{r}
add_log_pred = (predict(add_log_mod, newdata = tst_screen, type = "response") > 0.5) * 1
calc_err(predicted = add_log_pred, actual = tst_screen$y)
```
However, if we obtain an estimate of the error using the test set, we see an error rate of about 50%. No better than guessing! But since $Y$ has no relationship with the predictors, this is actually what we would expect. We'll call this incorrect method screen-then-validate.
Now, we will correctly screen-while-validate. Essentially, instead of simply cross-validating the logistic regression, we also need to cross-validate the screening process. That is, we won't simply use the same variables for each fold; instead, we will determine the "best" predictors within each fold.
For methods that do not have a built-in ability to perform cross-validation, or for methods that have limited cross-validation capability, we will need to write our own code for cross-validation. (Spoiler: This is not completely true, but let's pretend it is, so we can see how to perform cross-validation from scratch.)
This essentially amounts to randomly splitting the data, then looping over the splits. The `createFolds()` function from the `caret` package will make this much easier.
```{r}
caret::createFolds(trn_data$y, k = 10)
```
```{r}
# use the caret package to obtain 10 "folds"
folds = caret::createFolds(trn_data$y, k = 10)
# for each fold
# - pre-screen variables on the 9 training folds
# - fit model to these variables
# - get error on validation fold
fold_err = rep(0, length(folds))
for (i in seq_along(folds)) {
  # split for fold i
  trn_fold = trn_data[-folds[[i]], ]
  val_fold = trn_data[folds[[i]], ]
  # screening for fold i
  correlations = apply(trn_fold[, -1], 2, cor, y = trn_fold[, 1])
  selected = order(abs(correlations), decreasing = TRUE)[1:25]
  trn_fold_screen = trn_fold[, c(1, selected)]
  val_fold_screen = val_fold[, c(1, selected)]
  # error for fold i
  add_log_mod = glm(y ~ ., data = trn_fold_screen, family = "binomial")
  add_log_prob = predict(add_log_mod, newdata = val_fold_screen, type = "response")
  add_log_pred = ifelse(add_log_prob > 0.5, yes = 1, no = 0)
  fold_err[i] = mean(add_log_pred != val_fold_screen$y)
}
# report all 10 validation fold errors
fold_err
# properly cross-validated error
# this roughly matches what we expect in the test set
mean(fold_err)
```
- TODO: note that, even cross-validated correctly, this isn't a brilliant variable selection procedure. (it completely ignores interactions and correlations among the predictors. however, if it works, it works.) next chapters...
## Bootstrap
ISL discusses the bootstrap, which is another resampling method. However, it is less relevant to the statistical learning tasks we will encounter. It could be used to replace cross-validation, but requires significantly more computation.
It could be more useful if we were to attempt to calculate the bias and variance of a prediction (estimate) without access to the data generating process. Return to the bias-variance tradeoff chapter and think about how the bootstrap could be used to obtain estimates of bias and variance with a single dataset, instead of repeated simulated datasets.
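As a rough sketch of that idea (not a full treatment), the following resamples the rows of a single simulated dataset with replacement, refits a cubic model to each bootstrap sample, and uses the spread of the predictions at $x = 0.90$ to estimate the variance of the prediction. The names `boot_data`, `num_boot`, and `boot_pred` are introduced here only for illustration; estimating bias this way would additionally require the true value of the regression function, which is unknown in practice.

```{r}
set.seed(42)
boot_data = gen_sim_data(sample_size = 200)
num_boot  = 250
boot_pred = rep(0, num_boot)
for (b in 1:num_boot) {
  # resample rows with replacement, then refit the cubic model
  boot_idx = sample(1:nrow(boot_data), replace = TRUE)
  boot_fit = lm(y ~ poly(x, 3), data = boot_data[boot_idx, ])
  boot_pred[b] = predict(boot_fit, newdata = data.frame(x = 0.90))
}
# bootstrap estimate of the variance of the prediction at x = 0.90
var(boot_pred)
```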
## Which $K$?
- TODO: LOO vs 5 vs 10 (a quick numerical comparison is sketched below)
- TODO: bias and variance
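As a starting point for this comparison, here is a quick sketch that computes the 5-fold, 10-fold, and leave-one-out RMSE estimates for a single simulated dataset, using the chapter's `gen_sim_data()` and a cubic fit; the names `cv_data` and `cv_fit` are introduced only for illustration. Note that the $K$-fold results are random because the folds are random, while LOOCV is deterministic given the data.

```{r}
# compare RMSE estimates for the same model under different choices of K
set.seed(42)
cv_data = gen_sim_data(sample_size = 200)
cv_fit  = glm(y ~ poly(x, 3), data = cv_data)
sqrt(boot::cv.glm(cv_data, cv_fit, K = 5)$delta[1])
sqrt(boot::cv.glm(cv_data, cv_fit, K = 10)$delta[1])
# K = n gives leave-one-out cross-validation
sqrt(boot::cv.glm(cv_data, cv_fit, K = nrow(cv_data))$delta[1])
```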
## Summary
- TODO: using cross validation for: tuning, error estimation
## External Links
- [YouTube: Cross-Validation, Part 1](https://www.youtube.com/watch?v=m5StqDv-YlM) - Video from user "mathematicalmonk" which introduces $K$-fold cross-validation in greater detail.
- [YouTube: Cross-Validation, Part 2](https://www.youtube.com/watch?v=OcJwdF8zBjM) - Continuation which discusses selection and resampling strategies.
- [YouTube: Cross-Validation, Part 3](https://www.youtube.com/watch?v=mvbBycl8BNM) - Continuation which discusses choice of $K$.
- [Blog: Fast Computation of Cross-Validation in Linear Models](http://robjhyndman.com/hyndsight/loocv-linear-models/) - Details for using leverage to speed-up LOOCV for linear models.
- [OTexts: Bootstrap](https://www.otexts.org/1467) - Some brief mathematical details of the bootstrap.
## `rmarkdown`
The `rmarkdown` file for this chapter can be found [**here**](20-resampling.Rmd). The file was created using `R` version `r paste0(version$major, "." ,version$minor)`.