---
title: "02: Resampling, Random Forest & Boosting"
date: "`r Sys.time()`"
output:
  html_notebook:
    toc: yes
    theme: flatly
    number_sections: yes
editor_options:
  chunk_output_type: console
---
```{r setup}
library(mlr3verse) # Loads all the mlr3 stuff
library(ggplot2) # For plotting
# Tell mlr3 to be quiet unless something breaks
lgr::get_logger("mlr3")$set_threshold("error")
lgr::get_logger("bbotk")$set_threshold("error")
```
Goals of this part:

1. A quick look at the other available tasks and the "spam" task specifically
2. Resampling for model evaluation
3. Comparing learners with resampling
# Task Setup
As we've seen last time, our penguin data is pretty easy for our learners.
We need something a little more complex, meaning more observations (n)
and a couple more predictors (p).
If you're looking for ready-made example tasks for your own experimentation,
`mlr3` comes with a couple you can try. The procedure is similar to `mlr_learners`
and `mlr_measures`, but this time it's, you guessed it,
```{r task-dictionary}
mlr_tasks
# As a neat little table with all the relevant info
as.data.table(mlr_tasks)
```
I suggest we try a two-class (i.e. binary) classification problem next, and
something we can all relate to: **Spam detection**.
```{r task-spam}
spam_task <- tsk("spam")
# Outcome categories
spam_task$class_names
# Target classes are not horribly unbalanced (40/60%)
autoplot(spam_task)
```
## Your turn!
Explore the task a little (you can use its built-in methods and fields, `spam_task$...`)
to check the data, the column types, the dataset dimensions, etc.
Treat this as a new analysis problem: what do we need to know here?
(If you'd like a simpler overview, you can open the help page with `spam_task$help()`)
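A few possible starting points, sketched with standard `mlr3` task fields and methods (pick whichever of these you find useful):
```{r task-exploration-sketch}
# Dimensions and column types
spam_task$nrow
spam_task$ncol
spam_task$feature_types

# Target column and positive class
spam_task$target_names
spam_task$positive

# A quick peek at the data and a check for missing values
spam_task$head()
spam_task$missings()
```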
# Resampling
To continue with our kNN and tree experiments, we'll now enter resampling territory.
You've probably realized that predicting on a single test set doesn't tell us
much about how well our model generalizes, which is why we use resampling.
There are [lots of resampling strategies](https://mlr3.mlr-org.com/reference/index.html#resampling-strategies),
but you usually can't go too wrong with cross-validation (CV), which doesn't
take much time or computing power on small datasets like ours.
Instead of training our model just once, we're going to train it 5 times with
different training and test sets. In `{mlr3}`, we first pick a resampling
strategy with `rsmp()` and then define our setup with `resample()` based on
our task and resampling strategy. The result is an object of class `ResampleResult`:
```{r knn-resample-cv}
rr <- resample(
task = spam_task,
# Optional: Adjust learner parameters
learner = lrn("classif.kknn", k = 13, predict_sets = c("train", "test")),
resampling = rsmp("cv", folds = 5) # 5-fold CV
)
rr
# Contains task/learner/resampling/prediction objects
as.data.table(rr)
```
Note that we also told our learner to predict on both the train and test sets.
Now we take a look at our predictions *per resampling iteration*, for both
train- and test-set accuracy:
```{r knn-resample-acc}
traintest_acc <- list(
msr("classif.acc", predict_sets = "train", id = "acc_train"),
msr("classif.acc", id = "acc_test")
)
rr$score(traintest_acc)[, .(iteration, acc_train, acc_test)]
```
And on average over the resampling iterations:
```{r knn-resample-acc-avg}
rr$aggregate(traintest_acc)
```
As expected, our learner performed better on the training data than on the test data.
By default, `{mlr3}` only gives us the results for the test data, which we'll be
focusing on going forward.
Being able to compare train and test performance is still useful, though, to make sure
we're not hopelessly overfitting on the training data!
Also note how we trained our learner, ran cross-validation, and got scores
without having to do any extra work.
That's why we use `{mlr3}` instead of doing everything manually --- abstraction is nice.
## One more thing: Measures
So far we've always used accuracy, i.e. the proportion of correct classifications,
as our measure. For a problem such as spam detection that might not be the best choice:
it can be more useful to consider the **probability** that an e-mail is spam
and to adjust the **threshold** at which we start rejecting mail.
For a class prediction we might say that if `prob(is_spam) > 0.5` the message is classified
as spam, but maybe we'd rather be more conservative and only flag a message as
spam at a probability above, say, 70%. This changes the relative amounts of true and false positives and negatives:
```{r knn-binary-thresholds}
knn_learner <- lrn("classif.kknn", predict_type = "prob", k = 13)
set.seed(123)
spam_split <- partition(spam_task)
knn_learner$train(spam_task, spam_split$train)
knn_pred <- knn_learner$predict(spam_task, spam_split$test)
# Measures: True Positive Rate (Sensitivity) and True Negative Rate (Specificity)
measures <- msrs(c("classif.tpr", "classif.tnr"))
knn_pred$confusion
knn_pred$score(measures)
# Threshold of 70% probability of spam:
knn_pred$set_threshold(0.7)
knn_pred$confusion
knn_pred$score(measures)
# If a 20% spam probability were already enough for us to mark a message as spam:
knn_pred$set_threshold(0.2)
knn_pred$confusion
knn_pred$score(measures)
```
For a better analysis than just manually trying out different thresholds, we can use [ROC curves](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
This requires a learner that predicts the spam *probability* (`predict_type = "prob"`,
as we set above) instead of only the class label `"spam"` or `"nonspam"`.
We then call `autoplot()` on the prediction with `type = "roc"` to get an ROC curve:
```{r knn-roc}
autoplot(knn_pred, type = "roc")
```
This gives us the false positive rate (FPR, "1 - Specificity") and the
Sensitivity (or true positive rate, TPR) for our binary classification example.
"Positive" here means "the e-mail is spam". If our classification model was
basically a random coin flip, we would expect the curve to follow the diagonal (depicted
in grey in the plot). Everything above the diagonal, towards the upper left, is better than random.
To condense this to a single number, we use the AUC, the *area under the (ROC) curve*.
If this AUC is 0.5, our model is basically a coin toss --- and if it's 1, that
means our model is *perfect*, which is usually too good to be true and means
we overfit in some way or the data is weird.
To get the AUC we use `msr("classif.auc")` instead of `"classif.acc"` going forward.
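For example, we can already score the holdout prediction from above by AUC (a quick sketch; the prediction contains probabilities, so this works directly):
```{r knn-auc}
# AUC of the kNN holdout prediction from above
knn_pred$score(msr("classif.auc"))
```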
## Your Turn!
1. Repeat the same resampling steps for the `{rpart}` decision tree learner
   (resample with 5-fold CV, evaluate based on test accuracy).
   - Does it fare better than kNN with default parameters? (See the sketch after this list.)
2. Repeat the resampling for either learner with different hyperparameters.
3. `{mlr3viz}` provides alternatives to the ROC curve, described in the help page `?autoplot.PredictionClassif`.
   Experiment with precision-recall and threshold curves. Which would you consider most useful?

This is technically peck-and-find hyperparameter tuning, which we'll do in a
more convenient (and methodologically sound) way in the next part :)
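A minimal sketch for the first and third exercise (hyperparameters left at their defaults; adjust as you see fit):
```{r rpart-resample-sketch}
# Exercise 1: same 5-fold CV, this time with the rpart decision tree
rr_tree <- resample(
  task = spam_task,
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 5)
)
rr_tree$aggregate(msr("classif.acc"))

# Exercise 3: precision-recall curve as an alternative to ROC
autoplot(knn_pred, type = "prc")
```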
## Benchmarking
The next thing we'll try is to resample across multiple learners at once --- because doing resampling for each learner separately and comparing results is just too tedious.
Let's set up our learners with default parameters and compare them against
a dummy *featureless* learner, which serves as a naive baseline. This learner always predicts the average of the target or the majority class in this case, so it's the worst possible learner that should be beatable by any reasonable algorithm!
```{r bm-setup}
learners <- list(
lrn("classif.kknn", id = "knn", predict_type = "prob"),
lrn("classif.rpart", id = "tree", predict_type = "prob"),
lrn("classif.featureless", id = "Baseline", predict_type = "prob")
)
# Define task, learners and resampling strategy in a benchmark design
design <- benchmark_grid(
tasks = spam_task, # Still the same task
learners = learners, # The new list of learners
resamplings = rsmp("cv", folds = 3) # Same resampling strategy as before
)
# Run the benchmark and save the results ("BenchmarkResult" object)
bmr <- benchmark(design)
bmr
# It's just a collection of ResampleResult objects from before!
as.data.table(bmr)
```
`bmr` contains everything we'd like to know about our comparison. We extract the scores
and take a look:
```{r bm-results-full}
bm_scores <- bmr$score(msr("classif.auc"))
bm_scores
# Per-iteration AUC per learner (only the first five rows shown)
bm_scores[1:5, .(learner_id, iteration, classif.auc)]
# To get all results for the tree learner
bm_scores[learner_id == "tree", .(learner_id, iteration, classif.auc)]
# Or the results of the first iteration
bm_scores[iteration == 1, .(learner_id, iteration, classif.auc)]
```
And if we want to see what worked best overall:
```{r bm-results-aggregate}
bmr$aggregate(msr("classif.auc"))[, .(learner_id, classif.auc)]
autoplot(bmr, measure = msr("classif.auc"))
```
```{r bm-roc}
autoplot(bmr, type = "roc")
```
We see what we'd expect regarding the featureless learner --- it's effectively a coin toss.
Also, kNN does quite a bit better than the decision tree with the default
parameters here.
Of course including the featureless learner here doesn't really add any insights,
especially since we evaluate by AUC, where the featureless learner gets a score of 0.5 by definition.
## Your Turn!
Since we have a binary classification problem, we might even get away with using
plain old logistic regression.
Instead of benchmarking against the featureless learner, compare kNN and decision trees
to the logistic regression learner `"classif.log_reg"` without any hyperparameters.
Do our fancy ML methods beat the good old GLM?
Use the best hyperparameter settings for kNN and `rpart` you have found so far. One possible setup is sketched below.
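A sketch of one way to set this up; the hyperparameter values here are placeholders, so swap in the best ones you found:
```{r bm-logreg-sketch}
learners_glm <- list(
  # k = 13 is a placeholder -- use your best kNN setting
  lrn("classif.kknn", id = "knn", predict_type = "prob", k = 13),
  lrn("classif.rpart", id = "tree", predict_type = "prob"),
  lrn("classif.log_reg", id = "logreg", predict_type = "prob")
)

design_glm <- benchmark_grid(
  tasks = spam_task,
  learners = learners_glm,
  resamplings = rsmp("cv", folds = 3)
)

bmr_glm <- benchmark(design_glm)
bmr_glm$aggregate(msr("classif.auc"))[, .(learner_id, classif.auc)]
```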
# Random Forests & Boosting
Armed with our new model comparison skills, we can add Random Forests and
Gradient Boosting to the mix!
Our new learner IDs are:

- `"classif.ranger"` for Random Forest, see `?ranger::ranger`
- `"classif.xgboost"` for (eXtreme) Gradient Boosting, see `?xgboost::xgboost`
  (You know it has to be fancy if it has "extreme" in the name!)
Both learners already do fairly well without tweaking hyperparameters, except
for the `nrounds` value in `xgboost`, which sets the number of boosting iterations.
The default in the `{mlr3}` learner is 1, which kind of defeats the purpose of boosting.
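For example, we can override `nrounds` when constructing the learner (100 here is an arbitrary example value, not a tuned choice):
```{r xgboost-nrounds}
# nrounds = 1 (the default) would mean a single boosting iteration
xgb_learner <- lrn("classif.xgboost", predict_type = "prob", nrounds = 100)
xgb_learner$param_set$values$nrounds
```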
## Your Turn!
Use the benchmark setup from above and swap the `kknn` and `rpart` learners
for the Random Forest and Boosting learners.
Maybe switch to holdout resampling to speed things up a little, and make
sure to set `nrounds` to something greater than 1 for `xgboost` (a sketch follows below).
What about now? Can we beat logistic regression?
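A sketch of one possible setup, reusing the `xgb_learner` defined above:
```{r rf-xgb-benchmark-sketch}
design_rf_xgb <- benchmark_grid(
  tasks = spam_task,
  learners = list(
    lrn("classif.ranger", id = "rf", predict_type = "prob"),
    xgb_learner, # from the chunk above, with nrounds = 100
    lrn("classif.log_reg", id = "logreg", predict_type = "prob")
  ),
  resamplings = rsmp("holdout") # A single train/test split to keep things quick
)

bmr_rf_xgb <- benchmark(design_rf_xgb)
bmr_rf_xgb$aggregate(msr("classif.auc"))[, .(learner_id, classif.auc)]
```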