---
title: '01: kNN & Trees'
date: "`r Sys.time()`"
output:
  html_notebook:
    toc: yes
    theme: flatly
    number_sections: yes
editor_options:
  chunk_output_type: console
---
```{r setup}
library(ggplot2) # For plotting
library(palmerpenguins) # For penguins
library(kknn) # For kNN models
library(rpart) # For decision trees
```
Goals of this part:
1. Take a look at our example dataset
2. Introduce kNN via `{kknn}` and decision trees via `{rpart}`
3. Train some models, look at some results
4. Introduce `{mlr3}` and do 3. again, but nicer
# The dataset: Penguins!
See [their website](https://allisonhorst.github.io/palmerpenguins/) for some more information if you're interested.
For now it's enough to know that we have a bunch of data about 3 species of penguins.
![](img/lter_penguins.png)
```{r penguins}
# remove missing values for simplicity in this example
# (handling missing data is a can of worms for another time :)
penguins <- na.omit(penguins)
str(penguins)
```
![](img/culmen_depth.png)
We can take a look at the different species across two numeric features, starting with flipper length and body mass (for reasons that may become clear later):
```{r penguins-plot}
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Palmer Penguins",
x = "Flipper Length [mm]", y = "Body Mass [g]",
color = "Species"
) +
theme_minimal() +
theme(legend.position = "top")
```
We split our penguin data into roughly 2/3 training and 1/3 test datasets for our first experiments:
```{r penguin-split-manual}
penguin_N <- nrow(penguins) # Our total sample size
# We randomly draw 2/3 of all row indices, with a fixed seed for reproducibility
set.seed(234)
train_ids <- sample(penguin_N, replace = FALSE, size = penguin_N * 2/3)
# Our test set are the indices not in the training set
test_ids <- setdiff(1:penguin_N, train_ids)
# Assemble our train/test set using the indices we just randomized
penguins_train <- penguins[train_ids, ]
penguins_test <- penguins[test_ids, ]
```
# kNN and trees, step by step
Now that we have some data, we can start fitting models, just to see how it goes!
Given the plot from earlier, we may have a rough idea that the flipper length and body mass measurements are already giving us a somewhat decent picture about the species.
## kNN
Let's start with the nearest-neighbor approach via the `{kknn}` package.
It takes a `formula` argument like you may know from `lm()` and other common modeling functions in R, where the format is `predict_this ~ on_this + and_that + ...`.
```{r kknn-fit}
knn_penguins <- kknn(
formula = species ~ flipper_length_mm + body_mass_g,
k = 3, # Hyperparameter: How many neighbors to consider
train = penguins_train, # Training data used to make predictions
test = penguins_test # Data to make predictions on
)
# Peek at the predictions, one row per observation in test data
head(knn_penguins$prob)
```
To get an idea of how well our predictions fit, we add them to our original test data and compare observed (true) and predicted species:
```{r knn-prediction-check}
# Add predictions to the test dataset
penguins_test$knn_predicted_species <- fitted(knn_penguins)
# Rows: True species, columns: predicted species
table(penguins_test$species, penguins_test$knn_predicted_species,
dnn = c("Observed", "Predicted"))
# Proportion of correct predictions ("accuracy")
# (R shortcut: a logical comparison gives a TRUE/FALSE vector, which R
# treats as 1/0 in arithmetic --- summing it counts the correct
# classifications, and dividing by N, i.e. taking the mean, gives the proportion)
mean(penguins_test$species == penguins_test$knn_predicted_species)
# Proportion of **incorrect** predictions (classification error)
mean(penguins_test$species != penguins_test$knn_predicted_species)
```
## Your turn!
Above you have working code for an acceptable but *not great* kNN model.
Can you make it even better?
Can you change something to make it *worse*?
Some things to try:
1. Try different predictors, maybe leave some out
- Which seem to work best?
2. Try using all available predictors (`formula = species ~ .`)
- Would you recommend doing that? Does it work well?
3. Try different `k` values.
Is higher == better? (You can stick to odd numbers)
- After you've tried a couple `k`'s, does it get cumbersome yet?
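If trying one `k` at a time does get cumbersome, a loop helps. Here's a minimal sketch using base R (the `knn_accuracy()` helper is our own invention, and it assumes the `penguins_train`/`penguins_test` objects from the split above):
```{r knn-k-loop, eval=FALSE}
# Hypothetical helper: fit kNN for a given k and return the test accuracy
knn_accuracy <- function(k) {
  fit <- kknn(
    formula = species ~ flipper_length_mm + body_mass_g,
    k = k,
    train = penguins_train,
    test = penguins_test
  )
  mean(penguins_test$species == fitted(fit))
}
# Try a handful of odd k's and collect the accuracies in one table
ks <- c(1, 3, 5, 7, 9, 15)
data.frame(k = ks, accuracy = vapply(ks, knn_accuracy, numeric(1)))
```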
## Growing a decision tree
Now that we've played around with kNN a little, let's grow some trees!
We'll use the `{rpart}` (**R**ecursive **Part**itioning) package, starting with the same model specification as before and the default parameters.
```{r tree-fit}
rpart_penguins <- rpart(
formula = species ~ flipper_length_mm + body_mass_g,
data = penguins_train, # Train data
method = "class", # Grow a classification tree (don't change this)
cp = 0.01, # Complexity parameter (default = 0.01)
minsplit = 20, # Default = 20
maxdepth = 30 # Default (and upper limit for rpart!) = 30
)
```
The nice thing about a single tree is that you can just look at it and know exactly what it did:
```{r tree-model}
rpart_penguins
# Looking at the tree as a... tree.
plot(rpart_penguins)
text(rpart_penguins)
# Much nicer to use the rpart.plot package
library(rpart.plot)
rpart.plot(rpart_penguins)
```
If we want to know how accurate our model is, we need to make predictions on our test data manually:
```{r tree-predict}
rpart_predictions <- predict(
rpart_penguins, # The model we just fit
newdata = penguins_test, # New data to predict species on
type = "class" # We want class predictions (the species), not probabilities
)
penguins_test$rpart_predicted_species <- rpart_predictions
# Same procedure as with kNN before
table(penguins_test$species, penguins_test$rpart_predicted_species)
# And our accuracy score
mean(penguins_test$species == penguins_test$rpart_predicted_species)
```
## Your turn!
We haven't picked any hyperparameter settings for our tree yet --- maybe we should try?
1. What hyperparameters does `rpart()` offer?
Do you recognize some from the lecture?
- You can check via `?rpart.control`
- When in doubt check `minsplit`, `maxdepth` and `cp`
2. Try out trees with different parameters
- Would you prefer simple or complex trees?
- How far can you improve the tree's accuracy?
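If you'd like a starting point, here's a hedged sketch comparing a deliberately shallow tree with a deliberately complex one (the parameter values are arbitrary picks, not recommendations):
```{r tree-params-sketch, eval=FALSE}
# A very shallow tree: at most 2 levels of splits
rpart_shallow <- rpart(species ~ flipper_length_mm + body_mass_g,
  data = penguins_train, method = "class", maxdepth = 2
)
# A more complex tree: lower complexity penalty, smaller minimum split size
rpart_complex <- rpart(species ~ flipper_length_mm + body_mass_g,
  data = penguins_train, method = "class", cp = 0.001, minsplit = 5
)
# Compare test accuracies
mean(penguins_test$species == predict(rpart_shallow, newdata = penguins_test, type = "class"))
mean(penguins_test$species == predict(rpart_complex, newdata = penguins_test, type = "class"))
```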
So, what seems to work better here?
kNN or trees?
## Plotting decision boundaries (for 2 predictors)
This is a rather cumbersome manual approach --- there's a nicer way we'll see later, but we'll do it the manual way at least once so you know how it works:
```{r decision-boundary-plot}
# Decision tree to plot the boundaries of
rpart_penguins <- rpart(
formula = species ~ flipper_length_mm + body_mass_g,
data = penguins_train, # Train data
method = "class", # Grow a classification tree (don't change this)
cp = 0.01, # Complexity parameter (default = 0.01)
minsplit = 20, # Default = 20
maxdepth = 30 # Default (and upper limit for rpart!) = 30
)
# Ranges of X and Y variable on plot
flipper_range <- range(penguins$flipper_length_mm)
mass_range <- range(penguins$body_mass_g)
# A grid of values within these boundaries, 100 points per axis
pred_grid <- expand.grid(
flipper_length_mm = seq(flipper_range[1], flipper_range[2], length.out = 100),
body_mass_g = seq(mass_range[1], mass_range[2], length.out = 100)
)
# Predict with tree for every single point
pred_grid$rpart_prediction <- predict(rpart_penguins, newdata = pred_grid, type = "class")
# Plot all predictions, colored by species
ggplot(pred_grid, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_tile(aes(color = rpart_prediction, fill = rpart_prediction), linewidth = 1, show.legend = FALSE) +
geom_point(data = penguins_test, aes(fill = species), shape = 21, color = "black", size = 2) +
labs(
title = "Palmer Penguins: Decision Boundaries",
subtitle = "Species as predicted by decision tree",
x = "Flipper Length [mm]", y = "Body Mass [g]",
fill = "Species"
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid = element_blank()
)
```
# Introducing `{mlr3}`
```{r setup-mlr3}
library(mlr3verse) # includes mlr3, mlr3learners, mlr3tuning, mlr3viz, ...
```
Now, imagine you want to try out some more hyperparameters for either `rpart()`
or `kknn()` or both, and then you want to compare the two --- that would probably
be kind of tedious unless you write some wrapper functions, right?
Well, luckily we're not the first people to do some machine learning!
Our code above works (hopefully), but for any given model or algorithm, there are
different R packages with slightly different interfaces, and memorizing or looking up how they work can be a tedious and error-prone task, especially when we want to repeat the same general steps with each learner.
`{mlr3}` and add-on packages unify all the common tasks with a consistent interface.
## Creating a task
The task encapsulates our data, including which variables we're using to learn
and which we want to predict.
Tasks can be created in [multiple ways and some standard example tasks are available in mlr3](https://mlr3book.mlr-org.com/chapters/chapter2/data_and_basic_modeling.html), but we're taking the long way around.
Questions a task object answers:
- What kind of prediction are we doing?
- Here: Classification (instead of e.g. regression)
- What are we predicting on?
- The dataset, specified as `backend =` (could be a `data.frame`, `matrix`, or a proper database)
- What variable are we predicting?
- The `target` variable, here `species`
- Which variables are we using for predictions?
- The `feature` variables, which we can adjust
So let's put our penguins into a task object for `{mlr3}` and inspect it:
```{r penguins-task-creation}
# Creating a classification task from our penguin data
penguin_task <- TaskClassif$new(
id = "penguins",
backend = penguins,
target = "species"
)
# Contains our penguin dataset
penguin_task$data()
# We can ask it about e.g. our sample size and number of features
penguin_task$nrow
penguin_task$ncol
# And what the classes are
penguin_task$class_names
# What types of features do we have?
# (relevant for learner support, some don't handle factors for example!)
penguin_task$feature_types
```
(Note: Instead of `TaskClassif$new()`, you could also use `as_task_classif()` as a wrapper function if you prefer --- R6 classes may be unfamiliar territory)
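For illustration, here's the same task built with that wrapper (we use the name `penguin_task_alt` only to avoid overwriting the task above):
```{r penguins-task-wrapper}
penguin_task_alt <- as_task_classif(penguins, target = "species", id = "penguins")
penguin_task_alt
```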
We can further inspect and modify the task after the fact if we choose:
```{r penguins-task-modification}
# Display feature and target variable assignment
penguin_task$col_roles[c("feature", "target")]
# Maybe not all variables are useful for this task, let's remove some
penguin_task$set_col_roles(cols = c("island", "sex", "year"), remove_from = "feature")
# We can also explicitly assign the feature columns
penguin_task$col_roles$feature <- c("body_mass_g", "flipper_length_mm")
# Check what our current variables and roles are
penguin_task$col_roles[c("feature", "target")]
```
Some variables may have missing values --- if we had not excluded them in the beginning, you would find them here:
```{r penguins-task-missings}
penguin_task$missings()
```
### Train and test split
We're mixing things up with a new train/test split, just for completeness in the example.
For `{mlr3}`, we only need to save the indices / row IDs.
There's a handy `partition()` function that does the same thing we did with `sample()` earlier, so let's use that!
```{r penguin-split}
set.seed(26)
penguin_split <- partition(penguin_task, ratio = 2/3)
# Contains vector of row_ids for train/test set
str(penguin_split)
```
We can now use `penguin_split$train` and `penguin_split$test` with every `mlr3` function that has a `row_ids` argument.
(Note: The `row_ids` of a task are not necessarily `1:N` --- there is no guarantee they start at 1, go up to N, or contain all integers in between. We can generally only expect them to be unique within a task!)
## Picking a Learner
A learner encapsulates the fitting algorithm as well as any relevant hyperparameters,
and `{mlr3}` supports a whole lot of learners to choose from.
We'll keep using `{kknn}` and `{rpart}` in the background for the classification task,
but we'll use `{mlr3}` on top of them for a consistent interface.
So first, we have to find the learners we're looking for.
There's a lot more [about learners in the mlr3 book](https://mlr3book.mlr-org.com/chapters/chapter2/data_and_basic_modeling.html#sec-learners),
but for now we're happy with the basics.
```{r learners, eval=FALSE}
# Lots of learners to choose from here:
mlr_learners
# Or as a large table:
as.data.table(mlr_learners)
# To show all classification learners (in *currently loaded* packages!)
mlr_learners$keys(pattern = "classif")
# There are also regression learners we don't need right now:
mlr_learners$keys(pattern = "regr")
# For the kknn OR rpart learners, we can use regex
mlr_learners$keys(pattern = "knn|rpart")
mlr_learners$keys(pattern = "classif\\.(kknn|rpart)")
```
Now that we've identified our learners, we can get them quickly via the `lrn()` helper function:
```{r knn-learner}
knn_learner <- lrn("classif.kknn")
# What's this learner about?
knn_learner$help()
# What parameters does this learner have?
knn_learner$param_set$ids()
# Setting parameters leaves the others as default
knn_learner$param_set$values <- list(k = 5)
# Equivalent way to set a single hyperparameter:
knn_learner$param_set$values$k <- 5
# We can also set the parameters directly when we construct the learner object
knn_learner <- lrn("classif.kknn", k = 7)
knn_learner$param_set$values$k
```
We'll save the `{rpart}` learner for later, but the methods for getting help, looking at possible hyperparameters, and setting them are all the same!
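For example, constructing the learner and listing its hyperparameters works exactly as with kNN (a quick sketch --- we'll leave training and tweaking it to you):
```{r rpart-learner-peek}
rpart_learner <- lrn("classif.rpart")
rpart_learner$param_set$ids()
```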
## Training and evaluating
We can train the learner once to see if it works as we expect it to.
```{r knn-train-once}
# Train learner on training data
knn_learner$train(penguin_task, row_ids = penguin_split$train)
# Look at stored model, which for knn is not very interesting
knn_learner$model
```
And we can make predictions on the test data:
```{r knn-predict-once}
knn_prediction <- knn_learner$predict(penguin_task, row_ids = penguin_split$test)
knn_prediction
```
Our predictions are looking quite reasonable, mostly:
```{r knn-confusion-matrix}
# Confusion matrix we got via table() previously
knn_prediction$confusion
```
To calculate the prediction accuracy, we don't have to do any math in small steps.
`{mlr3}` comes with lots of measures (like accuracy) we can use; they're organized
in the `mlr_measures` object (just like `mlr_learners`).
We're using `"classif.acc"` here, retrieved with the shorthand function `msr()`, and
score our predictions with this measure using the `$score()` method.
```{r knn-accuracy-once}
# Available measures for classification tasks
mlr_measures$keys(pattern = "classif")
# Scores according to the selected measure
knn_prediction$score(msr("classif.acc"))
# The complement: classification error (1 - accuracy)
knn_prediction$score(msr("classif.ce"))
```
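As a small aside, `$score()` also accepts a list of measures, so `msrs()` (plural) gets us both numbers in one call:
```{r knn-multiple-measures}
knn_prediction$score(msrs(c("classif.acc", "classif.ce")))
```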
As a bonus feature, `{mlr3}` also makes it easy for us to plot the decision boundaries for a two-predictor case, so we don't have to manually predict on a grid anymore.
(You can ignore any warning messages from the plot here)
```{r mlr3-plot-decision-boundaries}
plot_learner_prediction(
learner = knn_learner,
task = penguin_task
)
```
## Your turn!
Now that we've done the kNN fitting with `{mlr3}`, you can easily do the same thing with the `rpart` learner!
All you have to do is switch the learner objects and give them more fitting names.
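If you want to check your result afterwards, one possible sketch (kept at `eval=FALSE` on purpose --- try it yourself first):
```{r rpart-mlr3-sketch, eval=FALSE}
rpart_learner <- lrn("classif.rpart")
rpart_learner$train(penguin_task, row_ids = penguin_split$train)
rpart_prediction <- rpart_learner$predict(penguin_task, row_ids = penguin_split$test)
rpart_prediction$confusion
rpart_prediction$score(msr("classif.acc"))
```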
# Useful links
- [mlr3 cheatsheet with basic syntax](https://cheatsheets.mlr-org.com/mlr3.pdf)
- [mlr3 book](https://mlr3book.mlr-org.com/)