---
pagetitle: "Reproducibility, DfE Data Science Week 2020"
output:
  xaringan::moon_reader:
    css: ["default", "styles/custom.css", "styles/custom-fonts.css"]
    seal: false
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: center, middle, inverse
# `r emo::ji("arrows_counterclockwise")` Reproducibility in R: three things
DfE Data Science Week, 2020-01-22
`r icon::fa('twitter')` [mattdray](https://twitter.com/mattdray)
`r icon::fa('github')` [matt-dray](https://github.com/matt-dray)
`r icon::fa('globe')` [rostrum.blog](https://www.rostrum.blog/)
???
* What reproducibility is, why you need it, some practical things you can do
* Focus is on R, but transferable
* You might be doing a bunch of this stuff already
* But hopefully you'll hear about something new
---
class: middle
tl;dr `r emo::ji("sleeping")`
* make your work reproducible
* unless you hate everyone
* especially yourself
???
* Reproducibility benefits anyone who wants to use your code or recreate work that's been done
* If you don't care about anyone else: it will still benefit you, specifically
* Let's say you're returning to some work after a period away, or a new publication is due
---
class: inverse, middle, center
# 'Reproducible'
???
* We should probably define 'reproducible'
* Turns out there's more than one definition
---
class: middle, center
![](img/turing-way-reproducibility.jpg)
From [The Turing Way](https://the-turing-way.netlify.com/introduction/introduction) by The Alan Turing Institute
???
* I think we care about two things: reproducing outputs we've already created (to prove that we can reliably recreate the outcome) and updating them with fresh data (like the next quarter's data)
* But I'm going to refer to 'reproducible' throughout
---
class: middle
Can I recreate what you did:
* from scratch?
--
* on a different machine?
--
* in the future?
--
* without you present?
???
* A simpler way of thinking about it might be to answer these questions
* We should know all the preparatory steps
* Maybe I don't have the same packages, maybe I'm using another OS
* Dependency changes break things
* What if you leave the department?
---
class: center, middle
<img src='img/trex-my-machine.gif' width=100%>
You want to avoid saying this
???
* You should be confident that it can run _where_ it needs to run, _when_ it needs to run
---
class: middle
Reproducibility can help:
* reduce errors
--
* improve trust
--
* help you share
--
* speed up work
???
* You know it can be re-run to give the same results
* Others can see the steps taken and can recreate it themselves
* Everything you need in one folder/repo; all instructions in the box
* You're not building over old work and obfuscating and complicating things
---
class: middle, center
<img src='img/rap_v4_hex.png' width='50%'>
[Reproducible Analytical Pipelines](https://ukgovdatascience.github.io/rap-website/) (RAP)
???
* There's a grassroots movement to enact all this stuff in government already
* Visit the website (link in slide)
* Join #rap-collaboration on govdatascience.slack.com
* DfE are already active: speak to Cameron R or Laura S
---
class: middle, inverse
# Three things
???
* I promised three things
* I hope everyone learns something new, even if small
* The things are roughly in order of how you might progress on your reproducibility journey
---
class: middle
1. Centralise everything
1. Report with code
1. Manage workflows
???
* These are very broad -- what do they mean?
* Let's go through one by one
* I'll be focusing on R, but I think the points are transferable and Python analogues are available
---
class: inverse, middle
# 0\. Code everything
Haha, suckers, I zero-indexed my list!
???
* Before you do anything, switch to code from point-and-click
* I'll be picking on Excel a little bit
* Excel isn't _always_ bad
* It's just easier to be reproducible with code
---
class: middle, center
<img src='img/slack-q.png' width=100%>
<img src='img/slack-a.png' width=100%>
???
* We want to move away from workflows where we have lots of workbooks with multiple sheets
* Move away from pointing and clicking
* Move away from forgetting to drag formulae across the right cells
* Move toward coded analysis to provide a recipe for the workflow
* Move toward better documentation
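
As a rough illustration of what a coded 'recipe' can look like, here is a minimal base R sketch. It uses the built-in `iris` data purely as a stand-in for your own data, and the object name `species_means` is made up for this example.

```r
# Summarise a built-in dataset; every step is written down and repeatable
# (iris is only a stand-in for your own data)
species_means <- aggregate(
  cbind(Sepal.Length, Petal.Length) ~ Species,  # columns to summarise, by group
  data = iris,
  FUN = mean
)

species_means
```

Anyone with the script can re-run it and see exactly how the summary was produced, with no formulae to drag across cells.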
---
class: middle, center
<img src='img/blueball.gif' width=100%>
Scene from an Excel-based workflow
???
* Question everything; baulk at phrases like 'this is how we did it last year'
* Excel workflows might start simple, but can get clogged up very easily
* They don't encourage documentation
* They're 'data up front, code in the background', but we want the opposite
---
class: middle
Some resources:
* [Spreadsheet horror stories](http://www.eusprig.org/horror-stories.htm) by EuSpRiG
* [Excel vs R: a brief intro](https://www.jessesadler.com/post/excel-vs-r/) by Jesse Sadler
* [Arguments on switching from Excel to R](https://community.rstudio.com/t/pre-teaching-r-whats-your-best-argument-for-switching-to-r-from-excel/3182) from the RStudio Community
???
* EuSpRiG: European Spreadsheet Risks Interest Group
---
class: inverse, middle
# 1. Centralise everything
???
* 'Centralise' your code, data, documentation, etc
* As in put the code and any other needed files in one place
* Consider generalising and sharing functions
---
class: middle
`r emo::ji("card_index_dividers")` Use [R Projects](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to:
* put code, data and docs in one place
* so everything you need is together
* to make your analysis shareable
???
* Code your workflow with comments and documentation
* Use a project-oriented workflow
* You can easily pass the Project folder as a single unit
* File paths are relative to the project, not your machine
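
A minimal sketch of what relative paths buy you, using base R only. The folder `my-project` and the file `data/iris.csv` are hypothetical and are created in a temporary directory so the example is self-contained. In a real R Project, opening the .Rproj file sets the working directory for you, so the `setwd()` calls here only mimic that.

```r
# Hypothetical project folder, created in a temp directory for this sketch
project_root <- file.path(tempdir(), "my-project")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Put an example dataset inside the project
write.csv(iris, file.path(project_root, "data", "iris.csv"), row.names = FALSE)

# Mimic opening the Project: work from the project root
old_wd <- setwd(project_root)

# The path is relative to the project, so it works on any machine
iris_data <- read.csv(file.path("data", "iris.csv"))

# Restore the previous working directory
setwd(old_wd)
```

The read step never mentions a user name or drive letter, so the whole folder can be handed to someone else and the code still runs.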
---
class: middle, center
<img src='img/rstudio-project.png' width=80%>
These slides are in an R Project
???
* All the files I need are in one place
* All file paths are relative
* I can recreate this from scratch at any time
* Others can download it and recreate it or build on it
* From RStudio: File > New project
* Doesn't have to be RStudio; just need to have everything you need in one place
---
class: middle
`r emo::ji("package")` Write packages to:
* avoid repeating yourself
* generalise and centralise your functions
* share your work
Get started with:
* the [{usethis} workflow](https://www.hvitfeldt.me/blog/usethis-workflow-for-package-development/) by Emil Hvitfeldt
* the [R Packages book](http://r-pkgs.had.co.nz/) by Hadley Wickham
???
* Write, test and debug your frequently-used functions just once
* The whole organisation and beyond can use the same functions
* Put this under version control so you can always refer to older versions of your package
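
To make 'write your functions just once' concrete, here is a sketch of the kind of small, documented function that could live in a package's `R/` folder. The function `calc_percent()` is hypothetical, invented for this example. The `#'` comments are {roxygen2}-style documentation.

```r
#' Calculate a percentage
#'
#' @param x Numeric vector of counts.
#' @param total Numeric vector of totals (must be non-zero).
#' @param digits Number of decimal places to round to.
#' @return Numeric vector of percentages.
#' @export
calc_percent <- function(x, total, digits = 1) {
  # Fail early on bad inputs rather than returning something misleading
  stopifnot(is.numeric(x), is.numeric(total), all(total != 0))
  round(100 * x / total, digits)
}

calc_percent(c(1, 3), 4)
```

Once a function like this is in a package, everyone calls the same tested version instead of pasting copies between scripts.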
---
class: middle
Two quick RStudio tips:
* [Don't save and restore the Project workspace](https://r4ds.had.co.nz/workflow-projects.html)
<img src='img/rstudio-workspace.png' width=70%>
* Do restart all the time (Cmd/Ctrl+Shift+F10)
???
* Always be running your Project as though you're running it from scratch
* Each time you start up, start afresh
* Always be restarting RStudio to clear all objects
* That'll prevent you relying on objects that exist in your workspace that you may have made on the fly
---
class: inverse, middle
# 2. Report with code
???
* So you've written analytical code to read and manipulate data
* But you're still copy-pasting into documents
* This could go wrong: What if you meant to copy value x into three different places, but you missed one?
* What if the data changes? Do you need to re-run everything from scratch and begin copy-pasting again?
---
class: middle
`r emo::ji("arrow_down")` Use [R Markdown](https://rmarkdown.rstudio.com/) to:
* put code in your reports
* update outputs instantly when code changes
* complete the reproducible data-to-output workflow
???
* From code chunks or inline
* New data? No problem. Re-run the code and the correct outputs are generated.
---
class: middle, center
<img src='img/example-rmd.jpg' width=100%>
???
* Quick overview of an RStudio window
* Left: an R Markdown script (.Rmd)
* Right: viewer output after rendering the .Rmd ('knitting')
* Code gets turned into output
---
class: middle
```{r echo=FALSE}
yr <- 2020
```
Declare a variable
```{r eval=FALSE}
yr <- format(Sys.Date(), "%Y")
```
Use it in R Markdown
```{r eval=FALSE}
> The year is `r yr`.
```
Rendered output
> The year is 2020.
???
* You can write whole 'chunks' of code and also render them inline
* Here's how an inline render might look
---
class: middle
Use R Markdown for lots of output types:
* [{xaringan}](https://slides.yihui.org/xaringan/#1) for slides
* [{bookdown}](https://bookdown.org/) for books
* [{blogdown}](https://bookdown.org/yihui/blogdown/) for blogs
* [{flexdashboard}](https://rmarkdown.rstudio.com/flexdashboard/index.html) for dashboards
* [{pagedown}](https://pagedown.rbind.io/) for paged HTML documents
* [{thesisdown}](https://github.com/ismayc/thesisdown) for theses
* [{govdown}](https://ukgovdatascience.github.io/govdown/) for websites with GOV.UK design
Check out the [R Markdown book](https://bookdown.org/yihui/rmarkdown/)
???
* {govdown} created by our very own Duncan G at GDS
---
class: inverse, middle
# 3. Manage workflows
???
* You've got reproducible steps from ingestion to outputs, but this may involve lots of files, functions and objects
* You need to manage the workflow itself, i.e. the dependencies between all the files, functions and objects
* This might be something you haven't thought of before
* You might be familiar with Make
* {drake} is an R-specific implementation of that approach
---
class: middle
`r emo::ji("duck")` {drake} by [Will Landau](https://twitter.com/wmlandau)
* makes your analysis pipeline reproducible
* remembers your workflow for you
* re-runs only what needs to be re-run
Learn more:
* on the [website](https://docs.ropensci.org/drake/)
* in the [user manual](https://books.ropensci.org/drake/)
* from this [recorded rOpenSci call](https://ropensci.org/commcalls/2019-09-24/)
???
* You can't remember all the parts of your large analysis and how they fit together
* Do you have to re-run everything from scratch if you change something about the analysis?
* {drake} is a brain that remembers all the relationships and the order in which things need to be run
* It only re-runs what needs to be re-run, saving you time and brainpower
---
class: middle
.pull-left[<img src='img/drake-tweet.jpeg' height=450>]
A complicated workflow by [Frederik Aust](https://twitter.com/FrederikAust/status/1205103780938833921?s=20)
Each point is an object, function or data set
Can you remember all of this?
???
* Here's a complicated workflow with lots of inputs
* Many objects (circles) and functions (triangles)
* If something changes, {drake} re-runs only what needs updating
* Saves computation and time
---
class: middle
{drake} example adapted from [Kirill Müller](https://krlmlr.github.io/drake-pitch/#1)
```{r eval=FALSE}
library(drake)
library(tidyverse)
# Create your own functions
create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram()
}
# Create a workflow 'plan'
plan <- drake_plan( #<<
  raw_data = readxl::read_xlsx(file_in("raw-data.xlsx")),
  data = raw_data %>% mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.pdf"),
    quiet = TRUE
  )
)
```
???
* This is a very simplified example of how you set up a {drake} workflow
* Load packages, create your functions if required
* Wrap workflow steps in `drake_plan()`
---
class: middle
The plan is a dataframe of targets and commands
```{r eval=FALSE}
plan
## # A tibble: 5 x 2
##   target   command
##   <chr>    <expr>
## 1 raw_data readxl::read_xlsx(file_in("raw-data.xlsx")) …
## 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Sp…
## 3 hist     create_plot(data) …
## 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data) …
## 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file…
```
???
* So the steps have been recorded (in order)
* We have targets (objects) and commands (functions that produce the targets)
* Like a recipe
* They'll be executed in order
* {drake} is aware of which targets are dependent on each other
---
class: middle
Make the plan to generate the targets
```{r eval=FALSE}
make(plan)
## target raw_data
## target data
## target fit
## target hist
## target report
```
???
* `make()` executes everything in order
* It prints the targets that have been created
* If you change something, it will only recreate targets that are dependent on that change
* So next time you `make()`, only those targets will be recreated
* Everything else is stored in a special .drake/ cache ({drake}'s 'brain' for your project)
* You can access these objects with `loadd()` and `readd()` if you need them
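
A minimal sketch of inspecting the cache after `make()`. It assumes {drake} is installed and is a recent 7.x release in which `outdated()` accepts the plan directly. It uses a cut-down plan built on the built-in `iris` data, so it needs no input files. Running it creates a .drake/ cache folder in the working directory.

```r
library(drake)

# A cut-down plan with no file inputs (iris stands in for the raw data)
plan <- drake_plan(
  data = iris,
  fit = lm(Sepal.Width ~ Petal.Width + Species, data)
)

make(plan)         # builds both targets the first time
outdated(plan)     # lists targets that would be rebuilt by the next make()

fit <- readd(fit)  # return a target's value from the cache
loadd(data)        # load a target into the session under its own name
```

`readd()` returns the value so you can assign it to any name, while `loadd()` creates an object named after the target.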
---
class: middle
Update the 'data' target and {drake} will:
* re-run out-of-date downstream targets (black)
* leave everything upstream alone
<img src='img/drake-viz.png' width=800%>
???
* Here's a {drake} visualisation to show the dependencies in the simple example
* Imagine we updated the data target so everything downstream is out of date
* Re-running `make()` now will regenerate everything in black
---
class: inverse, middle
# It doesn't end here
???
* It's impossible to cover everything adequately here
* There are some great free materials out there
---
class: middle
Also consider:
* [{renv}](https://rstudio.github.io/renv/articles/renv.html) for [dependency management](https://ukgovdatascience.github.io/rap-website/article-dependency-and-reproducibility.html)
* [{here}](https://here.r-lib.org/) for [relative file paths](https://github.com/jennybc/here_here)
* [Git and GitHub](https://speakerdeck.com/alicebartlett/git-for-humans) for version control
* [Docker](https://ropenscilabs.github.io/r-docker-tutorial/) to fully contain files/code/environment
???
* Dependency managers make sure breaking changes to packages don't affect you
* users/matt/project/data/ doesn't exist on your machine; you need relative paths like project/data/
* Version control lets you roll back changes and experiment safely
* Docker takes the idea of a project folder and makes it even better; it contains the package versions and operating environment (i.e. R version) you need to create everything from scratch
---
class: middle
Some things to look at:
* [Reproducible Analytical Pipelines](https://ukgovdatascience.github.io/rap-website/) (RAP) and the #rap-collaboration channel [on Slack](https://govdatascience.slack.com)
* [Putting the R into reproducible research](https://annakrystalli.me/talks/r-in-repro-research-dc.html#1) by Anna Krystalli
* [The Turing Way](https://the-turing-way.netlify.com/introduction/introduction.html) by The Alan Turing Institute
* [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/) from Software Carpentry
---
class: middle
I've written about reproducibility-related things:
* [Build an R package with {usethis}](https://www.rostrum.blog/2019/11/01/usethis/)
* [Git going: Git and GitHub](https://www.rostrum.blog/2019/10/21/git-github/)
* [Can {drake} RAP?](https://www.rostrum.blog/2019/07/23/can-drake-rap/)
* [A GitHub repo template for R analysis](https://www.rostrum.blog/2019/06/11/r-repo-template/)
* [Knitting Club: R Markdown for beginners](https://www.rostrum.blog/2018/09/24/knitting-club/)
---
class: inverse, middle
# Reproducibility in R: three things
1. Centralise everything
2. Report with code
3. Manage workflows
`r icon::fa('twitter')` [mattdray](https://twitter.com/mattdray)
`r icon::fa('github')` [matt-dray](https://github.com/matt-dray)
`r icon::fa('globe')` [rostrum.blog](https://www.rostrum.blog/)