-
Notifications
You must be signed in to change notification settings - Fork 1
/
ChainingOperationsTogether.Rmd
123 lines (85 loc) · 5.2 KB
/
ChainingOperationsTogether.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
---
title: "Ways of chaining operations together"
output:
html_notebook: default
github_document: default
---
```{r setup, include = TRUE}
library(tidyverse)
articles <- read_csv("data/master.csv")
subjects <- read_csv("data/subjects.csv")
mentions <- read_csv("data/mentions.csv")
articles_with_subjects_and_mentions <- left_join(left_join(subjects, articles), mentions)
```
This notebook shows three different ways of chaining a set of operations together. The example used is one where we count the total number of mentions related to all the articles with the Scopus subject 'Health Sciences'.
Firstly, this means joining together articles, subjects and mentions (as per the setup above). Joining the three results in one very big dataset because one row is being created for every subject, and then each of these rows is being joined to all the mentions for the article in question (see *Making things more efficient* below).
Secondly, we need to perform the following operations in the order below:
1. **filter** the dataset (by Scopus subject 'Health Sciences').
2. **group_by** the dataset by article.
3. **summarise** the groups by total number of records. This should give the total number of mentions for each article.
4. **arrange** the resulting set in descending order of total number of records.
This chain of **filter** > **group_by** > **summarise** > **arrange** is extremely common so it's worth getting used to this example.
## Approach 1: setting explicit variables
The first approach to running the chain of operations is to set explicit variables for each step, then use each new variable in the next step.
```{r Set explicit variables}
filtered_by_subject <- filter(articles_with_subjects_and_mentions,
subject_scheme == "scopus",
subject_name == "Health Sciences")
grouped_by_article <- group_by(filtered_by_subject,
article_title)
summarised_by_total_mentions <- summarise(grouped_by_article,
total_mentions = n())
arranged_by_total_mentions <- arrange(summarised_by_total_mentions,
desc(total_mentions))
```
This has the advantage of making it as obvious as possible what's going on, but the big disadvantage is that it creates one fresh dataset for each operation. This causes two problems: firstly, your environment fills up with lots of different data variables, making things very confusing. Secondly, your computer's memory gets filled up by all these different datasets too, which will slow it down.
Also, using this approach means that the result isn't shown automatically when you preview the Notebook.
## Approach 2: using the Pipe function
This is very similar to the first approach, but instead of explicitly setting dataset variables everywhere, you pipe - %>% - the results of each operation to the next. This means you can get away with only setting one variable (i.e. the final one):
```{r Pipe from one operation to the next}
arranged_by_total_mentions_piped <- filter(articles_with_subjects_and_mentions,
subject_scheme == "scopus",
subject_name == "Health Sciences") %>%
group_by(article_title) %>%
summarise(total_mentions = n()) %>%
arrange(desc(total_mentions))
```
This has the advantage of being neater, because it only creates one variable. However, like Approach 1, it still doesn't display the table of results in the Notebook.
## Approach 3: nest all the operations
The final approach is to nest all the operations together, with the last operation (arrange) being the outermost, and the first (filter) being the innermost (and the only place where the actual input dataset gets mentioned):
```{r Nested operations}
arrange(
summarise(
group_by(
filter(articles_with_subjects_and_mentions,
subject_scheme == "scopus",
subject_name == "Health Sciences"
),
article_title
),
total_mentions = n()
),
desc(total_mentions)
)
```
This approach doesn't create any dataset variables at all, and outputs the result in the Notebook too. However, it is the hardest code to write and follow, because of all the nesting of operations that are going on. Doing it this way forces you to write all the operations in reverse order, and it's harder to get the knack of adding all the commas and closing parentheses properly.
## Making things even more efficient
One way to make all this more efficient is to apply the filter to the incoming datasets first (in this example, to subjects, as that's what we're filtering) before you join them.
```{r Filter subjects first}
filtered_subjects <- filter(subjects,
subject_scheme == "scopus",
subject_name == "Health Sciences")
articles_with_filtered_subjects_and_mentions <- left_join(
left_join(filtered_subjects,
articles),
mentions)
arrange(
summarise(
group_by(articles_with_filtered_subjects_and_mentions,
article_title
),
total_mentions = n()
),
desc(total_mentions)
)
```