-
Notifications
You must be signed in to change notification settings - Fork 25
/
08-workshop-joins.Rmd
271 lines (174 loc) · 5.3 KB
/
08-workshop-joins.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
---
title: "Relational data"
subtitle: "Session 8"
institute: "NHS-R Community"
output:
xaringan::moon_reader:
css:
- default
- css/nhsr.css
- css/nhsr-fonts.css
lib_dir: libs
seal: false
self_contained: true
nature:
highlightStyle: googlecode
highlightLines: true
highlightLanguage: ["r"]
countIncrementalSlides: false
ratio: "16:9"
includes:
after_body: [css/insert-logo.html]
---
```{r libs, include=FALSE}
library(knitr)
library(magick)
library(tidyverse)
library(xaringan)
library(kableExtra)
library(icons)
library(xaringanExtra)
xaringanExtra::use_panelset()
xaringanExtra::use_clipboard()
xaringanExtra::use_share_again() # need to get the slide button on html view
opts_chunk$set(
echo = TRUE,
eval = FALSE,
message = FALSE,
warning = FALSE,
fig.width = 7.252,
fig.height = 4,
dpi = 300,
dev.args = list(type = "cairo")
)
# load data----------------------------------
tb_cases <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_cases.csv"))
tb_pop <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_pop.csv"))
tb_new_table <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_new_table.csv"))
```
class: title-slide, left, bottom
# `r rmarkdown::metadata$title`
----
## **`r rmarkdown::metadata$subtitle`**
### `r rmarkdown::metadata$author`
### `r rmarkdown::metadata$date`
.right-column[
]
---
# Relational data
It's rare to find all the data you need for an analysis in a single table.
</br> Typically, you'll have to link two (or more) tables together by matching on common "key" variable(s).
.green[We use joins in SQL or R, VLOOKUPs in Excel]
---
# Relational data
Here, we'll focus on left (outer) joins.
</br> The syntax is similar for other types of joins.
---
# left_join()
Keep structure of table x
…and match to observations in table y
"key" variable (common to both tables)
```{r}
x %>%
left_join(y, by = "id")
```
<img class="center" src="img/session08/left-join.gif" width="40%"/>
source: https://github.com/gadenbuie/tidyexplain
---
# Relational Data
We're going to join two tables one with cases of tuberculosis by country, one with population by country.
</br> From this new table we could derive a rate.
---
class: center, middle
# Please Import
.blue[tb_cases.csv]
.blue[tb_pop.csv]
and
.blue[tb_new_table.csv]
---
# left_join
Keep the original structure of the tb_cases data frame
…then match to rows in tb_pop
based on "country" value
```{r population}
tb_cases %>%
left_join(tb_pop, by = "country")
```
---
# Duplicates!
For every country value in tb_cases, there are 4 in tb_pop
```{r ref.label="population", echo=FALSE, eval=TRUE}
```
---
# Join on multiple rows
match on two (or more) variables
```{r }
tb_cases %>%
left_join(tb_pop, by = c("country", "year"))
```
--
</br>
</br>
### You may recall from an earlier slide
You can create a vector with function: c() for concatenate/combine
```{r}
c(100, 80, 200)
c("beds", "staff", "patients")
```
---
# Default combine
Dplyr join verbs find the same named columns and list in the output what it has joined on ## Joining, by = c("country", "year")
This can be copied from the console
```{r eval=TRUE}
tb_cases %>%
left_join(tb_pop)
```
---
# Joining with different names
If two tables have .blue[different] names for .blue[same] variable:
```{r}
tb_cases %>%
left_join(tb_new_table, by = c("country" = "Place",
"year" = "Yr"))
```
tb_new_table is the same table as tb_cases but with a column Joined.
---
# Other joins
Like SQL, dplyr has left_join, inner_join and full_join.
Joins that are slightly different to SQL are:
--
### [Semi-join](https://stackoverflow.com/questions/42249690/what-is-semi-join-in-database)
All rows from x where there are matching values in y, keeping just columns from x.
<img class="center" src="img/session08/semi-join.gif" width="40%"/>
---
# Use Case
Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.
Join to the tb_new_table and bring back only those records where the column FirstLetter is 'A' but nothing from tb_new_table.
```{r eval=TRUE}
tb_cases %>%
semi_join(tb_new_table %>%
filter(FirstLetter == "A"), by = c("country" = "Place",
"year" = "Yr"))
```
---
### [Anti-join](https://stackoverflow.com/questions/28702960/find-complement-of-a-data-frame-anti-join)
All rows from x where there are not matching values in y, keeping just columns from x.
<img class="center" src="img/session08/anti-join.gif" width="40%"/>
---
# Use Case
In text mining, exclude 'stop words' (but, and, or) which are kept in a separate table for reference.
Join to the tb_new_table and exclude only those records where the column FirstLetter is 'A' .
```{r eval=TRUE}
tb_cases %>%
anti_join(tb_new_table %>%
filter(FirstLetter == "A"), by = c("country" = "Place",
"year" = "Yr"))
```
---
#### This work is licensed as
</br> Creative Commons
</br> Attribution
</br> ShareAlike 4.0
</br> International
</br> To view a copy of this license, visit
</br> https://creativecommons.org/licenses/by/4.0/