-
Notifications
You must be signed in to change notification settings - Fork 4
/
session-joins.qmd
196 lines (140 loc) · 5.5 KB
/
session-joins.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
---
title: "Introduction to R and Rstudio"
subtitle: "Session - joins"
---
## Relational data
```{r}
#| label: "libs"
#| include: false
#| eval: true
#| echo: false
library(dplyr)
library(readr)
library(countdown)
```
```{r}
#| echo: false
#| eval: true
#| label: "load-data"
tb_cases <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_cases.csv"))
tb_pop <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_pop.csv"))
tb_new_table <- read_csv(url("https://raw.githubusercontent.com/nhs-r-community/intro_r_data/main/tb_new_table.csv"))
```
It's rare to find all the data you need for an analysis in a single table.
Typically, you'll have to link two (or more) tables together by matching on common "key" variable(s).
We use joins in SQL or R, (V, H or X) LOOKUPs in Excel
## Joins
All the joins in SQL are available in R called `mutating joins` and a couple more called `filtering joins` which have the same concept as `EXISTS` used in the `WHERE` clause in SQL.
## left_join()
::: columns
::: {.column width="40%"}
Keep structure of table `x`
...and match to observations in table `y`
"key" variable (common to both tables)
```{r}
x |>
left_join(y, by = "id")
```
:::
::: {.column width="60%"}
[![Moving gif that shows two tables and their merge with the data on the left 1, 2 and 3 being added to from the right where it matches](img/session08/left-join.gif){.center}](https://github.com/gadenbuie/tidyexplain)
:::
:::
## Import some files!
We're going to join two tables one with cases of tuberculosis by country, one with population by country.
Import the files:
`tb_cases.csv`
`tb_pop.csv`
`tb_new_table.csv`
```{r}
#| eval: true
#| echo: false
countdown::countdown(minutes = 5,
color_border = "#005EB8",
color_text = "#005EB8",
color_running_text = "white",
color_running_background = "#005EB8",
color_finished_text = "#005EB8",
color_finished_background = "white",
margin = "0.9em",
font_size = "2em")
```
## Join 2 tables together
When columns have the same name R resolves this conflict by adding `.x` and `.y` to the end of the column name
```{r}
#| eval: true
tb_cases |>
left_join(tb_pop, by = "country")
```
## Join on multiple columns
Explicitly joining on more than one column requires the use of a vector
```{r }
tb_cases |>
left_join(tb_pop, by = c("country", "year"))
```
and as of 2023 now:
```{r}
tb_cases |>
left_join(tb_pop, join_by(country, year))
```
## Default combine
{dplyr} joins can occur automatically
```{r}
#| eval = TRUE
tb_cases |>
left_join(tb_pop)
```
## Joining with different names
If two tables have different names for same variable:
```{r}
tb_cases |>
left_join(tb_new_table, by = c("country" = "Place",
"year" = "Year"))
```
as of 2023
```{r}
tb_cases |>
left_join(tb_new_table, by = join_by(country == Place,
year == Year))
```
## semi_join()
::: columns
::: {.column width="40%"}
Semi-joins retain the data on the left if it matches the right, but drops the right (so is not like an inner join that keeps both sides).
All rows from `x` where there are matching values in `y`, keeping just columns from `x`.
:::
::: {.column width="60%"}
[![Moving gif showing the data from the left table being kept if it matches the right but everything is dropped from the right](img/session08/semi-join.gif){.center alt="Moving gif showing the data from the left table being kept if it matches the right but everything is dropped from the right" width="60%"}](https://github.com/gadenbuie/tidyexplain)
:::
:::
## Use Case - hospital Covid tests
Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.
Join to the `tb_new_table` and bring back only those records where the column `FirstLetter` is 'A' but drop the data from `tb_new_table.`
```{r}
# Create a new data frame to join to
lookup_table <- tb_new_table |>
filter(FirstLetter == "A")
tb_cases |>
semi_join(lookup_table, by = join_by(country == Place,
year == Year))
```
## anti_join()
::: columns
::: {.column width="40%"}
All rows from `x` where there are not matching values in `y`, keeping just columns from `x`.
:::
::: {.column width="60%"}
[![Moving gif that only retrains from the left table what doesn't match on the right](img/session08/anti-join.gif){.center alt="Moving gif that only retrains from the left table what doesn't match on the right"}](https://github.com/gadenbuie/tidyexplain)
:::
:::
## Use Case - text mining stop words
In text mining, exclude 'stop words' (but, and, or) which are kept in a separate table for reference.
Join to the `tb_new_table` and exclude only those records where the column `FirstLetter` is 'A' .
```{r}
tb_cases |>
anti_join(lookup_table, by = join_by(country == Place,
year == Year))
```
## End session
Thanks for the animated join gifs to Garrick Aden-Buie in his blog on [tidyexplain](https://www.garrickadenbuie.com/project/tidyexplain/).
A static representation of the joins can be found in the [data wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) shared by Anya Ferguson in the [Intermediate to R](https://github.com/nhs-r-community/intermediate_R_training) NHS-R course.