forked from cengel/R-data-wrangling
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-dplyr.Rmd
511 lines (374 loc) · 20.4 KB
/
01-dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
```{r, echo=FALSE, purl=FALSE, message = FALSE}
library(lubridate)
knitr::opts_chunk$set(results='hide', comment = "#>", purl = FALSE)
```
# Data Manipulation using **`dplyr`**{#dplyr}
> Learning Objectives
>
> * Select columns in a data frame with the **`dplyr`** function `select`.
> * Select rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
> * Direct the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
> * Add new columns to a data frame that are functions of existing columns with `mutate`.
> * Understand the split-apply-combine concept for data analysis.
> * Use `summarize`, `group_by`, and `tally` to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
------------
We will be working a small subset of the data from the [Stanford Open Policing Project](https://openpolicing.stanford.edu). It contains information about traffic stops for blacks and whites in the state of Mississippi during January 2013 to mid-July of 2016.
Let's begin with loading our sample data into a data frame.
```{r}
trafficstops <- read.csv("data/MS_trafficstops_bw_age.csv")
```
Manipulation of dataframes is a common task when you start exploring your data. We might select certain observations (rows) or variables (columns), group the data by a certain variable(s), or calculate summary statistics.
If we were interested in the mean age of the driver in different counties we can do this using the normal base R operations:
```{r, results='show'}
mean(trafficstops[trafficstops$county_name == "Clay County", "driver_age"],
na.rm = TRUE)
mean(trafficstops[trafficstops$county_name == "Lee County", "driver_age"],
na.rm = TRUE)
mean(trafficstops[trafficstops$county_name == "Yazoo County", "driver_age"],
na.rm = TRUE)
```
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Furthermore, there is a fair amount of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
**`dplyr`** is a package for making tabular data manipulation easier.
> Brief recap:
> Packages in R are sets of additional functions that let you do more stuff. Functions like `str()` or `data.frame()`, come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.
If you haven't, please installe the **`tidyverse`** package.
```{r, eval=FALSE, purl = FALSE}
install.packages("tidyverse")
```
**`tidyverse`** is an "umbrella-package" that installs a series of packages useful for data analysis which work together well. Some of them are considered **core** packages (among them **`tidyr`**, **`dplyr`**, **`ggplot2`**), because you are likely to use them in almost every analysis. Other packages, like `lubridate` (to work wiht dates) or `haven` (for SPSS, Stata, and SAS data) that you are likely to use not for every analysis are also installed.
If you type the following command, it will load the **core** `tidyverse` packages.
```{r, message = FALSE, purl = FALSE}
library("tidyverse") ## load the core tidyverse packages, incl. dplyr
```
If you need to use functions from `tidyverse` packages other than the core packages, you will need to load them separately.
## What is **`dplyr`**?
**`dplyr`** is one part of a larger **`tidyverse`** that enables you to work
with data in tidy data formats. "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table." (From Wickham, H. (2014): Tidy Data https://www.jstatsoft.org/article/view/v059i10)
The package **`dplyr`** provides convenient tools for the most common data manipulation
tasks. It is built to work directly with data frames, with many common tasks
optimized by being written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The benefits of
doing this are that the data can be managed natively in a relational database,
queries can be conducted on that database, and only the results of the query are
returned.
This addresses a common problem with R in that all operations are conducted
in-memory and thus the amount of data you can work with is limited by available
memory. The database connections essentially remove that limitation in that you
can have a database of many 100s GB, conduct queries on it directly, and pull
back into R only what you need for analysis.
To learn more about **`dplyr`** after the workshop, you may want to check out the [handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf).
## Subsetting columns and rows
To select columns of a
data frame with `dplyr`, use `select()`. The first argument to this function is the data
frame (`trafficstops`), and the subsequent arguments are the columns to keep.
```{r, eval=FALSE, purl = FALSE}
select(trafficstops, police_department, officer_id, driver_race)
```
```{r, results = 'show', purl = FALSE, echo=FALSE}
select(trafficstops, police_department, officer_id, driver_race) %>% head()
```
It is worth knowing that `dplyr` comes with a number of ["select helpers"](https://www.rdocumentation.org/packages/dplyr/versions/0.7.2/topics/select_helpers), which are functions that allow you to select columns based on their names. For example:
```{r, eval=FALSE, purl = FALSE}
select(trafficstops, starts_with("driver"))
```
```{r, results = 'show', purl = FALSE, echo=FALSE}
select(trafficstops, starts_with("driver")) %>% head()
```
To choose rows based on specific criteria, use `filter()`:
```{r, eval=FALSE, purl = FALSE}
filter(trafficstops, county_name == "Yazoo County")
```
```{r, results='show', purl = FALSE, echo=FALSE}
filter(trafficstops, county_name == "Yazoo County") %>% head()
```
Here are some other ways to select rows:
- select certain rows by row number: `slice(trafficstops, 1:3) # rows 1-3`
- select random rows:
- `sample_n(trafficstops, 5) # number of rows to select`
- `sample_frac(trafficstops, .01) # fraction of rows to select`
To sort rows by variables use the `arrange` function: `arrange(trafficstops, county_name, stop_date)`
```{r, results='show', echo=FALSE}
trafficstops %>%
arrange(county_name, stop_date) %>%
head()
```
## Pipes
What if you wanted to filter **and** select on the same data? For example, lets find drivers over 85 years and only keep the violation and gender columns. There are three ways to do this: use intermediate steps, nested functions, or pipes.
* Intermediate steps:
With intermediate steps, you essentially create a temporary data frame and use
that as input to the next function. This can clutter up your workspace with lots
of objects.
```{r, eval=FALSE}
tmp_df <- filter(trafficstops, driver_age > 85)
select(tmp_df, violation_raw, driver_gender)
```
* Nested functions
You can also nest functions (i.e. one function inside of another).
This is handy, but can be difficult to read if too many functions are nested as things are evaluated from the inside out.
```{r, eval=FALSE}
select(filter(trafficstops, driver_age > 85), violation_raw, driver_gender)
```
* Pipes!
The last option, pipes, are a fairly recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset. Pipes in R look like
`%>%` and are made available via the `magrittr` package, installed automatically
with **`dplyr`**. If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd>
+ <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or <kbd>Cmd</kbd> +
<kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac.
```{r, eval=FALSE, purl = FALSE}
trafficstops %>%
filter(driver_age > 85) %>%
select(violation_raw, driver_gender)
```
In the above, we use the pipe to send the `trafficstops` dataset first through
`filter()` to keep rows where `driver_race` is Black, then through `select()`
to keep only the `officer_id` and `stop_date` columns. Since `%>%` takes
the object on its left and passes it as the first argument to the function on
its right, we don't need to explicitly include it as an argument to the
`filter()` and `select()` functions anymore.
If we wanted to create a new object with this smaller version of the data, we
could do so by assigning it a new name:
```{r, results='show', purl = FALSE}
senior_drivers <- trafficstops %>%
filter(driver_age > 85) %>%
select(violation_raw, driver_gender, driver_race)
senior_drivers
```
Note that the final data frame is the leftmost part of this expression.
> <h3>Challenge</h3>
>
> Using pipes, subset the `trafficstops` data to include stops in Tunica County only and retain the columns `stop_date`, `driver_age`, and `violation_raw`. Bonus: sort the table by driver age.
<!---
```{r, eval=FALSE, purl=FALSE}
## Answer
trafficstops %>%
filter(county_name == "Tunica County") %>%
select(stop_date, driver_age, violation) %>%
arrange(driver_age)
```
--->
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Pipes Challenge:
## Using pipes, subset the `trafficstops` data to include stops in
## Tunica County only and retain the columns `stop_date`, `driver_age`,
## and `violation_raw`. Bonus: sort the table by driver age.
```
## Add new columns
Frequently you'll want to create new columns based on the values in existing columns. For this we'll use `mutate()`.
To create a new column with the year the driver was born we can extract the first 4 elements of the string that represents the `driver_birthdate` and add it to the data frame like this:
```{r, eval=F, purl = FALSE}
trafficstops %>%
mutate(birth_year = substring(driver_birthdate, 1, 4))
```
The new and edited columns will not permanently be added to the existing data frame -- unless we explicitly save the output.
We can als use `mutate` to reassign values to an existing column. So here is an alternative to do the same as above, but here we first will use the `lubridate` library, which is installed with `tidyverse` to convert our string to an actual date format. We will then use the `year()` function to extract the year. All of this can happen in the same call to `mutate()`.
```{r, eval=F, purl = FALSE}
library(lubridate)
trafficstops %>%
mutate(birth_date = ymd(driver_birthdate),
birth_year = year(driver_birthdate))
```
If this runs off your screen and you just want to see the first few rows, you
can use a pipe to view the `head()` of the data. (Pipes work with non-**`dplyr`**
functions, too, as long as the **`dplyr`** or `magrittr` package is loaded). When piping into a function with no additional arguments, you can call the
function with or without parentheses (e.g. `head` or `head()`). (I like to add the parentheses to remind myself that it is a function and not a variable.)
```{r, eval=F, purl = FALSE}
trafficstops %>%
mutate(birth_date = ymd(driver_birthdate),
birth_year = year(driver_birthdate)) %>%
head()
```
We can keep adding columns like this:
```{r, eval=F, purl = FALSE}
trafficstops %>%
mutate(birth_date = ymd(driver_birthdate),
birth_year = year(driver_birthdate),
birth_cohort = round(birth_year/10)*10) %>%
head()
```
We are beginning to see the power of piping. Here is a slightly expanded example, where we select the column `birth_cohort` that we have created and send it to plot:
```{r driver-birth-cohorts, results='show', purl = FALSE, fig.cap='Driver Birth Cohorts'}
trafficstops %>%
mutate(birth_date = ymd(driver_birthdate),
birth_year = year(driver_birthdate),
birth_cohort = round(birth_year/10)*10,
birth_cohort = factor(birth_cohort)) %>%
select(birth_cohort) %>%
plot()
```
> <h3>Challenge</h3>
>
> Create a new data frame from the `trafficstops` data that meets the following
> criteria: contains only the `violation_raw` column for female drivers of age 50 that were stopped on a Sunday. For this add a new column to your data frame called
> `weekday_of_stop` containing the number of the weekday when the stop occurred. Use the `wday()` function from `lubridate` (Sunday = 1).
>
> Think about how the commands should be ordered to produce this data frame!
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Mutate Challenge:
## Create a new data frame from the `trafficstops` data that meets the following
## criteria: contains only the `violation_raw` column for female drivers of age 50 that were stopped on a Sunday. For this add a new column to your data frame called
## `weekday_of_stop` containing the number of the weekday when the stop occurred. Use the `wday()` function from `lubridate` (Sunday = 1).
## Think about how the commands should be ordered to produce this data frame!
```
<!---
```{r, eval=FALSE, purl=FALSE}
## Answer
trafficstops %>%
filter(driver_age == 50 & driver_gender == "female") %>%
mutate(wds = wday(ymd(stop_date))) %>%
select(violation_raw, wds) %>%
filter(wds == 1)
```
--->
## What is split-apply-combine?
Many data analysis tasks can be approached using the *split-apply-combine*
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results.
```{r split-apply-combine, echo=FALSE, results='asis', out.width='\\textwidth', fig.cap='Split - Apply - Combine'}
knitr::include_graphics('img/split-apply-combine.png')
```
**`dplyr`** makes this very easy through the use of the
`group_by()` function.
`group_by()` is often used together with `summarize()`, which collapses each
group into a single-row summary of that group. `group_by()` takes as arguments
the column names that contain the **categorical** variables for which you want
to calculate the summary statistics. So to view the mean age for black and white drivers:
```{r, results='show', purl = FALSE}
trafficstops %>%
group_by(driver_race) %>%
summarize(mean_age = mean(driver_age, na.rm=TRUE))
```
So, back to the beginning of the chapter, where we tried to calculate the mean age of the driver for different counties, how would we do this? Like this:
```{r, results='show', purl = FALSE, eval=FALSE}
trafficstops %>%
group_by(county_name) %>%
summarize(mean_age = mean(driver_age, na.rm=TRUE))
```
You can also group by multiple columns:
```{r, results='show', purl = FALSE}
trafficstops %>%
group_by(county_name, driver_race) %>%
summarize(mean_age = mean(driver_age, na.rm=TRUE))
```
If we wanted to remove the line with `NA` we could insert a `filter()` in the chain:
```{r, results='show', purl = FALSE}
trafficstops %>%
filter(!is.na(driver_race)) %>%
group_by(county_name, driver_race) %>%
summarize(mean_age = mean(driver_age, na.rm=TRUE))
```
Recall that `is.na()` is a function that determines whether something is an `NA`. The `!` symbol negates the result, so we’re asking for everything that is _not_ an `NA`.
You may have noticed that the output from these calls looks a little different. That's because **`dplyr`** has changed our `data.frame` object
to an object of class `tbl_df`, also known as a "tibble". Tibble's data
structure is very similar to a data frame. For our purposes the only differences
are that (1) columns of class `character` are never converted into
factors, and (2) in addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen. If we wanted to print all columns we can use the print command, and set the `width` parameter to `Inf`. To print the first 6 rows for example we would do this: `print(my_tibble, n=6, width=Inf)`.
Once the data are grouped, you can also summarize multiple variables at the same
time (and not necessarily on the same variable). For instance, we could add a
column indicating the minimum age in each group (i.e. county):
```{r, results='show', purl = FALSE}
trafficstops %>%
filter(!is.na(driver_race)) %>%
group_by(county_name, driver_race) %>%
summarize(mean_age = mean(driver_age, na.rm=TRUE),
min_age = min(driver_age, na.rm=TRUE))
```
## Tallying
When working with data, it is also common to want to know the number of
observations found for each factor or combination of factors. For this, **`dplyr`**
provides `tally()`. For example, if we wanted to see how many traffic stops each officer recorded we would do:
```{r, eval=F, purl = FALSE}
trafficstops %>%
group_by(officer_id) %>%
tally()
```
We can optionally sort the results in descending order by adding `sort=TRUE`:
```{r, eval=F, purl = FALSE}
trafficstops %>%
group_by(officer_id) %>%
tally(sort=TRUE)
```
Here, `tally()` is the action applied to the groups created by `group_by()` and counts the total number of records for each category.
Alternatives:
```{r}
trafficstops %>%
group_by(officer_id) %>%
summarize(n = n()) # n() is useful when count is needed for a calculation
trafficstops %>%
count(officer_id) # count() calls group_by automatically, then tallies
```
> <h3>Challenge</h3>
>
> Which 5 counties were the ones with the most stops in 2013?
> Hint: use the year() function from lubridate.
<!---
```{r, eval=F, echo=FALSE, purl=FALSE}
## Answer 1
library(lubridate)
trafficstops %>%
mutate (stopyear = year(ymd(stop_date))) %>%
filter(stopyear == 2013) %>%
count(county_name, sort=TRUE) %>%
head(5)
```
--->
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Tally Challenges:
## Which 5 counties were the ones with the most stops in 2013?
## Hint: use the year() function from lubridate.
```
## Joining two tables
<<<<<<< HEAD
It is not uncommon that we have our data spread out in different tables and need to bring those together for analysis. In this example we will combine the numbers of trafficstops for black and white drivers per county together with the numbers of the black and white total population for these counties. The population data are the estimated values of the 5 year average from the 2011-2015 American Community Survey (ACS):
=======
It is not uncommon that we have our data spread out in different tables and need to bring those together for analysis. In this example we will combine the numbers of trafficstops for black and white drivers per county together with the numbers of the black and white total population for these counties. Toe population data are the estimated values of the 5 year average from the 2011-2015 American Community Survey (ACS):
>>>>>>> 1aab4b954ad0f82372da78447cb5bc64fdc759fe
```{r, results='show'}
MS_bw_pop <- read.csv("data/MS_acs2015_bw.csv")
head(MS_bw_pop)
```
In a first step we will use a prevous `dplyr` command to count all the trafficstops per county.
```{r, results='show'}
trafficstops %>%
group_by(county_name) %>%
summarise(n_stops = n())
```
We will then pipe this into our next operation where we bring the two tables together. We will use `left_join`, which returns all rows from the left table, and all columns from the left and the right table. As unique ID, which uniquely identifies the corresponding records in each table we use the County Names.
```{r, results='show'}
trafficstops %>%
group_by(county_name) %>%
summarise(n_stops = n()) %>%
left_join(MS_bw_pop, by = c("county_name" = "County")) %>%
head()
```
Now we can, for example calculate the percentage of the population that gets stopped in each county.
> <h3>Challenge</h3>
>
> Which county has the highest (lowest) percentage of stopped drivers?
> Use the snippet from above and pipe into the additional operations
> to do this.
<!---
```{r, eval=F, echo=FALSE, purl=FALSE}
## Answer
trafficstops %>%
group_by(county_name) %>%
summarise(n_stops = n()) %>%
left_join(MS_bw_pop, by = c("county_name" = "County")) %>%
mutate(pct_stopped = n_stops/bw_pop * 100) %>%
filter(pct_stopped %in% range(pc_stopped))
```
--->
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Join Challenge
## In which county has the highest percentage of stopped drivers?
## Use the snippet from above and pipe into the additional operations
## to do this.
```
`dplyr` join functions are generally equivalent `merge` from the base command, but there are a few advantages:
* rows are kept in existing order
* much faster
* tells you what keys you're merging by (if you don't supply)
* also work with database tables.
https://groups.google.com/d/msg/manipulatr/OuAPC4VyfIc/Qnt8mDfq0WwJ
See `?dplyr::join` for all the possible joins.