forked from cengel/R-data-wrangling
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-tidyr.Rmd
232 lines (170 loc) · 10 KB
/
02-tidyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
```{r, echo=FALSE, purl=FALSE, message = FALSE}
library(tidyverse)
library(xtable)
options(xtable.comment = FALSE)
knitr::opts_chunk$set(results='hide', comment = "#>", purl = FALSE)
trafficstops <- read.csv("data/MS_trafficstops_bw_age.csv")
```
# Data Manipulation using **`tidyr`**{#tidyr}
> Learning Objectives
>
> * Understand the concept of a wide and a long table format and for which purpose those formats are useful.
> * Understand what key-value pairs are.
> * Reshape a data frame from long to wide format and back with the `pivot_wider` and `pivot_longer` commands from the **`tidyr`** package.
> * Export a data frame to a .csv file.
------------
`dplyr` pairs nicely with **`tidyr`** which enables you to swiftly convert between different data formats for plotting and analysis.
The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per observation. Sometimes we want a data frame where each observation type has its own column, and rows are instead more aggregated groups - like surveys, where each column represents an answer. Moving back and forth between these formats is nontrivial, and **`tidyr`** gives you tools for this and more sophisticated data manipulation.
To learn more about **`tidyr`** after the workshop, you may want to check out this [cheatsheet about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).
## About long and wide table format
The 'long' format is where:
- each column is a variable
- each row is an observation
In the 'long' format, you usually have 1 column for the observed variable and
the other columns are ID variables.
For the 'wide' format a row, for example could be a reserach subject for which you have multiple observation variables containing the same type of data, for example responses to a set of survey questions, or repeated observations over time, or a mix of both. Here is an example:
```{r wide-exmpl, results='asis', echo=FALSE}
exmpl_obs <- data.frame(subject_ID = LETTERS[1:3],
question_1 = c(4, 4, 2),
question_2 = c(3, 1, 5),
question_3 = c(4, 5, 2))
xtable(exmpl_obs)
```
You may find data input may be simpler or some other
applications may prefer the 'wide' format. However, many of `R`'s functions have
been designed assuming you have 'long' format data. This tutorial will help you
efficiently transform your data regardless of original format.
```{r wide-vs-long, echo=FALSE, results='asis', out.width='30%', fig.cap='Wide vs. Long Table Format', eval=TRUE}
knitr::include_graphics('img/wide-vs-long.png')
```
The choice of data format affects readability. For humans, the wide format is often more intuitive, since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The `ID` variables in our dataframes are similar to the fields in a database and observed variables are like the database values.
> Challenge 1
>
> Is trafficstops in a long or wide format?
>
## Long to Wide with `pivot_wider`
Now let's see this in action. First, using **`dplyr`**, let's create a data frame
with the mean age of each driver by gender and county:
```{r, results='show', purl=FALSE}
trafficstops_ma <- trafficstops %>%
filter(!is.na(driver_gender)) %>%
group_by(county_name, driver_gender) %>%
summarize(mean_age = mean(driver_age, na.rm = TRUE))
trafficstops_ma
```
Now, to make this long data wide, we use `pivot_wider` from `tidyr` to turn the driver gender into columns. In addition to our data table we provide `pivot_wider` with two arguments: `names_from` describes which column to use for name of the output column, and `values_from` tells it from column to get the cell values. We'll use a pipe so we can ignore the data argument.
```{r, results='show', purl=FALSE}
trafficstops_ma_wide <- trafficstops_ma %>%
pivot_wider(names_from = driver_gender,
values_from = mean_age)
trafficstops_ma_wide
```
We can now do things like compare the mean age of men against women drivers. As example we use the age difference to find the counties with the largest and with the smallest number. (A negative number means that female drivers are on average older than male drivers, a positive number means that male drivers are on average older than women drivers.)
```{r, results='show', purl=FALSE}
trafficstops_ma_wide %>%
mutate(agediff = male - female) %>%
ungroup() %>%
filter(agediff %in% range(agediff))
```
Note that `trafficstops_ma_wide` is derived from `trafficstops_ma`, and is a "grouped" data frame, which was created with the `group_by` function above. (Check `class(trafficstops_ma)` and `class(trafficstops_ma_wide)`). That means that any instruction that follows will operate on each group (in this case county) separately. That may be ok for some instances (like `mutate`), but if we are interested in retrieving the maximum and the minumim age difference over *all* counties we need to `ungroup` the tibble to have the `filter` command operate on the entire dataset instead of each group (i.e. county).
## Wide to long with `pivot_longer`
What if we had the opposite problem, and wanted to go from a wide to long
format? For that, we use `pivot_longer`, which will increase the number of rows and decrease the number of columns. We provide the functino with thee arguments: `cols` which are the columns we want to pivot into the long format, `names_to`, which is a string specifying the name of the column to create from the data stored in the column names, and `values_to`, which is also a string, specifying the name of the column to create from the data stored in cell values.
So, to go backwards from `trafficstops_ma_wide`, and exclude `county_name` from the long, we would do the following:
```{r, results='show', purl=FALSE}
trafficstops_ma_long <- trafficstops_ma_wide %>%
pivot_longer(cols = -county_name,
names_to = "driver_gender", # name is a string!
values_to = "mean_age") # also a string
trafficstops_ma_long
```
We could also have used a specification for what columns to include. This can be
useful if you have a large number of identifying columns, and it's easier to
specify what to gather than what to leave alone. And if the columns are adjacent to each other, we don't even need to list them all out -- we can use the `:` operator!
```{r, results='show', purl=FALSE}
trafficstops_ma_wide %>%
pivot_longer(cols = male:female, # this also works
names_to = "driver_gender",
values_to = "mean_age")
```
There are many powerful operations you can do with the `pivot_*` functions. To learn more review the vignette:
```{r, results='show', eval=FALSE}
vignette("pivot")
```
> <h3>Challenge</h3>
>
> 1.From the trafficstops dataframe create a wide data frame `tr_wide` with
> "year" as columns, each row is a different violation ("violation_raw"),
> and the values are the
> number of traffic stops per each violation, roughly like this:
>
>` violation_raw | 2013 | 2014 | 2015 ...`
> `Improper turn | 65 | 54| 67 ...`
> `Speeding | 713 | 948| 978 ...`
> `...`
>
> Use `year()` from the lubridate package. Hint: You will need to summarize
> and count the traffic stops before reshaping the table.
> 2. Now take the data frame, and make it long again, so each row is a
> unique violation_raw - year combination, like this:
>
> `violation_raw | year | n of stops`
> `Improper turn | 2013 | 65`
> `Improper turn | 2014 | 54`
> `... etc`
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Reshaping challenges
## From the trafficstops dataframe create a wide data frame `tr_wide`
## with "year" as columns, with "violation_raw" as rows,
## and the values are the number of traffic stops per each violation,
## roughly like this:
## violation_raw | 2013 | 2014 | 2015 ...
## Improper turn | 65 | 54| 67 ...
## Speeding | 713 | 948| 978 ...
## ...
## Use year() from the lubridate package. Hint: You will need to summarize
## and count the traffic stops before reshaping the table.
## Now take the data frame, and make it long again, so each row is a
## unique violation_raw - year combination, like this:
## violation_raw | year | n of stops
## Improper turn | 2013 | 65
## Improper turn | 2014 | 54
## ... etc
```
<!---
```{r, echo=FALSE, purl=FALSE}
## Answer 1
library(lubridate)
tr_wide <- trafficstops %>%
mutate(stop_date = ymd(stop_date),
year = year(stop_date)) %>%
group_by(violation_raw, year) %>%
summarize(n_stops = n()) %>%
pivot_wider(names_from = year, values_from = n_stops)
head(tr_wide)
## Answer 2
tr_wide %>%
pivot_longer(cols = -violation_raw, names_to = "year", values_to = "n_stops")
```
--->
## Exporting data
Similar to the `read.csv()` function used for reading CSV files into R, there is a `write.csv()` function that generates CSV files from data frames.
Before using `write.csv()`, we are going to create a new folder, `data_output`,
in our working directory that will store this generated dataset. We don't want
to write generated datasets in the same directory as our raw data. It's good
practice to keep them separate. The `data` folder should only contain the raw,
unaltered data, and should be left alone to make sure we don't delete or modify
it. In contrast, our script will generate the contents of the `data_output`
directory, so even if the files it contains are deleted, we can always
re-generate them.
We can save the table generated above in our `data_output`
folder. By default, `write.csv()` includes a column with row names (in our case
these names are the row numbers), so we need to add `row.names = FALSE` so
they are not included:
```{r, purl=FALSE, eval=FALSE}
write.csv(trafficstops_ma_wide, "data_output/MS_county_stops.csv", row.names = F)
```
```{r, purl=FALSE, echo=FALSE}
if (!file.exists("data_output")) dir.create("data_output")
write.csv(trafficstops_ma_wide, "data_output/MS_county_stops.csv", row.names = F)
```