02-tidyr.Rmd

```{r, echo=FALSE, purl=FALSE, message = FALSE}
library(tidyverse)
library(xtable)
options(xtable.comment = FALSE)

knitr::opts_chunk$set(results='hide', comment = "#>", purl = FALSE)
trafficstops <- read.csv("data/MS_trafficstops_bw_age.csv")
```

# Data Manipulation using **`tidyr`**{#tidyr}

> Learning Objectives
>
> * Understand the concept of a wide and a long table format and for which purpose those formats are useful.
> * Understand what key-value pairs are.
> * Reshape a data frame from long to wide format and back with the `pivot_wider` and `pivot_longer` commands from the **`tidyr`** package.
> * Export a data frame to a .csv file.

------------

`dplyr` pairs nicely with **`tidyr`** which enables you to swiftly convert between different data formats for plotting and analysis.

The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per observation. Sometimes we want a data frame where each observation type has its own column, and rows are instead more aggregated groups - like surveys, where each column represents an answer. Moving back and forth between these formats is nontrivial, and **`tidyr`** gives you tools for this and more sophisticated  data manipulation.

To learn more about **`tidyr`** after the workshop, you may want to check out this [cheatsheet about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).


## About long and wide table format

The 'long' format is where:

 - each column is a variable
 - each row is an observation

In the 'long' format, you usually have 1 column for the observed variable and
the other columns are ID variables.


For the 'wide' format a row, for example could be a reserach subject for which you have multiple observation variables containing the same type of data, for example responses to a set of survey questions, or repeated observations over time, or a mix of both. Here is an example:

```{r wide-exmpl, results='asis', echo=FALSE}
exmpl_obs <- data.frame(subject_ID = LETTERS[1:3], 
                        question_1 = c(4, 4, 2),
                        question_2 = c(3, 1, 5),
                        question_3 = c(4, 5, 2))
xtable(exmpl_obs)
```


You may find data input may be simpler or some other
applications may prefer the 'wide' format. However, many of `R`'s functions have
been designed assuming you have 'long' format data. This tutorial will help you
efficiently transform your data regardless of original format.

```{r wide-vs-long, echo=FALSE, results='asis', out.width='30%', fig.cap='Wide vs. Long Table Format', eval=TRUE}
knitr::include_graphics('img/wide-vs-long.png')
```

The choice of data format affects readability. For humans, the wide format is often more intuitive, since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The `ID` variables in our dataframes are similar to the fields in a database and observed variables are like the database values.


> Challenge 1
>
> Is trafficstops in a long or wide format?
>

## Long to Wide with `pivot_wider`

Now let's see this in action. First, using **`dplyr`**, let's create a data frame
with the mean age of each driver by gender and county:

```{r, results='show', purl=FALSE}
trafficstops_ma <- trafficstops %>%
    filter(!is.na(driver_gender)) %>%
    group_by(county_name, driver_gender) %>%
    summarize(mean_age = mean(driver_age, na.rm = TRUE))

trafficstops_ma
```

Now, to make this long data wide, we use `pivot_wider` from `tidyr` to turn the driver gender into columns. In addition to our data table we provide `pivot_wider` with two arguments: `names_from` describes which column to use for name of the output column, and `values_from` tells it from column to get the cell values. We'll use a pipe so we can ignore the data argument.

```{r, results='show', purl=FALSE}
trafficstops_ma_wide <- trafficstops_ma %>%
  pivot_wider(names_from = driver_gender, 
              values_from = mean_age) 

trafficstops_ma_wide
```

We can now do things like compare the mean age of men against women drivers. As example we use the age difference to find the counties with the largest and with the smallest number. (A negative number means that female drivers are on average older than male drivers, a positive number means that male drivers are on average older than women drivers.)

```{r, results='show', purl=FALSE}
trafficstops_ma_wide %>% 
  mutate(agediff = male - female) %>% 
  ungroup() %>%
  filter(agediff %in% range(agediff))
```

Note that `trafficstops_ma_wide` is derived from `trafficstops_ma`, and is a "grouped" data frame, which was created with the `group_by` function above. (Check `class(trafficstops_ma)` and `class(trafficstops_ma_wide)`). That means that any instruction that follows will operate on each group (in this case county) separately. That may be ok for some instances (like `mutate`), but if we are interested in retrieving the maximum and the minumim age difference over *all* counties we need to `ungroup` the tibble to have the `filter` command operate on the entire dataset instead of each group (i.e. county). 


## Wide to long with `pivot_longer`

What if we had the opposite problem, and wanted to go from a wide to long
format? For that, we use `pivot_longer`, which will increase the number of rows and decrease the number of columns. We provide the functino with thee arguments: `cols` which are the columns we want to pivot into the long format, `names_to`, which is a string specifying the name of the column to create from the data stored in the column names, and `values_to`, which is also a string, specifying the name of the column to create from the data stored in cell values.
So, to go backwards from `trafficstops_ma_wide`, and exclude `county_name` from the long, we would do the following:

```{r, results='show', purl=FALSE}
trafficstops_ma_long <- trafficstops_ma_wide %>%
  pivot_longer(cols = -county_name, 
               names_to = "driver_gender", # name is a string!
               values_to = "mean_age")     # also a string

trafficstops_ma_long
```

We could also have used a specification for what columns to include. This can be
useful if you have a large number of identifying columns, and it's easier to
specify what to gather than what to leave alone. And if the columns are adjacent to each other, we don't even need to list them all out -- we can use the `:` operator!

```{r, results='show', purl=FALSE}
trafficstops_ma_wide %>%
  pivot_longer(cols = male:female,      # this also works
               names_to = "driver_gender", 
               values_to = "mean_age")
```

There are many powerful operations you can do with the `pivot_*` functions. To learn more review the vignette:

```{r, results='show', eval=FALSE}
vignette("pivot")
```

> <h3>Challenge</h3>
>
> 1.From the trafficstops dataframe create a wide data frame `tr_wide` with 
> "year" as columns, each row is a different violation ("violation_raw"), 
> and the values are the 
> number of traffic stops per each violation, roughly like this:
> 
>` violation_raw  | 2013 | 2014 | 2015 ...`
> `Improper turn  |   65 |    54|   67 ...`
> `Speeding       |  713 |   948|  978 ...`
> `...`
>
> Use `year()` from the lubridate package. Hint: You will need to summarize 
> and count the traffic stops before reshaping the table.

> 2. Now take the data frame, and make it long again, so each row is a 
> unique violation_raw - year combination, like this:
> 
> `violation_raw | year | n of stops`
> `Improper turn | 2013 | 65`
> `Improper turn | 2014 | 54`
> `... etc`

```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Reshaping challenges

## From the trafficstops dataframe create a wide data frame `tr_wide` 
## with "year" as columns, with "violation_raw" as rows, 
## and the values are the number of traffic stops per each violation,
## roughly like this:

## violation_raw  | 2013 | 2014 | 2015 ...
## Improper turn  |   65 |    54|   67 ...
## Speeding       |  713 |   948|  978 ...
## ...

## Use year() from the lubridate package. Hint: You will need to summarize 
## and count the traffic stops before reshaping the table.

## Now take the data frame, and make it long again, so each row is a 
## unique violation_raw - year combination, like this:

## violation_raw | year | n of stops
## Improper turn | 2013 | 65
## Improper turn | 2014 | 54
## ... etc

```

<!---
  ```{r, echo=FALSE, purl=FALSE}
## Answer 1
library(lubridate)
tr_wide <- trafficstops %>%
  mutate(stop_date = ymd(stop_date),
         year = year(stop_date)) %>% 
  group_by(violation_raw, year) %>%
  summarize(n_stops = n()) %>%
  pivot_wider(names_from = year, values_from = n_stops)

head(tr_wide)

## Answer 2
tr_wide %>%
  pivot_longer(cols = -violation_raw, names_to = "year", values_to = "n_stops")

```
--->


## Exporting data 

Similar to the `read.csv()` function used for reading CSV files into R, there is a `write.csv()` function that generates CSV files from data frames.

Before using `write.csv()`, we are going to create a new folder, `data_output`,
in our working directory that will store this generated dataset. We don't want
to write generated datasets in the same directory as our raw data. It's good
practice to keep them separate. The `data` folder should only contain the raw,
unaltered data, and should be left alone to make sure we don't delete or modify
it. In contrast, our script will generate the contents of the `data_output`
directory, so even if the files it contains are deleted, we can always
re-generate them.

We can save the table generated above  in our `data_output`
folder. By default, `write.csv()` includes a column with row names (in our case
these names are the row numbers), so we need to add `row.names = FALSE` so
they are not included:

```{r, purl=FALSE, eval=FALSE}
write.csv(trafficstops_ma_wide, "data_output/MS_county_stops.csv", row.names = F)
```

```{r, purl=FALSE, echo=FALSE}
if (!file.exists("data_output")) dir.create("data_output")

write.csv(trafficstops_ma_wide, "data_output/MS_county_stops.csv", row.names = F)
```