---
title: "Parse Files"
author: "Matei Ionita"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data manifest from file names
Load the tidyverse, which contains some useful functions for data wrangling.
```{r tidyverse, message=FALSE, warning=FALSE}
library(tidyverse)
```
Say that you have multiple data files in different folders, organized for
example by medical condition. We'll explore how to create a data
manifest on the fly, based on the file names and the directory structure.
For this exercise, I created two directories called "cond1"
and "cond2", and placed them in a parent directory called "data".
Then I created two (empty) .txt files in each directory.
If you want to follow along on your own system:

* create the same file structure yourself;
* change the `data_base` variable below to whatever your parent directory is.

By default, paths are interpreted relative to your working directory;
run the `getwd()` command if you're not sure where that is. Alternatively, you
can provide a path that starts from your home folder on your machine,
denoted by `~`. An example is Wade's use of
`proj_base = "~/Data/Independent_Consulting/Penn/Matei/"`.
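
If you would rather build the example layout from R than by hand, the following is a minimal sketch. It works inside a temporary directory so nothing in your project is touched. The file names are hypothetical, chosen only to follow the tissue_subject naming that is parsed later in this document.

```{r make_example_files}
# Sketch only: these file names are made up for illustration.
example_base <- file.path(tempdir(), "data")
example_files <- file.path(
  example_base,
  c("cond1", "cond1", "cond2", "cond2"),
  c("blood_subject1.txt", "spleen_subject2.txt",
    "blood_subject3.txt", "spleen_subject4.txt")
)

# Create cond1/ and cond2/ (and the parent directory) if they do not exist yet
for (d in unique(dirname(example_files))) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Create the four empty files, then list what is there
created <- file.create(example_files)
list.files(example_base, recursive = TRUE)
```

To use this layout in the rest of the document, you would point `data_base` at `example_base`.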
Once you complete these prerequisites, you can use the `list.files` command
to find files in a given location whose names match a given pattern.
In this case, I'm looking for the .txt files I created. For your project,
you may want to replace .txt with .fcs.
The argument `recursive=TRUE` makes the search descend into subdirectories of `data_base`.
Use `?list.files` to read the documentation of this function and learn more.
```{r read}
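data_base <- "data" # Parent directory; change this to match your system
# `pattern` is a regular expression: escape the dot and anchor it at the end of the name
files <- list.files(path = data_base, pattern = "\\.txt$", recursive = TRUE)
files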
```
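
Keep in mind that `pattern` is a regular expression, not a shell wildcard, so an unescaped `.` matches any character. The sketch below applies `grepl` to a few hypothetical file names to compare a loose pattern with an escaped, anchored one.

```{r pattern_regex}
# Hypothetical file names, for illustration only
candidates <- c("blood_subject1.txt",
                "notes_txt_backup.csv",
                "blood_subject3.txt.bak")

# Loose pattern: any character followed by "txt", anywhere in the name
loose <- grepl(".txt", candidates)

# Escaped and anchored: a literal ".txt" at the very end of the name
anchored <- grepl("\\.txt$", candidates)

data.frame(candidates, loose, anchored)
```

The loose pattern accepts all three names, while the anchored one keeps only the first.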
Let's start creating the manifest. We create a tibble (fancy name for a data
frame) which initially has just one column, the file path. Then we use `mutate`
to add additional columns. Note the pipe operator `%>%`, which takes the
output of the previous command and passes it as the first argument to the next.
```{r manifest}
manifest <- tibble(path = files) %>% # Data frame with one column, the file path
  mutate(filename = path %>%
           str_split(pattern="/") %>% # Split the path on "/"
           sapply("[", 2), # Filename is the second piece
         condition = path %>%
           str_split(pattern="/") %>%
           sapply("[", 1)) # Condition is the first piece (directory name)
manifest
```
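
Picking pieces by position works for this two-level layout, but it depends on how deeply the files are nested. As an alternative sketch, base R's `basename` and `dirname` take the last path component and everything before it, whatever the depth. The paths below are hypothetical stand-ins for the contents of `files`.

```{r basename_dirname}
library(tidyverse)

# Hypothetical relative paths, for illustration only
example_paths <- c("cond1/blood_subject1.txt", "cond2/spleen_subject4.txt")

manifest_alt <- tibble(path = example_paths) %>%
  mutate(filename = basename(path),             # Last component of the path
         condition = basename(dirname(path)))   # Name of the enclosing directory
manifest_alt
```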
Let's go further and extract the tissue type and subject name from the filename.
We now have to split on two characters, "_" and ".". For this we use the
regular expression "[_.]+", which matches one or more underscores or periods.
```{r split_file_name, tibble.width=Inf}
manifest <- manifest %>%
  mutate(tissue = filename %>%
           str_split(pattern="[_.]+") %>% # Split string on multiple characters
           sapply("[", 1), # Tissue is first piece
         subject = filename %>%
           str_split(pattern="[_.]+") %>%
           sapply("[", 2)) # Subject name is second piece
manifest
```
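
The same split can be done in one step with `separate` from tidyr, which is loaded as part of the tidyverse. This is a sketch on hypothetical file names, assuming every name has exactly three pieces (tissue, subject, extension).

```{r separate_filename}
library(tidyverse)

# Hypothetical file names, for illustration only
example_manifest <- tibble(filename = c("blood_subject1.txt", "spleen_subject4.txt"))

parsed <- example_manifest %>%
  separate(filename,
           into = c("tissue", "subject", "extension"),
           sep = "[_.]+",    # Same regular expression as above
           remove = FALSE)   # Keep the original filename column
parsed
```

Names with more or fewer pieces would trigger a warning from `separate`, which can be a useful sanity check on your naming scheme.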
## Joining with analysis results
Assume now that you did some analysis on your files, and came up
with some values for a biomarker in each of the files. For this
exercise, I will create some dummy results instead of actually
running an analysis.
```{r dummy_results}
results <- tibble(file = sort(files, decreasing=TRUE),
                  biomarker = c(1, 1, 5, 6))
results
```
Suppose that, for some reason, the order of your files changed during the analysis.
If you naively paste the results next to your manifest (using the
`cbind` function), rows are paired by position, and you will mix up the
association between subjects and results. To avoid this you should do a join.
```{r join_data}
manifest_results <- inner_join(manifest, results, by=c("path"="file"))
manifest_results
```
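
To make the difference concrete, here is a small sketch with made-up data in which the results arrive in reverse order. Binding by position pairs every file with the wrong biomarker, while the join matches rows on the file path.

```{r bind_vs_join}
library(tidyverse)

# Made-up manifest and results, for illustration only
example_manifest <- tibble(path = c("cond1/blood_s1.txt", "cond2/blood_s2.txt"),
                           subject = c("s1", "s2"))
example_results <- tibble(file = c("cond2/blood_s2.txt", "cond1/blood_s1.txt"),
                          biomarker = c(20, 10))

# Binding by position: row 1 of the manifest is glued to row 1 of the results
by_position <- bind_cols(example_manifest, example_results)
by_position$path == by_position$file

# Joining by key: each path is matched to the result for the same file
by_key <- inner_join(example_manifest, example_results, by = c("path" = "file"))
by_key
```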
Now everything looks good, and you can do some statistics. In this case,
we explicitly told `inner_join` which columns to use for matching. By
default, it uses all columns with common names between the two
data frames, and throws an error if it doesn't find any.
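
One thing to watch for is that `inner_join` silently drops rows that have no match in the other table. A quick check, sketched here on made-up data, is to run `anti_join` with the same `by` argument, which returns the manifest rows that have no result.

```{r join_checks}
library(tidyverse)

# Made-up data, for illustration only: the third file has no result
example_manifest <- tibble(path = c("cond1/blood_s1.txt",
                                    "cond2/blood_s2.txt",
                                    "cond2/blood_s3.txt"))
example_results <- tibble(file = c("cond2/blood_s2.txt", "cond1/blood_s1.txt"),
                          biomarker = c(20, 10))

# Rows of the manifest that would be lost in an inner join
missing_results <- anti_join(example_manifest, example_results,
                             by = c("path" = "file"))
missing_results
```

An empty result here means every file in the manifest was matched.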