-
Notifications
You must be signed in to change notification settings - Fork 11
/
34-DIPG_data_cleaning.Rmd
280 lines (221 loc) · 8.71 KB
/
34-DIPG_data_cleaning.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
---
title: "DIPG: data cleaning"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
In this notebook, we'll clean diffuse intrinsic pontine glioma (DIPG) data.
There is no DIPG data in recount2, so this is another use case for this
project.
We'll be working with two datasets stored in the
[`greenelab/rheum-plier-data`](https://github.com/greenelab/rheum-plier-data)
repository:
* [`E-GEOD-26576`](https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-26576/)
* [`GSE50021`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50021)
**Citations:**
> Paugh BS, Broniscer A, Qu C, et al. [Genome-wide analyses identify recurrent
amplifications of receptor tyrosine kinases and cell-cycle regulatory genes in
diffuse intrinsic pontine glioma.](https://dx.doi.org/10.1200/JCO.2011.35.5677)
_J Clin Oncol._ 2011;29(30):3999-4006.
> Buczkowicz P, Hoeman C, Rakopoulos P, et al. [Genomic analysis of diffuse
intrinsic pontine gliomas identifies three molecular subgroups and recurrent
activating _ACVR1_ mutations.](https://dx.doi.org/10.1038/ng.2936) _Nat Genet._
2014;46(5):451-6.
## Set up
```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`
# we need the function that aggregates duplicate gene identifiers to the
# mean value
source(file.path("util", "test_LV_differences.R"))
```
#### Directory setup
```{r}
# directory that holds the gene expression files
exprs.dir <- file.path("data", "expression_data")
# directory that holds the sample metadata
sample.info.dir <- file.path("data", "sample_info")
```
## Read in and clean data
### GSE50021
We have the series matrix for `GSE50021`, which contains both the expression
values and the metadata
```{r}
series.mat.file <- file.path(exprs.dir, "GSE50021_series_matrix.txt")
# expression matrix -- everything but the comment lines that begin with !
ma.data <-
readr::read_delim(series.mat.file,
delim = "\t",
comment = "!",
col_names = TRUE,
skip = 1)
```
#### Gene identifier conversion
```{r}
# The GPL information from GEO, which was made public on Jul 18, 2011 and
# last updated on Jan 18, 2013
gpl.info.df <- readr::read_tsv(file.path(exprs.dir, "GPL13938-11302.txt"),
comment = "#")
```
```{r}
annot.gse50021.df <- gpl.info.df %>%
# from the GEO information, grab just the probe identifier and the gene
# symbol columns
dplyr::select(c(ID, Symbol)) %>%
# only ILMN IDs (probes) in both
dplyr::inner_join(ma.data, by = c("ID" = "ID_REF")) %>%
# collapsing duplicate symbols later will require the symbols to be in the
# first column, called "Gene", with no additional columns
dplyr::mutate(Gene = Symbol) %>%
dplyr::select(-ID, -Symbol) %>%
dplyr::select(Gene, dplyr::everything())
```
Collapse duplicate symbols and write to file.
```{r}
# summarize to mean
annot.mean.df <- PrepExpressionDF(annot.gse50021.df)
readr::write_tsv(annot.mean.df, file.path(exprs.dir, "GSE50021_mean_agg.pcl"))
```
#### Sample metadata
As mentioned above, metadata is also extracted from the series matrix file.
We do this for a single line at a time that we've picked based on the relevance
to any downstream analysis we might do (contact information, for example,
does not help us in this context).
We'll write a custom function specifically for this context and environment
```{r}
# given a line number of the series matrix file (series.mat.file),
# get the values
GetSampleAttributes <- function(skip.value) {
conn <- file(series.mat.file)
open(conn)
sample.attributes <- read.table(conn, skip = skip.value, nrow = 1)
close(conn)
return(sample.attributes)
}
# sample accession e.g., GSMXXXXX
sample.accession <- GetSampleAttributes(skip.value = 79)
# source name
source.name <- GetSampleAttributes(skip.value = 85)
# tissue
tissue <- GetSampleAttributes(skip.value = 88)
# gender
gender <- GetSampleAttributes(skip.value = 89)
# age at diagnosis
age.dx <- GetSampleAttributes(skip.value = 90)
# overall survival
survival <- GetSampleAttributes(skip.value = 91)
# get those lines into data.frame format
smpl.info.df <- as.data.frame(t(dplyr::bind_rows(sample.accession,
source.name,
tissue,
gender,
age.dx,
survival))[-1, ])
colnames(smpl.info.df) <- c("sample_accession", "source_name", "tissue",
"gender", "age_at_diagnosis_yrs",
"overall_survival_yrs")
# strip extraneous strings
smpl.info.df <- smpl.info.df %>%
dplyr::mutate(tissue = gsub("cell type: ", "", tissue),
gender = gsub("gender: ", "", gender),
age_at_diagnosis_yrs = gsub("age at dx \\(years\\): ", "",
age_at_diagnosis_yrs),
overall_survival_yrs = gsub("os \\(years\\): ", "",
overall_survival_yrs))
# change "N/A" to NA
smpl.info.df[which(smpl.info.df == "N/A", arr.ind = TRUE)] <- NA
```
Write the cleaned metadata to a TSV file.
```{r}
readr::write_tsv(smpl.info.df,
file.path(sample.info.dir, "GSE50021_cleaned_metadata.tsv"))
```
Clean up the workspace a bit before working with the next dataset.
```{r}
to.keep <- c("%>%", "exprs.dir", "sample.info.dir", "PrepExpressionDF")
rm(list = setdiff(ls(), to.keep))
```
### E-GEOD-26576
We used an Entrez ID BrainArray package to process this data set and we need
gene symbols to work with PLIER.
```{r}
# SCANfast processed PCL
gse26576.file <- file.path(exprs.dir,
"DIPG_E-GEOD-26576_hgu133plus2_SCANfast.pcl")
# read in the PCL file, remove the trailing "_at" added by Brainarray (these
# are Entrez gene identifiers), drop the Entrez IDs with _at appended, reordered
# such that the gene identifiers are the first column
gse26576.df <- readr::read_tsv(gse26576.file) %>%
dplyr::mutate(EntrezID = sub("_at", "", X1)) %>%
dplyr::select(-X1) %>%
dplyr::select(EntrezID, dplyr::everything())
```
#### Gene identifier conversion
```{r}
# extract the Entrez ID Gene symbol mapping from org.Hs.eg.db
symbol.obj <- org.Hs.eg.db::org.Hs.egSYMBOL
mapped.genes <- AnnotationDbi::mappedkeys(symbol.obj)
symbol.list <- as.list(symbol.obj[mapped.genes])
symbol.df <- as.data.frame(cbind(names(symbol.list), unlist(symbol.list)))
colnames(symbol.df) <- c("EntrezID", "GeneSymbol")
```
Join the annotation `data.frame` to the expression `data.frame`
```{r}
annot.gse26576.df <- symbol.df %>%
dplyr::inner_join(gse26576.df, by = "EntrezID")
rm(symbol.df)
```
Are there any duplicates?
```{r}
any(duplicated(annot.gse26576.df$GeneSymbol))
```
No, so we don't need to do anything else.
Write the `data.frame` that includes gene symbols to file.
```{r}
gse26576.output.file <-
file.path(exprs.dir,
"DIPG_E-GEOD-26576_hgu133plus2_SCANfast_with_GeneSymbol.pcl")
readr::write_tsv(annot.gse26576.df, path = gse26576.output.file)
```
#### Sample Metadata
```{r}
meta.gse26576.file <- file.path(sample.info.dir, "E-GEOD-26576.sdrf.txt")
meta.gse26576.df <- readr::read_tsv(meta.gse26576.file)
```
The sample-data relationship files from ArrayExpress (specifically, their
column names) are pretty tidyverse-unfriendly.
```{r}
cleaned.meta.df <- data.frame(
sample_id = gsub(" 1", "", meta.gse26576.df$`Source Name`),
sample_file = meta.gse26576.df$`Array Data File`,
sample_title = meta.gse26576.df$`Comment [Sample_title]`,
age_at_diagnosis = meta.gse26576.df$`Characteristics[age at diagnosis (years)]`,
disease_state = meta.gse26576.df$`Characteristics[disease]`,
histology = meta.gse26576.df$`Characteristics[histology]`,
sample_collection = meta.gse26576.df$`Characteristics[sample collection]`,
material_type = meta.gse26576.df$`Material Type`
) %>%
# get rid of the genomic DNA samples
dplyr::filter(material_type == "total RNA") %>%
dplyr::select(-material_type)
```
There is information in the `sample_title` field that can help us fill in
the `disease_state` blanks.
```{r}
cleaned.meta.df <- cleaned.meta.df %>%
dplyr::mutate(disease_state = dplyr::case_when(
grepl("normal", cleaned.meta.df$sample_title) ~ "normal",
grepl("low", cleaned.meta.df$sample_title) ~ "LGG",
grepl("DIPG", cleaned.meta.df$sample_title) ~ "DIPG",
grepl("Glioblastoma", cleaned.meta.df$sample_title) ~ "Glioblastoma"
))
```
The `sample_file` field will match the headers of the PCL file.
Write the cleaned metadata to file.
```{r}
readr::write_tsv(cleaned.meta.df,
path = file.path(sample.info.dir,
"E-GEOD-26576_cleaned_metadata.tsv"))
```