-
Notifications
You must be signed in to change notification settings - Fork 11
/
37-medulloblastoma_recount2_model.Rmd
195 lines (146 loc) · 5.11 KB
/
37-medulloblastoma_recount2_model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: "Medulloblastoma: applying MultiPLIER"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
In this notebook we'll prep medulloblastoma expression data (luckily the
metadata is already in pretty good shape!) and apply MultiPLIER/recount2 model.
We're using data from the following publications:
> Northcott PA, Shih DJ, Peacock J, et al. [Subgroup-specific structural
variation across 1,000 medulloblastoma
genomes.](https://dx.doi.org/10.1038/nature11327) _Nature._
2012;488(7409):49-56. ([`GSE37382`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37382))
> Robinson G, Parker M, Kranenburg TA, Lu C et al. [Novel mutations target
distinct subgroups of medulloblastoma.](https://dx.doi.org/10.1038/nature11213)
_Nature._ 2012 Aug 2;488(7409):43-8. ([`GSE37418`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37418))
It has been processed with `SCAN::SCANfast` through
[refine.bio](https://www.refine.bio/).
## Set up
#### Libraries
```{r}
# we need this library to convert from Ensembl gene identifiers to gene symbols
# for use with PLIER
library(org.Hs.eg.db)
```
#### Functions
```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`
# need PrepExpressionDF
source(file.path("util", "test_LV_differences.R"))
# need GetNewDataB
source(file.path("util", "plier_util.R"))
```
Function for working with refine.bio-processed data (e.g., uses Ensembl
gene identifiers that must be converted to gene symbols).
```{r}
MBExpressionPrep <- function(exprs.df) {
# mapIds default behavior -- select whatever it finds first in the case of
# 1:many mappings
gene.symbols.df <-
AnnotationDbi::mapIds(org.Hs.eg.db, keys = exprs.df$Gene,
column = "SYMBOL", keytype = "ENSEMBL") %>%
as.data.frame() %>%
tibble::rownames_to_column("Ensembl")
colnames(gene.symbols.df)[2] <- "Symbol"
# tack on the gene symbols
annot.exprs.df <- gene.symbols.df %>%
dplyr::inner_join(exprs.df, by = c("Ensembl" = "Gene")) %>%
dplyr::select(-Ensembl)
colnames(annot.exprs.df)[1] <- "Gene"
# if there are any duplicate gene symbols, use PrepExpressionDF
if (any(duplicated(annot.exprs.df$Gene))) {
agg.exprs.df <- PrepExpressionDF(exprs = annot.exprs.df) %>%
dplyr::filter(!is.na(Gene))
} else {
return(annot.exprs.df %>% dplyr::filter(!is.na(Gene)))
}
}
```
#### Directory setup
```{r}
# directory that holds all the gene expression matrices
exprs.dir <- file.path("data", "expression_data")
# directories specific to this notebook
plot.dir <- file.path("plots", "37")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "37")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
## Read in data
### Expression data
GSE37382
```{r}
northcott.exprs.file <- file.path(exprs.dir, "GSE37382_SCAN.pcl")
northcott.exprs.df <- readr::read_tsv(northcott.exprs.file)
```
GSE37418
```{r}
robinson.exprs.file <- file.path(exprs.dir, "GSE37418.tsv")
robinson.exprs.df <- readr::read_tsv(robinson.exprs.file)
colnames(robinson.exprs.df)[1] <- "Gene"
```
### PLIER model
```{r}
recount.file <- file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS")
recount.plier <- readRDS(recount.file)
```
## Gene identifier conversion and aggregation
We're going to use the default behavior of `mapIds` in
`MBExpressionPrep`, i.e., if there are 1:many mappings, just return the first.
### GSE37382
Prepare the Northcott, et al. data
```{r}
northcott.prepped.df <- MBExpressionPrep(northcott.exprs.df)
```
```{r}
northcott.agg.file <- file.path(exprs.dir, "GSE37382_mean_agg.pcl")
readr::write_tsv(northcott.prepped.df, path = northcott.agg.file)
```
Remove the `data.frame` that are large and no longer necessary.
```{r}
rm(northcott.exprs.df)
```
### GSE37481
Prep the Robinson, et al. data
```{r}
robinson.prepped.df <- MBExpressionPrep(robinson.exprs.df)
```
```{r}
robinson.agg.file <- file.path(exprs.dir, "GSE37418_mean_agg.pcl")
readr::write_tsv(robinson.prepped.df, path = robinson.agg.file)
```
## MultiPLIER
Once the data have been prepped, apply MulitPLIER model.
We'll use the short wrapper function below since we have to do this twice.
It is only intended to be used in this environment (e.g., `recount.plier` is
in the global environment) which is why we've placed it here.
```{r}
MBMultiPLIER <- function(agg.exprs.df, output.file) {
# need a matrix where the gene symbols are row names
exprs.mat <- agg.exprs.df %>%
tibble::column_to_rownames("Gene") %>%
as.matrix()
# apply the MultPLIER model
b.matrix <- GetNewDataB(exprs.mat = exprs.mat,
plier.model = recount.plier)
# save to file!
saveRDS(b.matrix, output.file)
}
```
Northcott, et al. data
```{r}
MBMultiPLIER(agg.exprs.df = northcott.prepped.df,
output.file = file.path(results.dir,
"GSE37382_recount2_B.RDS"))
```
Robinson, et al. data
```{r}
MBMultiPLIER(agg.exprs.df = robinson.prepped.df,
output.file = file.path(results.dir,
"GSE37418_recount2_B.RDS"))
```