37-medulloblastoma_recount2_model.Rmd

---
title: "Medulloblastoma: applying MultiPLIER"
output:   
  html_notebook: 
    toc: true
    toc_float: true
---

**J. Taroni 2018**

In this notebook we'll prep medulloblastoma expression data (luckily the 
metadata is already in pretty good shape!) and apply MultiPLIER/recount2 model.

We're using data from the following publications:

> Northcott PA, Shih DJ, Peacock J, et al. [Subgroup-specific structural 
variation across 1,000 medulloblastoma 
genomes.](https://dx.doi.org/10.1038/nature11327) _Nature._ 
2012;488(7409):49-56. ([`GSE37382`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37382))

> Robinson G, Parker M, Kranenburg TA, Lu C et al. [Novel mutations target 
distinct subgroups of medulloblastoma.](https://dx.doi.org/10.1038/nature11213)
_Nature._ 2012 Aug 2;488(7409):43-8. ([`GSE37418`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37418))

It has been processed with `SCAN::SCANfast` through 
[refine.bio](https://www.refine.bio/).

## Set up

#### Libraries

```{r}
# we need this library to convert from Ensembl gene identifiers to gene symbols
# for use with PLIER
library(org.Hs.eg.db)
```

#### Functions

```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`
# need PrepExpressionDF
source(file.path("util", "test_LV_differences.R"))
# need GetNewDataB
source(file.path("util", "plier_util.R"))
```

Function for working with refine.bio-processed data (e.g., uses Ensembl
gene identifiers that must be converted to gene symbols).

```{r}
MBExpressionPrep <- function(exprs.df) {
  
  # mapIds default behavior -- select whatever it finds first in the case of
  # 1:many mappings
  gene.symbols.df <- 
    AnnotationDbi::mapIds(org.Hs.eg.db, keys = exprs.df$Gene, 
                          column = "SYMBOL", keytype = "ENSEMBL") %>%
    as.data.frame() %>%
    tibble::rownames_to_column("Ensembl")
  colnames(gene.symbols.df)[2] <- "Symbol"
  
  # tack on the gene symbols
  annot.exprs.df <- gene.symbols.df %>%
    dplyr::inner_join(exprs.df, by = c("Ensembl" = "Gene")) %>%
    dplyr::select(-Ensembl)
  colnames(annot.exprs.df)[1] <- "Gene"
  
  # if there are any duplicate gene symbols, use PrepExpressionDF
  if (any(duplicated(annot.exprs.df$Gene))) {
    agg.exprs.df <- PrepExpressionDF(exprs = annot.exprs.df) %>%
      dplyr::filter(!is.na(Gene))
  } else {
    return(annot.exprs.df %>% dplyr::filter(!is.na(Gene)))
  }
  
}
```

#### Directory setup

```{r}
# directory that holds all the gene expression matrices
exprs.dir <- file.path("data", "expression_data")
# directories specific to this notebook
plot.dir <- file.path("plots", "37")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "37")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```

## Read in data

### Expression data

GSE37382

```{r}
northcott.exprs.file <- file.path(exprs.dir, "GSE37382_SCAN.pcl")
northcott.exprs.df <- readr::read_tsv(northcott.exprs.file)
```

GSE37418

```{r}
robinson.exprs.file <- file.path(exprs.dir, "GSE37418.tsv")
robinson.exprs.df <- readr::read_tsv(robinson.exprs.file)
colnames(robinson.exprs.df)[1] <- "Gene"
```

### PLIER model

```{r}
recount.file <- file.path("data", "recount2_PLIER_data", 
                          "recount_PLIER_model.RDS")
recount.plier <- readRDS(recount.file)
```

## Gene identifier conversion and aggregation

We're going to use the default behavior of `mapIds` in
`MBExpressionPrep`, i.e., if there are 1:many mappings, just return the first.

### GSE37382 

Prepare the Northcott, et al. data

```{r}
northcott.prepped.df <- MBExpressionPrep(northcott.exprs.df)
```

```{r}
northcott.agg.file <- file.path(exprs.dir, "GSE37382_mean_agg.pcl")
readr::write_tsv(northcott.prepped.df, path = northcott.agg.file)
```

Remove the `data.frame` that are large and no longer necessary.

```{r}
rm(northcott.exprs.df)
```

### GSE37481

Prep the Robinson, et al. data

```{r}
robinson.prepped.df <- MBExpressionPrep(robinson.exprs.df)
```

```{r}
robinson.agg.file <- file.path(exprs.dir, "GSE37418_mean_agg.pcl")
readr::write_tsv(robinson.prepped.df, path = robinson.agg.file)
```

## MultiPLIER

Once the data have been prepped, apply MulitPLIER model. 
We'll use the short wrapper function below since we have to do this twice.
It is only intended to be used in this environment (e.g., `recount.plier` is
in the global environment) which is why we've placed it here.

```{r}
MBMultiPLIER <- function(agg.exprs.df, output.file) {
  
  # need a matrix where the gene symbols are row names
  exprs.mat <- agg.exprs.df %>%
    tibble::column_to_rownames("Gene") %>%
    as.matrix()
  
  # apply the MultPLIER model
  b.matrix <- GetNewDataB(exprs.mat = exprs.mat,
                          plier.model = recount.plier)
  
  # save to file!
  saveRDS(b.matrix, output.file)
}
```

Northcott, et al. data

```{r}
MBMultiPLIER(agg.exprs.df = northcott.prepped.df,
             output.file = file.path(results.dir, 
                                     "GSE37382_recount2_B.RDS"))
```

Robinson, et al. data

```{r}
MBMultiPLIER(agg.exprs.df = robinson.prepped.df,
             output.file = file.path(results.dir, 
                                     "GSE37418_recount2_B.RDS"))
```