hands-on.Rmd

---
title: "MSU - Week 3: Statistical models of Differential Expression"
date: "08/15/2015"
output: html_document
layout: topic
bibliography: references.bib
---

```{r, echo=FALSE}
library(knitcitations)
cleanbib()
options("citation_format" = "pandoc")
```

# Setting up

First you are going to need to [download the dataset](https://github.com/mistrm82/msu_ngs2015/raw/master/bottomly_eset.RData). By clicking on the link the `.RData` file should automatically donwload to your laptop, or you can right click to save. Keep note of where it gets saved.

Now, you can open up`RStudio` and open up a new project for this module (`File > New project > Create new project`). You can call this project `DE_analysis`, and in the bottom right panel you should see where you are in your filesystem. This is your *working* directory, move the `.RData` file over to your working directory.

Let's start by loading the filtered data and the required libraries:
```{r load}

# Install requested libraries from Bioconductor

# try http:// if https:// URLs are not supported

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
source("https://bioconductor.org/biocLite.R")
biocLite("limma")

# Load required libraries 
library(DESeq2)
library(edgeR)
library(limma)
library(Biobase)

load('bottomly_eset.RData')
```

Take a look at your environment, and you should find that three ExpressionSet objects have been loaded along with all the required libraries. _If you have errors with this, you may have to re-install the package. Please take a look at the ['setting up' markdown](https://github.com/mistrm82/msu_ngs2015/blob/master/setting-up.Rmd) for more details on how the data was prepared for this module._

The three datasets correspond to the full [Bottomly](http://www.ncbi.nlm.nih.gov/pubmed?term=21455293) dataset that is available on [ReCount](http://bowtie-bio.sourceforge.net/recount/), and two smaller datasets generated by randomly selecting samples from the full set. The full set contains data from two different mouse strains C57Bl and DBA, with 10 and 11 replicates, respectively. We created another dataset with 5 replicates for each strain (n = 10), and one with 2 replicates for each strain (n = 4). **How does the relationship between mean and variance change as you increase the number of replicates?**

```{r mean-variance, fig.align='center'}

# Plot 2 replicate dataset
eset <- bottomly.2reps
cpm.mat <- log(cpm(exprs(eset)))
mean.vec <- apply(cpm.mat, 1, mean)
sdvec <- apply(cpm.mat, 1, sd)
plot(mean.vec, sdvec, pch=".", main="2 replicates", ylab="sd", xlab="Average logCPM")

# Plot 5 replicate dataset
eset <- bottomly.5reps
cpm.mat <- log(cpm(exprs(eset)))
mean.vec <- apply(cpm.mat, 1, mean)
sdvec <- apply(cpm.mat, 1, sd)
plot(mean.vec, sdvec, pch=".", main="5 replicates", ylab="sd", xlab="Average logCPM")

# Plot 10 replicate dataset
eset <- bottomly.eset
cpm.mat <- log(cpm(exprs(eset)))
mean.vec <- apply(cpm.mat, 1, mean)
sdvec <- apply(cpm.mat, 1, sd)
plot(mean.vec, sdvec, pch=".", main="10 replicates", ylab="sd", xlab="Average logCPM")

```

# The mean-variance relationship (and what it means)

In RNA-Seq data, **genes with larger average expression have on average larger observed variances across samples**, that is, they vary in expression from sample to sample more than other genes with lower average expression. This phenomena of 'having different scatter' is known as data **heteroscedasticity**. In our figures, we observe this heteroscedasticity except that the **logarithmic transformation counteracts** this and overdoes the adjustment somewhat so that large log-counts now have smaller scatter than small log-counts. 

More, importantly we see the **scatter tends to reduce as we increase the number of replicates**. Standard deviations of averages are smaller than standard deviations of individual observations. So as you add more data (replicates), you get increasingly precise estimates of group means. With more precise estimates, we are more easily able to tell apart means which are close together.

# Statistical models for DE analysis
So how do we best model this type of data? The low levels of replication rule out, for all practical purposes, distribution-free rank or permutation-based methods. The Poisson model doesn't work either -- it is a single parameter model with mean == variance and real data has more variance than Poisson can explain (which we will look at in our dataset). The Negative Binomial (NB) model is a good approximation, where the variability between replicates is modeled by the **dispersion parameter** `r citep("10.1038/nprot.2013.099")`. 

[edgeR](http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) and [DESeq2](https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf) are two very popular tools which implement differential expression analysis on the basis of the NB model. They offer overlapping functionality but differ in the way dispersions are estimated. Accurate estimation of the dispersion parameter is critical for the statistical inference of differential expression.


## DESeq2
Both methods utilize the concept of “borrowing information from genes”; using the whole dataset to compute a common value, trend, or prior distribution for the dispersions, and then shrink individual gene-wise dispersion estimates toward this chosen anchor. With DESeq2 the first round of estimates are computed using maximum likelihood, and a curve is fit to provide an estimate of expected dispersion. Then the **gene-wise dispersion estimates are shrunken toward the values predicted by the curve** to obtain final dispersion values `r citep("10.1186/s13059-014-0550-8")`. 

Look at the dispersion-mean plots below, What do you see?

```{r deseq2, fig.align='center', warning=FALSE, message=FALSE}
# Create DESeq2 datasets
dds <- DESeqDataSetFromMatrix(countData = exprs(bottomly.eset), colData = pData(bottomly.eset), 
design = ~ strain )
dds <- DESeq(dds)

dds.5rep <- DESeqDataSetFromMatrix(countData = exprs(bottomly.5reps), colData = pData(bottomly.5reps), 
design = ~ strain )
dds.5rep <- DESeq(dds.5rep)

dds.2rep <- DESeqDataSetFromMatrix(countData = exprs(bottomly.2reps), colData = pData(bottomly.2reps), 
design = ~ strain )
dds.2rep <- DESeq(dds.2rep)


# Plot dispersion estimates
plotDispEsts(dds)
plotDispEsts(dds.5rep)
plotDispEsts(dds.2rep)
```

The final estimates for datasets with so few replicates result in values that are so far off from the initial estimates. This is because the strength of shrinkage depends (i) on an estimate of how close true dispersion values tend to be to the fit and (ii) on the degrees of freedom. **As the sample size decreases, the shrinkage increases in strength.** In the case of a 2 replicate dataset it seems that for many genes the **dispersion is being well overestimated** which may lower the rate of true detection.

## edgeR
In edgeR, a common dispersion is first estimated for all the tags and then an empirical Bayes strategy is applied for **squeezing the tagwise dispersions towards the common or trended dispersion**. The 'Trended' and 'Common' dispersions are partly for diagnostic purposes and partly to give a general idea of the magnitude of the dispersions before it estimates the tagwise values (however, neither are used explicitly). **The amount of shrinkage is determined by** the prior weight given to the common dispersion (or the dispersion trend) and the precision of the tagwise estimates, and can be considered as the prior degrees of freedom `r citep("10.1093/bioinformatics/btp616")`.

Biological CV (BCV) is the coefficient of variation with which the (unknown) true abundance of the gene varies between replicate RNA samples. Dispersion estimates are obtained by squaring the BCV values. Look at the BCV-mean expression plots below, What do you see?


```{r edgeR, fig.align='center'}

dge <- DGEList(counts=exprs(bottomly.eset), group=pData(bottomly.eset)$strain)
# Normalize by total count
dge <- calcNormFactors(dge)

# Create the contrast matrix
design.mat <- model.matrix(~ 0 + dge$samples$group)
colnames(design.mat) <- levels(dge$samples$group)

# Estimate dispersion parameter for GLM
dge <- estimateGLMCommonDisp(dge, design.mat)
dge <- estimateGLMTrendedDisp(dge, design.mat, method="power")
dge<- estimateGLMTagwiseDisp(dge,design.mat)

# Do it all over again for 5 replicates
dge.5reps <- DGEList(counts=exprs(bottomly.5reps), group=pData(bottomly.5reps)$strain)
dge.5reps <- calcNormFactors(dge.5reps)
design.mat <- model.matrix(~ 0 + dge.5reps$samples$group)
colnames(design.mat) <- levels(dge.5reps$samples$group)

dge.5reps <- estimateGLMCommonDisp(dge.5reps, design.mat)
dge.5reps <- estimateGLMTrendedDisp(dge.5reps, design.mat, method="power")
dge.5reps<- estimateGLMTagwiseDisp(dge.5reps,design.mat)

# Do it all over again for 2 replicates
dge.2reps <- DGEList(counts=exprs(bottomly.2reps), group=pData(bottomly.2reps)$strain)
dge.2reps <- calcNormFactors(dge.2reps)
design.mat <- model.matrix(~ 0 + dge.2reps$samples$group)
colnames(design.mat) <- levels(dge.2reps$samples$group)

dge.2reps <- estimateGLMCommonDisp(dge.2reps, design.mat)
dge.2reps <- estimateGLMTrendedDisp(dge.2reps, design.mat, method="power")
dge.2reps<- estimateGLMTagwiseDisp(dge.2reps,design.mat)

# Plot mean-variance
plotBCV(dge)
plotBCV(dge.5reps)
plotBCV(dge.2reps)
```

The BCV value can be interpreted as the percentage of variation observed in the true abundance of a gene between replicates. For example, with the **full dataset the common BCV is 0.19 which indicates that the true abundance for a gene can vary by 19% up or down between replicates**. We find this number increases as the number of biological replicates decreases (0.20, 0.22 for five and two replicates, respectively). More importantly, the scatter in the two replicate dataset for tag-wise estimates is large for low expressor genes and BCV values range quite high (~0.8).


## limma-voom
Negative Binomial methods treat the dispersion estimates as if they were known parameters without accounting for uncertainty of estimation, which can make the testing overly liberal in type I error rate control (increased false positives). Empirical Bayes methods based on the NB don’t have the ability to adapt to different types of data with high or low dispersion parameter heterogeneity -- and can’t get the same robustness with small sample size datasets as you can with corresponding methods with microarrays `r citep("10.1186/gb-2014-15-2-r29")`.

Enter [limma-voom](http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf).

The voom (variance modeling at the observational level) transformation is applied to the read counts. **It uses the variance of genes to create weights for use in linear models.**

1. Gene-wise linear models are fitted to the normalized log-cpm values (accounting for all relevant factors); this generates a residual standard deviation for each gene 
2. A robust trend is then fitted to the residual standard deviations as a function of the average log-count for each gene
3. The fitted log-cpm (from the linear model fit) for each observation is converted into a predicted count. Using this predicted count the standard deviation is interpolated from the trend.
4. The inverse squared predicted standard deviation for each observation becomes the weight for that observation.

After this, the RNA-seq data can be analyzed as if it was microarray data. This means for example that any of the linear modelling or gene set testing methods in the limma package can be applied to RNA-seq data. It is usually best to work with voom when you have really noisy data and are dealing with a number of factors. _Note for the design matrix we include an intercept, since we will be using linear models as opposed to GLMs._

```{r, fig.align='center'}

# Create design matrix
design <- model.matrix(~ pData(bottomly.eset)$strain)
  
# Apply voom transformation
nf <- calcNormFactors(bottomly.eset)
v <- voom(exprs(bottomly.eset), design, lib.size=colSums(exprs(bottomly.eset))*nf, 
normalize.method="quantile", plot=TRUE)

# Do same for 5 replicate dataset
design <- model.matrix(~ pData(bottomly.5reps)$strain)
nf <- calcNormFactors(bottomly.5reps)
v.5reps <- voom(exprs(bottomly.5reps), design, lib.size=colSums(exprs(bottomly.5reps))*nf, 
                normalize.method="quantile", plot=TRUE)

# Do same for 2 replicates dataset
design <- model.matrix(~ pData(bottomly.2reps)$strain)
nf <- calcNormFactors(bottomly.2reps)
v.2reps <- voom(exprs(bottomly.2reps), design, lib.size=colSums(exprs(bottomly.2reps))*nf, 
                normalize.method="quantile", plot=TRUE)

```

## DEG comparisons with full dataset
The next step is to identify a list of differentially expressed genes using each of the three tools. We will set a fairly liberal threshold of FDR < 0.05 to select genes as 'siginificantly' differentially expressed between the two mouse strains. 

```{r}
p.threshold <- 0.05
```


### edgeR 
Let's start with edgeR. The DGE object we previously created contains the i) original count data, ii) the sample group information, iii) normalization factors for each sample, and iv) the dispersion estimates. The next step is to take all the available for information for each gene and fit the expression data to the model. We can then make contrasts to specify the comparison we wish to make (in our case this is simple, since we have only two groups) and extract the statistics for our results. **How many genes are differentially expressed between C57BL/6J and DBA/2J?** 

```{r}
## edgeR ##
# Design matrix
design.mat <- model.matrix(~ 0 + dge$samples$group)
colnames(design.mat) <- c("C57BL", "DBA")

# Model fitting
fit.edgeR <- glmFit(dge, design.mat)

# Differential expression
contrasts.edgeR <- makeContrasts(C57BL - DBA, levels=design.mat)
lrt.edgeR <- glmLRT(fit.edgeR, contrast=contrasts.edgeR)

# Access results tables
edgeR_results <- lrt.edgeR$table
sig.edgeR <- decideTestsDGE(lrt.edgeR, adjust.method="BH", p.value = p.threshold)
genes.edgeR <- row.names(edgeR_results)[which(sig.edgeR != 0)]

```

### DESeq2
Similar to edgeR, with DESeq2, we have a DESeq object that was previously created when we ran the `DESeq()` function. The difference is that DESeq is less verbose, requiring fewer lines of code to generate the same information. In one fell swoop, the DESeq functions generates all the required statistics from normalization to coefficients from the model fit. All we are required to do is specify the comparisons we wish to make and extract the relevant data.  **How many genes are differentially expressed between C57BL/6J and DBA/2J?** 

```{r}

## DESeq2 ##
contrast.deseq2 <- list("strainC57BL.6J", "strainDBA.2J")
deseq2_results <- results(dds, contrast=contrast.deseq2)
deseq2_results$threshold <- as.logical(deseq2_results$padj < p.threshold)
genes.deseq <- row.names(deseq2_results)[which(deseq2_results$threshold)]
```


### limma-voom
For limma-voom we are taking our voom transformations and using that matrix as input to normal-based limma functions that we would use on microarray data. **Plot a histogram of your transformed data and you should see that it is normally distributed.** The steps that follow are: i) fitting the linear model given the factor of interest and ii) compute moderated statistics based on empirical Bayes moderation of the standard errors towards a common value.

```{r}

## voom-limma ##
# Create design matrix
design <- model.matrix(~ pData(bottomly.eset)$strain)

# Usual limma pipeline
fit.voom <- lmFit(v, design)
fit.voom <- eBayes(fit.voom)

voom_results <- topTable(fit.voom, coef=2,  adjust="BH", number = nrow(exprs(bottomly.eset)))
voom_results$threshold <- as.logical(voom_results$adj.P.Val < p.threshold)
genes.voom <- row.names(voom_results)[which(voom_results$threshold)]
```

### Overlapping genes

We will use a venn diagram to depict the overlaps. To do this you will be required to install the package `gplots`. 

```{r, eval=FALSE}
install.packages('gplots')
```

Once installed, load the library and use the following code to plot the Venn Diagram of overlapping genes between the three different tools. _Note: there are fancier packages to create this figure!_

```{r}
library(gplots)
venn(list(edgeR = genes.edgeR, DESeq2 = genes.deseq, voom = genes.voom))
```


## DEG comparisons with 2 replicates
Ok now, let's try this with our 2 replicate data, and see how results compare from the three different tools.

### edgeR

```{r}
## edgeR ##
# Design matrix
design.mat <- model.matrix(~ 0 + dge.2reps$samples$group)
colnames(design.mat) <- c("C57BL", "DBA")

# Model fitting
fit.edgeR <- glmFit(dge.2reps, design.mat)

# Differential expression
contrasts.edgeR <- makeContrasts(C57BL - DBA, levels=design.mat)
lrt.edgeR <- glmLRT(fit.edgeR, contrast=contrasts.edgeR)

# Access results tables
edgeR_results_2reps <- lrt.edgeR$table
sig.edgeR.2reps <- decideTestsDGE(lrt.edgeR, adjust.method="BH", p.value = p.threshold)
genes.edgeR.2reps <- row.names(edgeR_results_2reps)[which(sig.edgeR.2reps != 0)]
```

### DESeq2

```{r}

## DESeq2 ##
contrast.deseq2 <- list("strainC57BL.6J", "strainDBA.2J")
deseq2_results_2reps <- results(dds.2rep, contrast=contrast.deseq2)
deseq2_results_2reps$threshold <- as.logical(deseq2_results_2reps$padj < p.threshold)
genes.deseq.2reps <- row.names(deseq2_results_2reps)[which(deseq2_results_2reps$threshold)]
```

### limma-voom
```{r}

## voom-limma ##
# Create design matrix
design <- model.matrix(~ pData(bottomly.2reps)$strain)

# Usual limma pipeline
fit.voom <- lmFit(v.2reps, design)
fit.voom <- eBayes(fit.voom)

voom_results_2reps <- topTable(fit.voom, coef=2,  adjust="BH", number = nrow(exprs(bottomly.2reps)))
voom_results_2reps$threshold <- as.logical(voom_results_2reps$adj.P.Val < p.threshold)
genes.voom.2reps <- row.names(voom_results_2reps)[which(voom_results_2reps$threshold)]

```

### Overlapping genes

How do the number of genes identified using the 2-replicate dataset compared to the results from the full dataset? How does the overlap compare?

```{r, fig.align='center'}
length(genes.deseq.2reps)
length(genes.edgeR.2reps)
length(genes.voom.2reps)

venn(list(edgeR = genes.edgeR.2reps, DESeq2 = genes.deseq.2reps, voom = genes.voom.2reps))
```

You should find that with the 2-replicate dataset, in general each tool identifies a much smaller set of significant genes. Also, not a surprise that our overlap is much smaller since we have fewer genes to begin with (on the range of ~15-30% compared to ~50-80%). 

## How many replicates are sufficient?
In a nicely designed [48-replicate experiment](http://arxiv.org/pdf/1505.00588.pdf) (S. cerevisae; wt and _snf2_ knock-out mutant), researchers sought to answer this question  examined the same three tools used here to see which best represented read-count distribution. When the authors removed 6-8 bad replicates from their pool of 48 samples, their data became consistent with a negative binomial distribution. Assuming experimental variability similar to the authors, this indicates at least 6 replicates in a DGE experiment is good practice. Their findings also favour the approach implemented in edgeR, where variance for one gene is squeezed towards a common dispersion calculated across all genes. *REVISION: it turns out the authors found a bug in the DESeq2 code and both tools perform comparably. This paper revision is underway* 

## Take-Home message
The underlying **read-count distribution for a gene is a fundamental property of RNA-seq data** but without a large number of measurements/replicates it is not possible to identify the form of this distribution unambiguously. **Fewer replicates means the true distribution of read counts for an individual gene is unclear.** Many DGE tools make strong assumptions about the form of this underlying distribution, thus having an unreliable distribution can impact on their ability to correctly identify significantly DE genes.  

# References

```{r writebib, echo=FALSE, message=FALSE, warning=FALSE}
write.bibtex(file="references.bib")
```