seurat_scRNAseq.Rmd

---
title: "single cell RNA-seq workflow"
output: rmarkdown::github_document
---

After running through the 10X cellranger to count and aggregate the data. Create the Seurat Object by loading the aggr file that contains the the matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv files.

Below is an example of single cell RNA-seq data obtained from [Ortmann et. al, 2020](https://www.sciencedirect.com/science/article/pii/S193459092030357X). The study has 4 replicates for each time-point but one sample from each time point (shown below) was taken to run through cellranger to generate the aggr matrix file which was used in Seurat.

| File_1 | File_2  | Treatment | Timepoint |
|  :---:        |     :---:      |     :---:      |     :---:      |
| SIGAF1_S4_L001_R1_001.fastq.gz   | SIGAF1_S4_L001_R2_001.fastq.gz   | 2iLif     | Day 0    |
| SIGAG1_S6_L001_R1_001.fastq.gz   | SIGAG1_S6_L001_R2_001.fastq.gz   | N2B27     | Day 1    |
| SIGAG9_S12_L001_R1_001.fastq.gz   | SIGAG9_S12_L001_R2_001.fastq.gz   | N2B27     | Day 2    |

```{r}
library(dplyr)
library(Seurat)
library(patchwork)
library(ggplot2)
library(sctransform)

data <- Read10X(data.dir = "C:/Users/willd/OneDrive/Desktop/scRNA_seq/experiment/outs/count/filtered_feature_bc_matrix/")

mesc <- CreateSeuratObject(counts = data, project = "mESC_diff", min.cells = 3, min.features = 200)

```

## Preprocessing workflow
The below steps will analyze for quality, filter out low-quality cells, empty droplets and doublets then normalize.

The QC metrics are stored in the seurat object. We will add the percentage of reads that map to the mitochondrial genome to a column in the meta.data.
```{r}
mesc[["percent.mt"]] <- PercentageFeatureSet(mesc, pattern = "^mt-")

#Show QC metrics for the first 5 cells
head(mesc@meta.data, 5)
```

Observe the QC metrics to filter the cells
```{r}
# Visualize QC metrics as a violin plot
VlnPlot(mesc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```

We can plot the relationship of read counts (nCount_RNA) vs the number of genes and percent of mitochondrial reads.
Cells with a high percent of mitochondiral reads should be filtered out as they represent cells going through apoptosis or lysing.
```{r fig.height=8, fig.width=12}
# We can observe the relationship of the features and read counts
plot1 <- FeatureScatter(mesc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2 <- FeatureScatter(mesc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1 + plot2
```

From the above QC plots we will:
- filter out cells that have unique feature counts over 3,500 or less than 200
- filter out cells with > 5% of mitochondrial counts. 


```{r}
# keep cells with features between 200-3500.
mesc <- subset(mesc, subset = nFeature_RNA > 200 & nFeature_RNA < 3500 & percent.mt < 5)
```

### Normalizing the data
The latest normalization implemented by `SCTransform()`, pools the information across genes with similar abundances and perform a regularized negative binomial regression. This preserves the biological heterogeneity. 

The `SCTransform()` also can remove confounding sources of variation such as mitochondrial mapping percentage. It will also return the highly variable genes used for PCA and UMAP analysis.
```{r}
mesc <- SCTransform(mesc, vars.to.regress = "percent.mt")
```


## Perform linear dimensional reduction
PCA is performed on the scaled data.

```{r}
mesc <- RunPCA(mesc, features = VariableFeatures(object = mesc))
```

```{r}
mesc <- FindNeighbors(mesc, dims = 1:30)
mesc <- FindClusters(mesc, resolution = 0.5)
```

## Perform non-linear dimensional reduction (UMAP/tSNE)

The best way to confirm the clustering you observe in a UMAP/tSNE is to run multiple clustering tools and observe similar clustering and gene features within those clusters. 

```{r echo=FALSE}
# load reticulate to interface with python and make sure you download umap-learn into the python environment. If you haven't downloaded it run the below conda_install line. You need this to run the UMAP function.

library(reticulate)
use_virtualenv("bioinfo")
#conda_install(envname='bioinfo', packages='umap-learn')
```

```{r message=FALSE}
mesc <- RunUMAP(mesc, dims = 1:30, umap.method = 'umap-learn', metric = 'correlation' )
```

```{r echo=TRUE}
DimPlot(mesc, reduction = "umap", label = TRUE)
```

You can access the corrected UMI counts from `mesc[["SCT"]]@counts`. The log-normalized counts are in `mesc[["SCT"]]@data`.


## Finding differentially expressed genes (cluster biomarkers)

We can find markers within a cluster that are different (highly variable) compared to the other clusters. 
I have chosen to use  MAST (Model-based Analysis of Single-cell Transcriptomics) which has been shown in [Soneson & Robinson, 2018](https://www.nature.com/articles/nmeth.4612) to be one of the methods that performed well for DE in scRNA-seq.

Below is a table of differentially expressed genes in cluster 4 compared to all other cells.

```{r echo=FALSE, message=FALSE}
# find all markers of cluster 1, this compares the cells in cluster 1 versus all other cells.
cluster1.markers <- FindMarkers(mesc, ident.1 = 1, min.pct = 0.25, test.use = "MAST")
cluster0.markers <- FindMarkers(mesc, ident.1 = 0, min.pct = 0.25, test.use = "MAST")
cluster2.markers <- FindMarkers(mesc, ident.1 = 2, min.pct = 0.25, test.use = "MAST")
cluster3.markers <- FindMarkers(mesc, ident.1 = 3, min.pct = 0.25, test.use = "MAST")
cluster4.markers <- FindMarkers(mesc, ident.1 = 4, min.pct = 0.25, test.use = "MAST")
cluster5.markers <- FindMarkers(mesc, ident.1 = 5, min.pct = 0.25, test.use = "MAST")
cluster6.markers <- FindMarkers(mesc, ident.1 = 6, min.pct = 0.25, test.use = "MAST")
cluster7.markers <- FindMarkers(mesc, ident.1 = 7, min.pct = 0.25, test.use = "MAST")

head(arrange(cluster4.markers, desc(avg_logFC)), n=15)
```

To compare only between two clusters, cluster 4 and cluster 9:
```{r echo=FALSE}
# compare between cluster 4 and 9
cluster4_9.markers <- FindMarkers(mesc, ident.1 = 4, ident.2 = 9, min.pct = 0.25)
head(arrange(cluster4_9.markers, desc(avg_logFC)), n=20)
```

## Visualization 
You can observe specific genes within the clusters. If certain cell markers are known, such as pluripotency, then you can view these genes and see how they can be found across the different clusters. This can be used to identify the clusters as certain cell types.
```{r fig.height=8, fig.width=8}
VlnPlot(mesc, features = c("Nanog", "Klf4", "Tbx3", "Sox1", "Dnmt3b", "Otx2"), 
    pt.size = 0.2, ncol = 3)
```


```{r fig.height=8, fig.width=8}
# Visualize canonical marker genes on the sctransform embedding.
FeaturePlot(mesc, features = c("Nanog", "Klf4", "Tbx3", "Dnmt3b"), pt.size = 0.2, 
    ncol =2)
```

Plot a heatmap of the top 20 markers for each cluster. This helps to compare the gene expression over multiple clusters at once.

```{r fig.height=20, fig.width=8, message=FALSE, warning=FALSE}
# find markers for every cluster compared to all remaining cells, report only the positive ones
mesc.markers <- FindAllMarkers(mesc, test.use = "MAST")
top20 <- mesc.markers %>% group_by(cluster) %>% top_n(n = 20, wt = avg_logFC)
DoHeatmap(mesc, features = top20$gene) #+ NoLegend()
```

### Assigning cell type identiy to the clusters

Based on canonical markers or differentially expressed markers identified from above we can assign IDs to the clusters

```{r}
new.cluster.ids <- c("0","1", "2", "Early Epiblast-like", "mESC", "5", "Early Epiblast-like", "mESC", "8", "Epiblast-like", "10", "11", "12")
names(new.cluster.ids) <- levels(mesc)
mesc <- RenameIdents(mesc, new.cluster.ids)
DimPlot(mesc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
```