-
Notifications
You must be signed in to change notification settings - Fork 1
/
workshop.Rmd
292 lines (244 loc) · 11.2 KB
/
workshop.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
---
title: "Workshop"
author: I. Moustakas
output:
html_document:
keep_md: true
smart: false
toc: true
toc_float: true
theme: united
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
***
```{r setup, include=FALSE}
knitr::opts_chunk$set(
cache.lazy = FALSE,
tidy = TRUE
)
```
# Setup
## Load the necessary libraries
```{r}
suppressMessages(library(Seurat))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
```
## Load the dataset
By just running the loaded object, we get some basic information about it: it has 57490 features (genes) across 838 samples (cells)
* Note you need to provide the location of `tyser.rds` in your own machine
```{r}
object <- readRDS("/home/imoustakas/Documents/anatomy_workshop/tyser.rds")
object
```
# Analysis tasks
## Explore the object meta data
The [Seurat object](https://rdrr.io/cran/SeuratObject/man/Seurat-class.html) is a representation of single-cell expression data for R. It stores the raw and normalized count table, low dimensional representations of the cells (PCA, UMAP, tSNE), some cell meta data and more. For this workshop we will explore the meta data, which is the most useful
* The cell metadata is basically a table. What do rows and columns represent?
```{r}
head(object[[]], n=10)
```
### UMAP
You have just loaded an already analyzed data set. The cells have been clustered and the their 2d representation has been calculated using UMAP. Simply plot the UMAP to visualize the clusters
* We save the plot as a pdf file using `ggsave`
```{r}
DimPlot(object=object,
reduction = "umap")
# Save the plot as PDF
ggsave("umap.pdf")
```
### Another cell attribute on UMAP
Is any of the meta data columns (cell attributes) of your interest? Plot it on the UMAP. Cells will be then colored according to it
* Look at the documentation of `DimPlot` by typing `?Dimplot` in the console. How can you achieve the above?
```{r}
DimPlot(object=object,
reduction = "umap",
group.by="cluster_id")
```
## Discover marker genes
Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in `ident.1`), compared to all other cells. In this example the DEGs of cluster 5 vs all the rest are calculated, then the first 10 lines of the resulting table are printed
```{r}
markers <- FindMarkers(object=object,
ident.1 = 5,
min.pct = 0.25,
only.pos = TRUE)
head(markers, n=10)
```
Can you tell what do the columns in this table mean? Hint: refer to the function's documentation by typing `?FindMarkers` in your console or simply do a web search
### Save the marker table
Save the table so you can open it with Excel and easily explore the full extend of it
```{r}
write.table(x=markers,
file = "markers.tsv",
col.names = NA)
```
## Visualize the DEGs
### Plot the top 6 DEGs of your cluster
Notice we are altering the figure height and width to accommodate a big plot
```{r, fig.height=12, fig.width=12}
top <- head(rownames(markers), n=6)
FeaturePlot(object=object,
features = top)
```
### Discover the PGCs
Are there Primordial Germ Cells in this dataset? Hint: look for expression of NANOS3
* Bonus task: Look into docs for the argument to plot the cells in order of expression. That way cells with zero expression will not obscure cells with high expression
```{r}
FeaturePlot(object=object,
features = c("NANOS3"))
```
### Plot the top 3 DEGS of your cluster a uisng violin plot
```{r, fig.height=6, fig.width=12}
VlnPlot(object,
features = c("LEFTY2", "RIPPLY2", "DLL3"))
```
Q: Are the marker genes exclusively expressed in your cluster?
## Volcano plot
[Volcano plots](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/tutorial.html) are commonly used to display the results of RNA-seq or other omics experiments. A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top.
### Make a volcano plot between your cluster and cluster 13
We will need to load two more R packages and write a small script to make a volcano plot, as there is no volcano plot available in Seurat.
```{r}
library(ggplot2)
library(ggrepel)
## Calculate the DEGs between clusters 5 and 13
deTab <- FindMarkers(object = object,
ident.1 = 5,
ident.2 = 13,
min.pct = 0.25)
# Add a column to to the table store whether a gene is Up-regulated, Down-regulated or the change is Non-significant
# The levels of average log fold change and he adjusted P-value are used to decide on the above
deTab$diffexpressed <- "Non-significant"
deTab$diffexpressed[deTab$avg_log2FC > 1 & deTab$p_val_adj < 0.05] <- "Up"
deTab$diffexpressed[deTab$avg_log2FC < -1 & deTab$p_val_adj < 0.05] <- "Down"
# In the
deTab$delabel <- NA
deTab$delabel[deTab$p_val_adj<0.05] <- rownames(deTab)[deTab$p_val_adj<0.05]
ggplot(data=deTab, aes(x=avg_log2FC, y=-log10(p_val_adj), col=diffexpressed, label=delabel)) +
ggtitle("Volcano Plot") +
geom_point() +
theme_minimal() +
geom_text_repel(max.overlaps = 10) +
scale_color_manual(values=c("blue", "gray", "red")) +
geom_vline(xintercept=c(-1, 1), col="red") +
geom_hline(yintercept=-log10(0.05), col="red")
```
## Heatmap
### Markers for all clustes
If you are out of gene set ideas to plot on heatmap, calculate the markers for all clusters and select the top 2 for each of them
```{r}
markers <- FindAllMarkers(object=object,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25)
top <- markers %>%
group_by(cluster) %>%
top_n(n = 2, wt = avg_log2FC)
head(top)
```
### Filter and save the markers
We like to filter the results so we keep genes with `pct.1>0.6` and `p_val_adj<0.05`. Note you can do the same filtering by just sorting the corresponding columns in the all_markers table
```{r}
write.table(x = markers,
file = "makrers.tsv",
sep = "\t",
quote = FALSE,
row.names=FALSE)
filtered <- markers %>%
filter(pct.1>0.6, p_val_adj<0.05)
write.table(x = filtered,
file = "filtered_makrers.tsv",
sep = "\t",
quote = FALSE,
row.names=FALSE)
```
### Plot the set of genes on heatmap using Seurat function DoHeatmap
In this case the top 2 genes for each cluster (so 2*13=26 in total) are used
```{r}
DoHeatmap(object=object,
features=top$gene,
slot='data')
```
### Plot the set of genes on heatmap
The build-in function from Seurat leaves some things to be desired. First, it plots every cell individually, which clutters the plot with unnecessary info about each cell. Taking the average for each cluster would render a cleaner output. Second, it is difficult to further customize. For that reason we will write a few lines of code and create our heatmap
* We will use `heatmap.2` function from library `gplots`
```{r, fig.height=6, fig.width=6}
suppressMessages(library(gplots))
## Calculate the per cluster mean expression
groupIdentity <- Idents(object=object)
rawCounts <- GetAssayData(object=object, layer="counts")#, slot='scale.data')
## Substiture cell ID with cluster ID
colnames(rawCounts) <- groupIdentity
meanDF <- do.call(cbind, lapply(levels(groupIdentity), function(id){
groupCounts <- rawCounts[, colnames(rawCounts) == id]
df <- data.frame( c = apply(groupCounts, 1, mean))
df <- log(df+1)
colnames(df) <- id
return(df)
}))
meanDF_select <- as.matrix(meanDF[top$gene, ])
heatmap.2(meanDF_select,
col="bluered",
trace = "none")
```
### Plot again without the dendrogram for the genes (rows)
* Bonus task: Can you plot plot the rows but not columns dendrogram
```{r, fig.height=6, fig.width=6}
heatmap.2(meanDF_select,
col="bluered",
trace = "none",
dendrogram='column')
```
### Keep the original order for the genes in your heatmap
You may want to keep the order of the genes as provided, while still reorder the clusters
```{r, fig.height=6, fig.width=6}
heatmap.2(meanDF_select,
col="bluered",
trace = "none",
dendrogram='column',
Rowv=FALSE,
Colv=TRUE)
```
### Z-score values heatmap
In statistics, the [standard score](https://en.wikipedia.org/wiki/Standard_score), or z-score, is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.
* If `scale="row"` (or `scale="col"`) the rows (columns) are scaled to have mean zero and standard deviation one. There is some empirical evidence from genomic plotting that this is useful.
```{r, fig.height=6, fig.width=6}
heatmap.2(meanDF_select,
col="bluered",
trace='none',
scale='row')
```
Q: What are the implications of applying scaling? How do you interpret the heatmap now?
## Functional enrichment analysis
Performs functional enrichment analysis, also known as over-representation analysis (ORA) or gene set enrichment analysis, on input gene list. It maps genes to known functional information sources and detects statistically significantly enriched terms.
### Enrichment analysis on DEGs of our cluster
DEGs table is already sorted on p_val_adj. Use the top 100 genes to perform the analysis using gprofiler2
```{r}
suppressMessages(library(gprofiler2))
top_genes_my_cluster <- markers %>%
filter(cluster==5) %>%
select(gene) %>%
head(n=100) %>%
rownames()
gost_results <- gost(query = top_genes_my_cluster,
organism = "hsapiens",
ordered_query = TRUE,
user_threshold = 0.05,
domain_scope = "annotated",
sources = c('GO:BP', 'GO:MF', 'GO:CC', 'KEGG', 'REAC', 'TF'))
## Drop parents column
terms_table <- gost_results$result %>% dplyr::select(-parents)
## Save in file
write.table(x = terms_table,
file = "enrichment_analysis.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE)
```
### Enrichment analysis using the web service of gProfiler
You can easily perform an enrichment analysis on your favorite list of genes using the [g:Profiler web server](https://biit.cs.ut.ee/gprofiler/gost).
## Well Done!
### Session Info
For future reference, it is important to know the verions of all R packages used in your analysis
```{r}
sessionInfo()
```