forked from YuLab-SMU/biomedical-knowledge-mining-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
14_compareCluster.Rmd
118 lines (72 loc) · 8.12 KB
/
14_compareCluster.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
# Biological theme comparison {#clusterprofiler-comparecluster}
```{r include=FALSE}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, echo=TRUE, cache=TRUE)
library(clusterProfiler)
```
The `r Biocpkg("clusterProfiler")` package was developed for biological theme comparison [@yu2012; @wu_clusterprofiler_2021], and it provides a function, `compareCluster`, to automatically calculate enriched functional profiles of each gene clusters and aggregate the results into a single object. Comparing functional profiles can reveal functional consensus and differences among different experiments and helps in identifying differential functional modules in omics datasets.
## Comparing multiple gene lists
The `compareCluster()` function applies selected function (via the `fun` parameter) to perform enrichment analysis for each gene list.
```{r}
data(gcSample)
str(gcSample)
```
Users can use a named list of gene IDs as the input that passed to the `geneCluster` parameter.
```{r}
ck <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
ck <- setReadable(ck, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
head(ck)
```
## Formula interface of compareCluster
As an alternaitve to using named list, the `compareCluster()` function also supports passing a formula to describe more complicated experimental designs (*e.g.*, $Gene \sim time + treatment$).
```{r}
mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"
formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enrichKEGG")
head(formula_res)
```
## Visualization of functional profile comparison {#vis-compare-profile}
### Dot plot {#compare-dotplot}
We can visualize the result using the `dotplot()` method.
```{r eval=FALSE}
dotplot(ck)
dotplot(formula_res)
```
(ref:ccnscap) Comparing enrichment results of multiple gene lists.
(ref:ccncap) **Comparing enrichment results of multiple gene lists.** (A) Using a named list of gene clusters, the results were displayed as multiple columns with each one represents an enrichment result of a gene cluster. (B) Using formula interface, the columns represent gene clusters defined by the formula.
```{r ccn, dev='png', fig.height=5, fig.width=12, fig.cap="(ref:ccncap)", fig.scap="(ref:ccnscap)", echo=FALSE, fig.keep="last"}
library(stringr)
library(ggplot2)
p1 <- dotplot(ck) + scale_y_discrete(labels=function(x) str_wrap(x, width=50)) + scale_size(range = c(3, 10)) + theme_dose(14)
p2 <- dotplot(formula_res) + scale_y_discrete(labels=function(x) str_wrap(x, width=30)) + scale_size(range = c(5, 15)) + theme_dose(14) + theme(axis.text.x=element_text(angle=30, hjust=1))
cowplot::plot_grid(p1, p2, ncol=2, labels=c("A", "B"))
```
The fomula interface allows more complicated gene cluster definition. In Figure \@ref(fig:ccn)B, the gene clusters were defined by two variables (i.e. `group` that divides genes into `upregulated` and `downregulated` and `othergroup` that divides the genes into two categories of `A` and `B`.). The `dotplot()` function allows us to use one variable to divide the result into different facet and plot the result with other variables in each facet panel (Figure \@ref(fig:ccf)).
(ref:ccfscap) Comparing enrichment results of multiple gene lists defined by multiple variable.
(ref:ccfcap) **Comparing enrichment results of multiple gene lists defined by multiple variable.** The results were visualized as a dot plot with an x-axis representing one level of conditions (the *group* variable) and a facet panel indicating another level of conditions (the *othergroup* variable).
```{r ccf, dev='png', fig.height=3, fig.width=6,fig.cap="(ref:ccfcap)", fig.scap="(ref:ccfscap)", fig.keep="last",crop = TRUE}
dotplot(formula_res, x="group") + facet_grid(~othergroup)
```
By default, only top 5 (most significant) categories of each cluster
was plotted. User can changes the parameter `showCategory` to
specify how many categories of each cluster to be plotted, and if
`showCategory` was set to `NULL`, the whole result will
be plotted. The `showCategory` parameter also allows passing a vector of selected categories to [plot pathway of interests](#showing-specific-pathways)
The `dotplot()` function tries to make the comparison among different clusters more informative and reasonable. After extracting *e.g.* 10 categories for each clusters, `r Biocpkg("clusterProfiler")` tries to collect overlap of these categories among clusters. For example, `term A` is enriched in all the gene clusters (*e.g.*, `g1` and `g2`) and is in the 10 most significant categories of `g1` but not `g2`. `r Biocpkg("clusterProfiler")` will capture this information and include `term A` in `g2` cluster to make the comparison in `dotplot` more reasonable. If users want to ignore these information, they can set `includeAll = FALSE` in `dotplot()`, which is not recommended.
The `dotplot()` function accepts a parameter `size` for setting the scale of dot sizes. The default parameter `size` is setting to `geneRatio`, which corresponding to the `GeneRatio` column of the output. If it was setting to `count`, the comparison will be based on gene counts, while if setting to `rowPercentage`, the dot sizes will be normalized by `count/(sum of each row)`. Users can also map the dot size to other variables or derived variables (see [Chapter 16](#clusterProfiler-dplyr)).
To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when `by` is setting to `rowPercentage` and number of gene clusters in each cluster label (numbers in parentheses) when `by` is setting to `geneRatio`, as shown in Figure \@ref(fig:ccn).
The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding adjusted p-values. Color gradient ranging from red to blue correspond to the order of increasing adjusted p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). Adjusted p-values were filtered out by the threshold giving by the parameter `pvalueCutoff`, and FDR can be estimated by `qvalue`.
### Gene-Concept Network {#compare-cnetplot}
The [cnetplot](#cnetplot) also works with `compareCluster()` result.
(ref:mcnetplotscap) `cnetplot()` for comparing functional profiles of multiple gene clusters.
(ref:mcnetplotcap) **`cnetplot()` for comparing functional profiles of multiple gene clusters.** Genes and functional categories (*i.e.*, pathways) are encoded as pies to distinguish different gene clusters.
```{r mcnetplot, fig.width=12, fig.height=8, fig.cap="(ref:mcnetplotcap)", fig.scap="(ref:mcnetplotscap)"}
cnetplot(ck)
```
## Summary
The comparison function was designed as a framework for comparing gene clusters of any kind of ontology associations, not only [groupGO](#go-classification), [enrichGO](#clusterprofiler-go-ora), [enrichKEGG](#clusterprofiler-kegg-pathway-ora), [enrichMKEGG](#clusterprofiler-kegg-module-ora), [enrichWP](#clusterprofiler-wikipathway-ora) and [enricher](#universal-api) that were provided in this package, but also other biological and biomedical ontologies, including but not limited to [enrichPathway](#reactomepa-ora), [enrichDO](#dose-do-ora), [enrichNCG](#dose-ncg-ora), [enrichDGN](#dose-dgn-ora) and [enrichMeSH](#meshes-ora).
In [@yu2012], we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) [@schmidt2008]. We identified 8 gene clusters from differentially expressed genes, and using the `compareCluster()` function to compare these gene clusters by their enriched biological process. In [@wu_clusterprofiler_2021], we analyzed the GSE8057 dataset which contains expression data from ovarian cancer cells at multiple time points and under two treatment conditions. Eight groups of DEG lists were analyzed simultaneously using `compareCluster()` with WikiPathways. The result indicates that the two drgus have distinct effects at the begining but consistent effects in the later stages (Fig. 4 of [@wu_clusterprofiler_2021]).