forked from YuLab-SMU/biomedical-knowledge-mining-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
06_GO.Rmd
135 lines (85 loc) · 5.57 KB
/
06_GO.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
# GO enrichment analysis {#clusterprofiler-go}
GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological
process (BP), and cellular component (CC).
```{r include=FALSE}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
library(clusterProfiler)
```
## Supported organisms {#clusterProfiler-go-supported-organisms}
GO analyses (`groupGO()`, `enrichGO()` and `gseGO()`) support organisms that have an `OrgDb` object available (see also [session 2.2](#gosemsim-supported-organisms)).
If a user has GO annotation data (in a `data.frame` format with the first column as gene ID and the second column as GO ID), they can use the `enricher()` and `GSEA()` functions to perform an over-representation test and gene set enrichment analysis.
If the genes are annotated by direction annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the `buildGOmap` function, which will infer indirect annotation and generate a `data.frame` that is suitable for both `enricher()` and `GSEA()`.
## GO classification
In `r Biocpkg("clusterProfiler")`, the `groupGO()` function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset `geneList` provided by `r Biocpkg("DOSE")`.
```{r clusterprofiler-groupgo, warning=FALSE}
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
# Entrez gene ID
head(gene)
ggo <- groupGO(gene = gene,
OrgDb = org.Hs.eg.db,
ont = "CC",
level = 3,
readable = TRUE)
head(ggo)
```
The `gene` parameter is a vector of gene IDs (can be any ID type that is supported by the corresponding `OrgDb`, see also [session 16.1](#id-convert)). If `readable` is set to `TRUE`, the input gene IDs will be converted to gene symbols.
## GO over-representation analysis {#clusterprofiler-go-ora}
The `r Biocpkg("clusterProfiler")` package implements `enrichGO()` for gene ontology over-representation test.
```{r}
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
```
Any gene ID type that is supported in `OrgDb` can be directly used in GO analyses. Users need to specify the `keyType` parameter to specify the input gene ID type.
```{r}
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(ego2, 3)
```
Gene IDs can be mapped to gene Symbols by using the parameter `readable=TRUE` or [`setReadable()` function](#setReadable).
## GO Gene Set Enrichment Analysis
The `r Biocpkg("clusterProfiler")` package provides the `gseGO()` function for [gene set enrichment analysis](#gsea-algorithm) using gene ontology.
```{r eval=FALSE}
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
```
The format of input data, `geneList`, was documented in the [FAQ](#genelist). Beware that only gene Set size in `[minGSSize, maxGSSize]` will be tested.
## GO analysis for non-model organisms {#clusterprofiler-go-non-model}
Both the `enrichGO()` and `gseGO()` functions require an `OrgDb` object as the background annotation. For organisms that don't have `OrgDb` provided by `Bioconductor`, users can query one (if available) online via `r Biocpkg("AnnotationHub")`. If there is no `OrgDb` available, users can obtain GO annotation from other sources, e.g. from `r Biocpkg("biomaRt")`, or annotate the genes using [Blast2GO](https://www.blast2go.com/) or the [Trinotate](https://rnabio.org/module-07-trinotate/0007/02/01/Trinotate/) pipeline. Then the `enricher()` or `GSEA()` functions can be used to perform GO analysis for these organisms, similar to the examples using wikiPathways and MSigDB. Another solution is to create an `OrgDb` on your own using `r Biocpkg("AnnotationForge")` package.
Here is an example of querying GO annotation from Ensembl using `r Biocpkg("biomaRt")`.
```{r eval=FALSE}
library(biomaRt)
ensembl <- useEnsemblGenomes(biomart = "plants_mart", dataset = "nattenuata_eg_gene")
gene2go <- getBM(attributes =c("ensembl_gene_id", "go_id"), mart=ensembl)
```
## Visualize enriched GO terms as a directed acyclic graph
The `goplot()` function can accept the output of `enrichGO` and visualize the enriched GO induced graph.
(ref:goplotscap) Goplot of enrichment analysis.
(ref:goplotcap) **Goplot of enrichment analysis.**
```{r goplot, fig.height=12, fig.width=8, fig.cap="(ref:goplotcap)", fig.scap="(ref:goplotscap)"}
goplot(ego)
```
## Summary {#clusterprofiler-go-summary}
GO semantic similarity can be calculated by `r Biocpkg("GOSemSim")` [@yu2010]. We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.