06_GO.Rmd

# GO enrichment analysis {#clusterprofiler-go}


GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological
process (BP), and cellular component (CC).


```{r include=FALSE}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
library(clusterProfiler)
```


## Supported organisms {#clusterProfiler-go-supported-organisms}

GO analyses (`groupGO()`, `enrichGO()` and `gseGO()`) support organisms that have an `OrgDb` object available (see also [session 2.2](#gosemsim-supported-organisms)).


If a user has GO annotation data (in a `data.frame` format with the first column as gene ID and the second column as GO ID), they can use the `enricher()` and `GSEA()` functions to perform an over-representation test and gene set enrichment analysis.

If the genes are annotated by direction annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the `buildGOmap` function, which will infer indirect annotation and generate a `data.frame` that is suitable for both `enricher()` and `GSEA()`.

## GO classification

In `r Biocpkg("clusterProfiler")`, the `groupGO()` function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset `geneList` provided by `r Biocpkg("DOSE")`. 

```{r clusterprofiler-groupgo, warning=FALSE}
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

# Entrez gene ID
head(gene)

ggo <- groupGO(gene     = gene,
               OrgDb    = org.Hs.eg.db,
               ont      = "CC",
               level    = 3,
               readable = TRUE)

head(ggo)
```

The `gene` parameter is a vector of gene IDs (can be any ID type that is supported by the corresponding `OrgDb`, see also [session 16.1](#id-convert)). If `readable` is set to `TRUE`, the input gene IDs will be converted to gene symbols.


## GO over-representation analysis {#clusterprofiler-go-ora}


The `r Biocpkg("clusterProfiler")` package implements `enrichGO()` for gene ontology over-representation test.


```{r}
ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
		readable      = TRUE)
head(ego)
```

Any gene ID type that is supported in `OrgDb` can be directly used in GO analyses. Users need to specify the `keyType` parameter to specify the input gene ID type.

```{r}
gene.df <- bitr(gene, fromType = "ENTREZID",
	   	toType = c("ENSEMBL", "SYMBOL"),
		OrgDb = org.Hs.eg.db)

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
                keyType       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
head(ego2, 3)                
```

Gene IDs can be mapped to gene Symbols by using the parameter `readable=TRUE` or [`setReadable()` function](#setReadable).


## GO Gene Set Enrichment Analysis

The `r Biocpkg("clusterProfiler")` package provides the `gseGO()` function for [gene set enrichment analysis](#gsea-algorithm) using gene ontology.

```{r eval=FALSE}
ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)
```

The format of input data, `geneList`, was documented in the [FAQ](#genelist). Beware that only gene Set size in `[minGSSize, maxGSSize]` will be tested.


## GO analysis for non-model organisms {#clusterprofiler-go-non-model}

Both the `enrichGO()` and `gseGO()` functions require an `OrgDb` object as the  background annotation. For organisms that don't have `OrgDb` provided by `Bioconductor`, users can query one (if available) online via `r Biocpkg("AnnotationHub")`. If there is no `OrgDb` available, users can obtain GO annotation from other sources, e.g. from `r Biocpkg("biomaRt")`, or annotate the genes using [Blast2GO](https://www.blast2go.com/) or the [Trinotate](https://rnabio.org/module-07-trinotate/0007/02/01/Trinotate/) pipeline. Then the `enricher()` or `GSEA()` functions can be used to perform GO analysis for these organisms, similar to the examples using wikiPathways and MSigDB. Another solution is to create an `OrgDb` on your own using `r Biocpkg("AnnotationForge")` package.


Here is an example of querying GO annotation from Ensembl using `r Biocpkg("biomaRt")`. 


```{r eval=FALSE}
library(biomaRt)
ensembl <- useEnsemblGenomes(biomart = "plants_mart", dataset = "nattenuata_eg_gene")
gene2go <- getBM(attributes =c("ensembl_gene_id", "go_id"), mart=ensembl)
```

## Visualize enriched GO terms as a directed acyclic graph

The `goplot()` function can accept the output of `enrichGO` and visualize the enriched GO induced graph.

(ref:goplotscap) Goplot of enrichment analysis.

(ref:goplotcap) **Goplot of enrichment analysis.**

```{r goplot, fig.height=12, fig.width=8, fig.cap="(ref:goplotcap)", fig.scap="(ref:goplotscap)"}
goplot(ego)
```


## Summary {#clusterprofiler-go-summary}

GO semantic similarity can be calculated by `r Biocpkg("GOSemSim")` [@yu2010]. We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.