03_DOSE_Semantic_Similarity.Rmd

# DO semantic similarity analysis {#DOSE-semantic-similarity}


```{r include=FALSE}
geneSim <- DOSE::geneSim
clusterSim <- DOSE::clusterSim
mclusterSim <- DOSE::mclusterSim
```


Public health is an important driving force behind biological and medical research.
A major challenge of the post-genomic era is bridging the gap between fundamental biological research and its clinical applications.
Recent research has increasingly demonstrated that many seemingly dissimilar diseases have common molecular mechanisms.
Understanding similarities among disease aids in early diagnosis and new drug development.


Formal knowledge representation of gene-disease association is demanded for this purpose.
Ontologies, such as Gene Ontology (GO), have been successfully applied to represent biological knowledge, and many related techniques have been adopted to extract information.
Disease Ontology (DO) [@schriml_disease_2011] was developed to create a consistent description of gene products with disease perspectives, and is essential for supporting functional genomics in disease context.
Accurate disease descriptions can discover new relationships between genes and disease, and new functions for previous uncharacteried genes and alleles.


Disease Ontology (DO) [@schriml_disease_2011] aims to provide an open source ontology for the integration of biomedical data that is associated with human disease. Unlike other clinical vocabularies that defined disease related concepts disparately, DO is organized as a directed acyclic graph, laying the foundation for quantitative computation of disease knowledge.

We developed an `R` package `r Biocpkg("DOSE")` [@yu_dose_2015] for analyzing semantic similarities among DO terms and gene products annotated with DO terms.


## DO term semantic similarity measurement {#do-semantic-similarity}

Four methods determine the semantic similarity of two terms based on the Information Content of their common ancestor term were proposed by Resnik [@philip_semantic_1999], Jiang [@jiang_semantic_1997], Lin [@lin_information-theoretic_1998] and Schlicker [@schlicker_new_2006]. Wang [@wang_new_2007] presented a method to measure the similarity based on the graph structure. Each of these methods has its own advantage and weakness. `r Biocpkg("DOSE")` supports all these methods to compute semantic similarity among DO terms and gene products. For algorithm details, please refer to [chapter 1](#semantic-similarity-overview).


In `r Biocpkg("DOSE")`, we implemented `doSim` for calculating semantic similarity between two DO terms and two set of DO terms.

```{r}
library(DOSE)

a <- c("DOID:14095", "DOID:5844", "DOID:2044", "DOID:8432", "DOID:9146",
       "DOID:10588", "DOID:3209", "DOID:848", "DOID:3341", "DOID:252")
b <- c("DOID:9409", "DOID:2491", "DOID:4467", "DOID:3498", "DOID:11256")
doSim(a[1], b[1], measure="Wang")
doSim(a[1], b[1], measure="Resnik")
doSim(a[1], b[1], measure="Lin")
s <- doSim(a, b, measure="Wang")
s
```

The `doSim` function requires three parameter `DOID1`, `DOID2` and `measure`. `DOID1` and `DOID2` should be a vector of DO terms, while `measure` should be one of _Resnik_, _Jiang_, _Lin_, _Rel_, and _Wang_.


We also implement a plot function `simplot()` to visualize the similarity result as demonstrated in Figure \@ref(fig:dose-simplot).


(ref:doseSimplotscap) Similarity matrix visualization.

(ref:doseSimplotcap) **Similarity matrix visualization**. The `simplot()` function visualize similarity matrix as a heatmap.


```{r dose-simplot, fig.width=5, fig.height=4,  message=FALSE, fig.cap="(ref:doseSimplotcap)", fig.scap="(ref:doseSimplotscap)", out.extra='', warning=FALSE}

simplot(s,
        color.low="white", color.high="red",
        labs=TRUE, digits=2, labs.size=5,
        font.size=14, xlab="", ylab="")
```

Parameter `color.low` and `colow.high` are used to setting the color gradient; `labs` is a logical parameter indicating whether to show the similarity values or not, `digits` to indicate the number of decimal places to be used and `labs.size` control the font size of similarity values; `font.size` setting the font size of axis and label of the coordinate system.


## Gene semantic similarity measurement {#gene-do-semantic-similarity}

On the basis of semantic similarity between DO terms, `r Biocpkg("DOSE")` can also compute semantic similarity among gene products. `r Biocpkg("DOSE")` provides four methods which called `max`, `avg`, `rcmax` and `BMA` to combine semantic similarity scores of multiple DO terms (as described in [session 1.3](combine-methods)). The similarities among genes and gene clusters which annotated by multiple DO terms were also calculated by these combine methods. 


`r Biocpkg("DOSE")` implemented `geneSim` to measure semantic similarities among genes.

```{r warning=FALSE}
g1 <- c("84842", "2524", "10590", "3070", "91746")
g2 <- c("84289", "6045", "56999", "9869")

DOSE::geneSim(g1[1], g2[1], measure="Wang", combine="BMA")
gs <- DOSE::geneSim(g1, g2, measure="Wang", combine="BMA")
gs
```

The `geneSim` requires four parameter `geneID1`, `geneID2`, `measure` and `combine`. `geneID1` and `geneID2` should be a vector of entrez gene IDs; `measure` should be one of _Resnik_, _Jiang_, _Lin_, _Rel_, and _Wang_, while `combine` should be one of _max_, _avg_, _rcmax_ and _BMA_.


## Gene cluster semantic similarity measurement {#gene-cluster-do-semantic-similarity}

`r Biocpkg("DOSE")` also implemented `clusterSim` for calculating semantic similarity between two gene clusters and `mclusterSim` for calculating semantic similarities among multiple gene clusters.

```{r}
DOSE::clusterSim(g1, g2, measure="Wang", combine="BMA")
```

```{r}
g3 <- c("57491", "6296", "51438", "5504", "27319", "1643")
clusters <- list(a=g1, b=g2, c=g3)
DOSE::mclusterSim(clusters, measure="Wang", combine="BMA")
```

<!--

# GO semantic similarity calculation

GO Semantic similarity can be calculated by `r Biocpkg("GOSemSim")`[@yu2010].

We have developed another package `r Biocpkg("GOSemSim")`[@yu2010] to explore the functional similarity at GO perspective, including molecular function (MF), biological process (BP) and cellular component (CC).


# MeSH semantic analysis

MeSH (Medical Subject Headings) is the NLM controlled vocabulary used
to manually index articles for MEDLINE/PubMed. `r Biocpkg("meshes")`
supports enrichment (hypergeometric test and GSEA) and semantic
similarity analyses for more than 70 species.

-->