forked from YuLab-SMU/biomedical-knowledge-mining-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04_meshes_Semantic_Similarity.Rmd
116 lines (84 loc) · 4.79 KB
/
04_meshes_Semantic_Similarity.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# MeSH semantic similarity analysis {#meshes-semantic-similarity}
```{r meshes-geneSim, include=FALSE}
geneSim <- meshes::geneSim
```
MeSH (Medical Subject Headings) is the NLM (U.S. National Library of
Medicine) controlled vocabulary used to manually index articles for
MEDLINE/PubMed. MeSH is a comprehensive life science vocabulary. MeSH has
19 categories and `MeSH.db` contains 16 of them. That is:
|Abbreviation |Category |
|:------------|:---------------------------------------------------------------|
|A |Anatomy |
|B |Organisms |
|C |Diseases |
|D |Chemicals and Drugs |
|E |Analytical, Diagnostic and Therapeutic Techniques and Equipment |
|F |Psychiatry and Psychology |
|G |Phenomena and Processes |
|H |Disciplines and Occupations |
|I |Anthropology, Education, Sociology and Social Phenomena |
|J |Technology and Food and Beverages |
|K |Humanities |
|L |Information Science |
|M |Persons |
|N |Health Care |
|V |Publication Type |
|Z |Geographical Locations |
MeSH terms were associated with Entrez Gene ID by three methods,
`gendoo`, `gene2pubmed` and `RBBH` (Reciprocal Blast Best Hit).
|Method|Way of corresponding Entrez Gene IDs and MeSH IDs|
|------|-------------------------------------------------|
|Gendoo|Text-mining|
|gene2pubmed|Manual curation by NCBI teams|
|RBBH|sequence homology with BLASTP search (E-value<10<sup>-50</sup>)|
## Supported organisms {#meshes-supported-organisms}
The `r Biocpkg("meshes")` package [@yu_meshes_2018] relies on `MeSHDb` to prepare semantic data for measuring simiarlity. `MeSHDb` can be downloaded from `r Biocpkg("AnnotationHub")` (see also `r Biocpkg("AHMeSHDbs")`) and about 200 species are available and are supported by the `r Biocpkg("meshes")` package.
First, we need to load/fetch species-specific MeSH annotation database:
```{r meshdb, warning=FALSE, cache=TRUE, eval=FALSE}
#############################
## BioC 2.14 to BioC 3.13 ##
#############################
##
## library(MeSH.Hsa.eg.db)
## db <- MeSH.Hsa.eg.db
##
##---------------------------
# From BioC 3.14 (Nov. 2021, with R-4.2.0)
library(AnnotationHub)
library(MeSHDbi)
ah <- AnnotationHub(localHub=TRUE)
hsa <- query(ah, c("MeSHDb", "Homo sapiens"))
file_hsa <- hsa[[1]]
db <- MeSHDbi::MeSHDb(file_hsa)
```
The semantic data can be prepared by the `meshdata()` function:
```{r meshdata, cache=TRUE, eval=FALSE}
library(meshes)
hsamd <- meshdata(db, category='A', computeIC=T, database="gendoo")
## you may want to save the result for future usage
#
# save(hsamd, file = "hsamd.rda")
#
```
```{r echo=FALSE}
## saveRDS(hsamd, file="cache/hsamd.rds")
hsamd <- readRDS("cache/hsamd.rds")
```
## MeSH semantic similarity measurement {#meshes-semantic-simiarlity}
The `r Biocpkg("meshes")` package [@yu_meshes_2018] implemented four IC-based methods (i.e. Resnik [@philip_semantic_1999], Jiang [@jiang_semantic_1997],
Lin [@lin_information-theoretic_1998] and Schlicker [@schlicker_new_2006]) and one graph-structure based method (i.e. Wang [@wang_new_2007]), to measure MeSH term semantic similarity. For algorithm details, please refer to [Chapter 1](#semantic-similarity-overview).
The `meshSim()` function is designed to measure semantic similarity between two MeSH term vectors.
```{r meshSim, cache=TRUE}
library(meshes)
meshSim("D000009", "D009130", semData=hsamd, measure="Resnik")
meshSim("D000009", "D009130", semData=hsamd, measure="Rel")
meshSim("D000009", "D009130", semData=hsamd, measure="Jiang")
meshSim("D000009", "D009130", semData=hsamd, measure="Wang")
meshSim(c("D001369", "D002462"), c("D017629", "D002890", "D008928"), semData=hsamd, measure="Wang")
```
## Gene semantic similarity measurement {#gene-meshes-semantic-similarity}
The `geneSim()` function is designed to measure semantic similarity among two gene vectors.
```{r meshes-geneSim2, cache=TRUE}
geneSim("241", "251", semData=hsamd, measure="Wang", combine="BMA")
geneSim(c("241", "251"), c("835", "5261","241", "994"), semData=hsamd, measure="Wang", combine="BMA")
```