Merge branch 'fix_all_scores_for_token' of https://github.com/mauryal…
…and/flair into fix_all_scores_for_token
Showing 14 changed files with 750 additions and 137 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
# HunFlair2 Tutorial 4: Customizing linking models | ||
|
||
In this tutorial you'll find information on how to customize the entity linking models according to your needs. | ||
As of now, fine-tuning the models is not supported. | ||
|
||
## Customize dictionary | ||
|
||
All linking models come with a pre-defined pairing of entity type and dictionary, | ||
e.g. "Disease" mentions are linked by default to the [CTD Diseases](https://ctdbase.org/help/diseaseDetailHelp.jsp) dictionary. | ||
You can change the dictionary to which mentions are linked by following the steps below. | ||
We'll use the [Human Phenotype Ontology](https://hpo.jax.org/app/) in our example | ||
(download the `hp.json` file from [here](https://hpo.jax.org/app/data/ontology) if you want to follow along). | ||
|
||
First, we build a Python dictionary that maps names to concept identifiers from the original data: | ||
|
||
```python | ||
import json | ||
from collections import defaultdict | ||
with open("hp.json") as fp: | ||
    data = json.load(fp) | ||
|
||
nodes = [n for n in data['graphs'][0]['nodes'] if n.get('type') == 'CLASS'] | ||
hpo = defaultdict(list) | ||
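for node in nodes: | ||
    # Strip the OBO prefix so that only the identifier remains | ||
    concept_id = node['id'].replace('http://purl.obolibrary.org/obo/', '') | ||
    # Collect the preferred label together with all synonyms | ||
    names = [node['lbl']] + [s['val'] for s in node.get('synonym', [])] | ||
    for name in names: | ||
        hpo[name].append(concept_id) | ||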
``` | ||
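To see what the resulting mapping looks like, here is a minimal, self-contained sketch that runs the same loop on a tiny hand-written sample instead of the real `hp.json`. The sample nodes, their `EX_` identifiers, and their field layout are illustrative assumptions that simply mirror the keys the code above reads (`type`, `id`, `lbl`, `synonym`); they are not an excerpt from an actual HPO release.

```python
from collections import defaultdict

# Hand-written sample (assumption): mirrors the keys read by the tutorial loop.
sample = {
    "graphs": [
        {
            "nodes": [
                {
                    "id": "http://purl.obolibrary.org/obo/EX_0000001",
                    "lbl": "Example phenotype",
                    "type": "CLASS",
                    "synonym": [{"val": "Example synonym"}],
                },
                {
                    "id": "http://purl.obolibrary.org/obo/EX_0000002",
                    "lbl": "Another phenotype",
                    "type": "CLASS",
                    # Shares a synonym with the first node on purpose.
                    "synonym": [{"val": "Example synonym"}],
                },
                # Nodes that are not classes are skipped.
                {"id": "http://example.org/prop", "lbl": "a property", "type": "PROPERTY"},
            ]
        }
    ]
}

prefix = "http://purl.obolibrary.org/obo/"
name_to_ids = defaultdict(list)
for node in sample["graphs"][0]["nodes"]:
    if node.get("type") != "CLASS":
        continue
    concept_id = node["id"].replace(prefix, "")
    names = [node["lbl"]] + [s["val"] for s in node.get("synonym", [])]
    for name in names:
        name_to_ids[name].append(concept_id)

print(dict(name_to_ids))
```

A name shared by several concepts ends up with more than one identifier, which is why the next step keeps the first identifier as the primary one and passes the rest as additional identifiers.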
|
||
Then we can convert this mapping into an [`InMemoryEntityLinkingDictionary`](#flair.datasets.entity_linking.InMemoryEntityLinkingDictionary) that can be used by our linking model: | ||
|
||
```python | ||
from flair.datasets.entity_linking import ( | ||
    InMemoryEntityLinkingDictionary, | ||
    EntityCandidate, | ||
) | ||
|
||
database_name = "HPO" | ||
|
||
candidates = [ | ||
    EntityCandidate( | ||
        concept_id=ids[0], | ||
        concept_name=name, | ||
        additional_ids=ids[1:], | ||
        database_name=database_name, | ||
    ) | ||
    for name, ids in hpo.items() | ||
] | ||
|
||
dictionary = InMemoryEntityLinkingDictionary( | ||
    candidates=candidates, dataset_name=database_name | ||
) | ||
``` | ||
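If a name can list the same identifier more than once (for example, when a label also appears among the synonyms of the same concept), the primary identifier would be repeated among the additional ones. Whether this happens depends on the source data, so the following is only an optional precaution and not something the library requires. The helper `split_ids` and the toy mapping are hypothetical; the mapping has the same shape as `hpo` above.

```python
# Toy mapping (assumption): same shape as the name-to-identifiers mapping above.
toy_mapping = {
    "Name A": ["EX_1", "EX_1", "EX_2"],  # duplicate identifier
    "Name B": ["EX_3"],
}


def split_ids(ids):
    """Return (primary_id, additional_ids) with duplicates removed, order preserved."""
    unique = list(dict.fromkeys(ids))
    return unique[0], unique[1:]


for name, ids in toy_mapping.items():
    primary, additional = split_ids(ids)
    print(f"{name}: primary={primary}, additional={additional}")
```

The two returned values correspond to what the tutorial passes as `concept_id` and `additional_ids`.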
|
||
To use this dictionary you need to initialize a new linker model with it. | ||
See the section below for that. | ||
|
||
## Custom pre-trained model | ||
|
||
You can initialize a new [`EntityMentionLinker`](#flair.models.EntityMentionLinker) with both a custom model and custom dictionary (see section above) like this: | ||
|
||
```python | ||
from flair.models import EntityMentionLinker | ||
pretrained_model = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext" | ||
linker = EntityMentionLinker.build( | ||
    pretrained_model, | ||
    dictionary=dictionary, | ||
    hybrid_search=False, | ||
    entity_type="disease", | ||
) | ||
``` | ||
|
||
Omitting the `dictionary` parameter will load the default dictionary for the specified `entity_type`. | ||
|
||
## Customizing Prediction Labels | ||
|
||
In the default setup, all linker models write their predictions to the same annotation category, *link*. | ||
To record the named entity normalization (NEN) annotations in separate categories, you can use the `pred_label_type` parameter of the | ||
[`predict()`](#flair.models.EntityMentionLinker.predict) method: | ||
|
||
```python | ||
gene_linker.predict(sentence, pred_label_type="my-genes") | ||
disease_linker.predict(sentence, pred_label_type="my-diseases") | ||
|
||
print("Diseases:") | ||
for disease_tag in sentence.get_labels("my-diseases"): | ||
    print(disease_tag) | ||
|
||
print("\nGenes:") | ||
for gene_tag in sentence.get_labels("my-genes"): | ||
    print(gene_tag) | ||
``` | ||
|
||
This will output: | ||
|
||
``` | ||
Diseases: | ||
Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875) | ||
Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203) | ||
Genes: | ||
Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062) | ||
``` | ||
|
||
Moreover, each linker has a pre-defined configuration specifying for which NER annotations it should compute | ||
entity links: | ||
|
||
```python | ||
print(gene_linker.entity_label_types) | ||
print(disease_linker.entity_label_types) | ||
``` | ||
|
||
By default all models will use the *ner* annotation category and apply the linking algorithm for annotations | ||
of the respective entity type: | ||
|
||
```python | ||
{'ner': {'gene'}} | ||
{'ner': {'disease'}} | ||
``` | ||
|
||
You can customize this by using the `entity_label_types` parameter of the [`predict()`](#flair.models.EntityMentionLinker.predict) method: | ||
|
||
```python | ||
from flair.data import Sentence | ||
from flair.models import SequenceTagger | ||
|
||
sentence = Sentence( | ||
    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, " | ||
    "a neurodegenerative disease, which is exacerbated by exposure to high " | ||
    "levels of mercury in mouse populations." | ||
) | ||
|
||
# Use the disease NER tagger from HunFlair v1 | ||
hunflair1_tagger = SequenceTagger.load("hunflair-disease") | ||
hunflair1_tagger.predict(sentence, label_name="my-diseases") | ||
|
||
# Use the entity_label_types parameter in predict() to specify the annotation category | ||
disease_linker.predict(sentence, entity_label_types="my-diseases") | ||
``` | ||
|
||
If your texts carry finer-grained NER annotations, you can specify both the | ||
annotation category and the tag types using a dictionary. For instance: | ||
|
||
```python | ||
gene_linker.predict(sentence, entity_label_types={"ner": {"gene", "protein"}}) | ||
``` |
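One way to read the dictionary form is as a filter: each key names an annotation category, and the associated set lists the tags within that category that should be linked, matching the shape printed for `entity_label_types` above. The following library-independent sketch illustrates that reading on plain tuples. The helper `select_mentions` and the sample annotations are hypothetical and do not reflect how the library implements the selection internally.

```python
# (annotation category, tag, mention text) triples standing in for NER annotations (assumption).
annotations = [
    ("ner", "gene", "ABCD1"),
    ("ner", "disease", "X-linked adrenoleukodystrophy"),
    ("my-diseases", "disease", "neurodegenerative disease"),
]


def select_mentions(annotations, entity_label_types):
    """Keep mentions whose category is listed and whose tag is in that category's set."""
    return [
        text
        for label_type, tag, text in annotations
        if tag in entity_label_types.get(label_type, set())
    ]


print(select_mentions(annotations, {"ner": {"gene", "protein"}}))
print(select_mentions(annotations, {"ner": {"disease"}, "my-diseases": {"disease"}}))
```

Under this reading, a tag listed in the set but absent from the text (such as `protein` here) simply selects nothing.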
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Tutorial: HunFlair2 | ||
=================== | ||
|
||
*HunFlair2* is a state-of-the-art named entity tagger and linker for biomedical texts. It comes with | ||
models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair2* | ||
builds on pretrained domain-specific language models and outperforms other biomedical | ||
NER tools on unseen corpora. | ||
|
||
.. toctree:: | ||
   :glob: | ||
   :maxdepth: 1 | ||
|
||
   overview | ||
   tagging | ||
   linking | ||
   training-ner-models | ||
   customize-linking |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
# HunFlair2 - Tutorial 2: Entity Linking | ||
|
||
[Part 1](project:./tagging.md) of the tutorial showed how to use our pre-trained *HunFlair2* models to | ||
tag biomedical entities in your text. However, documents from different biomedical (sub-)fields may use different | ||
terms to refer to the exact same concept, e.g., “_tumor protein p53_”, “_tumor suppressor p53_”, and “_TRP53_” are all | ||
valid names for the gene “TP53” ([NCBI Gene:7157](https://www.ncbi.nlm.nih.gov/gene/7157)). | ||
To integrate and aggregate entity mentions across multiple documents, the entities must be linked (normalized) | ||
to standardized ontologies or knowledge bases. | ||
|
||
## Linking with pre-trained HunFlair2 Models | ||
|
||
After adding named entity recognition tags to your sentence, you can link the entities to standard ontologies | ||
using distinct, type-specific linking models: | ||
|
||
```python | ||
from flair.models import EntityMentionLinker | ||
from flair.nn import Classifier | ||
from flair.data import Sentence | ||
|
||
sentence = Sentence( | ||
    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, " | ||
    "a neurodegenerative disease, which is exacerbated by exposure to high " | ||
    "levels of mercury in mouse populations." | ||
) | ||
|
||
# Tag named entities in the text | ||
ner_tagger = Classifier.load("hunflair2") | ||
ner_tagger.predict(sentence) | ||
|
||
# Load disease linker and perform disease linking | ||
disease_linker = EntityMentionLinker.load("disease-linker") | ||
disease_linker.predict(sentence) | ||
|
||
# Load gene linker and perform gene linking | ||
gene_linker = EntityMentionLinker.load("gene-linker") | ||
gene_linker.predict(sentence) | ||
|
||
# Load chemical linker and perform chemical linking | ||
chemical_linker = EntityMentionLinker.load("chemical-linker") | ||
chemical_linker.predict(sentence) | ||
|
||
# Load species linker and perform species linking | ||
species_linker = EntityMentionLinker.load("species-linker") | ||
species_linker.predict(sentence) | ||
``` | ||
|
||
```{note} | ||
The ontologies and knowledge bases are pre-processed the first time normalization is executed, | ||
which may take some time. All further calls reuse this pre-processing and run | ||
much faster. | ||
``` | ||
|
||
After running the code we can inspect and output the linked entities via: | ||
|
||
```python | ||
for tag in sentence.get_labels("link"): | ||
    print(tag) | ||
``` | ||
|
||
This should print: | ||
|
||
``` | ||
Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062) | ||
Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875) | ||
Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203) | ||
Span[23:24]: "mercury" → MESH:D008628/name=Mercury (220.39199829101562) | ||
Span[25:26]: "mouse" → 10090/name=Mus musculus (213.6201934814453) | ||
``` | ||
|
||
For each entity, the output contains the NER mention annotation and the ontology identifier to which | ||
the mention was mapped. In addition, the official name of the entity in the ontology and the similarity score | ||
between the entity mention and the ontology concept are given. For instance, the official name for the disease | ||
"_X-linked adrenoleukodystrophy_" is "Adrenoleukodystrophy". The similarity scores are specific to the entity type, | ||
ontology, and linking model used and can therefore only be compared with scores obtained using the exact same | ||
setup. | ||
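If you need the identifier and the official name separately, the printed value can be split on the `/name=` separator. This is a minimal sketch that assumes the value is available as a plain string in the `<identifier>/name=<official name>` form shown in the output above. The helper `split_link_value` is hypothetical, and whether the library also exposes these parts as separate fields is not covered here.

```python
def split_link_value(value):
    """Split '<identifier>/name=<official name>' into (identifier, name).

    The name is None when the separator is missing.
    """
    identifier, separator, name = value.partition("/name=")
    return identifier, (name if separator else None)


for value in ["MESH:D000326/name=Adrenoleukodystrophy", "215/name=ABCD1"]:
    print(split_link_value(value))
```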
|
||
## Overview of pre-trained Entity Linking Models | ||
|
||
HunFlair2 comes with the following pre-trained linking models: | ||
|
||
| Entity Type | Model Name | Ontology / Dictionary | Linking Algorithm / Base Model (Data Set) | | ||
| ----------- | ----------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------- | | ||
| Chemical | `chemical-linker` | [CTD Chemicals](https://ctdbase.org/downloads/#allchems) | [SapBERT (BC5CDR)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-chemical) | | ||
| Disease | `disease-linker` | [CTD Diseases](https://ctdbase.org/downloads/#alldiseases) | [SapBERT (BC5CDR)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease) | | ||
| Gene | `gene-linker` | [NCBI Gene (Human)](https://www.ncbi.nlm.nih.gov/gene) | [SapBERT (BC2GN)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn) | | ||
| Species | `species-linker` | [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) | [SapBERT (UMLS)](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext) | | ||
|
||
For detailed information concerning the different models and their integration please refer to [our paper](https://arxiv.org/abs/2402.12372). | ||
|
||
If you wish to customize the models and dictionaries please refer to the [dedicated tutorial](project:./customize-linking.md). |