Merge branch 'fix_all_scores_for_token' of https://github.com/mauryaland/flair into fix_all_scores_for_token
mauryaland committed May 30, 2024
2 parents 7161b1b + cec82dc commit c159707
Showing 14 changed files with 750 additions and 137 deletions.
3 changes: 2 additions & 1 deletion docs/tutorial/index.rst
@@ -10,4 +10,5 @@ Tutorials
intro
tutorial-basics/index
tutorial-training/index
- tutorial-embeddings/index
+ tutorial-embeddings/index
+ tutorial-hunflair2/index
7 changes: 4 additions & 3 deletions docs/tutorial/tutorial-basics/entity-mention-linking.md
@@ -1,6 +1,7 @@
# Using and creating entity mention linker

- As of Flair 0.14 we ship the [entity mention linker](#flair.models.EntityMentionLinker) - the core framework behind the [Hunflair BioNEN aproach](https://huggingface.co/hunflair)].
+ As of Flair 0.14 we ship the [entity mention linker](#flair.models.EntityMentionLinker) - the core framework behind the [Hunflair BioNEN approach](https://huggingface.co/hunflair).
You can read more at the [Hunflair2 tutorials](project:../tutorial-hunflair2/overview.md)

## Example 1: Printing Entity linking outputs to console

@@ -19,7 +20,7 @@ sentence = Sentence(
use_tokenizer=SciSpacyTokenizer()
)

- ner_tagger = Classifier.load("hunflair")
+ ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)

nen_tagger = EntityMentionLinker.load("disease-linker-no-ab3p")
@@ -31,7 +32,7 @@ for tag in sentence.get_labels():

```{note}
Here we use the `disease-linker-no-ab3p` model, as it is the simplest model to run. You might get better results by using `disease-linker` instead,
- but under the hood ab3p uses an executeable that is only compiled for linux and therefore won't run on every system.
+ but that would require you to install `pyab3p` via `pip install pyab3p`.
Analogously to `disease`, there are also linkers for `chemical`, `species` and `gene`;
all follow the `{entity_type}-linker` or `{entity_type}-linker-no-ab3p` naming scheme.
146 changes: 146 additions & 0 deletions docs/tutorial/tutorial-hunflair2/customize-linking.md
@@ -0,0 +1,146 @@
# HunFlair2 Tutorial 4: Customizing linking models

In this tutorial you'll find information on how to customize the entity linking models according to your needs.
As of now, fine-tuning the models is not supported.

## Customize dictionary

All linking models come with a pre-defined pairing of entity type and dictionary,
e.g. "Disease" mentions are linked by default to the [CTD Diseases](https://ctdbase.org/help/diseaseDetailHelp.jsp).
You can change the dictionary to which mentions are linked by following the steps below.
We'll be using the [Human Phenotype Ontology](https://hpo.jax.org/app/) in our example
(Download the `hp.json` file you find [here](https://hpo.jax.org/app/data/ontology) if you want to follow along).

First, we load the original data into a Python dictionary that maps names to concept identifiers:

```python
import json
from collections import defaultdict

# Load the ontology file downloaded from the HPO website
with open("hp.json") as fp:
    data = json.load(fp)

# Keep only class nodes and map every name (label or synonym) to its concept IDs
nodes = [n for n in data['graphs'][0]['nodes'] if n.get('type') == 'CLASS']
hpo = defaultdict(list)
for node in nodes:
    concept_id = node['id'].replace('http://purl.obolibrary.org/obo/', '')
    names = [node['lbl']] + [s['val'] for s in node.get('synonym', [])]
    for name in names:
        hpo[name].append(concept_id)
```
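
To make the shape of the resulting mapping concrete, here is a minimal sketch that runs the same loop on a tiny hand-written stand-in for `hp.json`. The nodes, labels and synonyms are made up for illustration and are not taken from the real ontology file; only the key layout mirrors the code above.

```python
from collections import defaultdict

# Hypothetical miniature stand-in for hp.json (illustrative values only)
prefix = "http://purl.obolibrary.org/obo/"
data = {
    "graphs": [
        {
            "nodes": [
                {
                    "id": prefix + "HP_0000001",
                    "lbl": "Example phenotype",
                    "type": "CLASS",
                    "synonym": [{"val": "Shared name"}],
                },
                {"id": prefix + "HP_0000002", "lbl": "Shared name", "type": "CLASS"},
                # Non-class nodes are skipped by the type check below
                {"id": prefix + "BFO_0000050", "lbl": "part of", "type": "PROPERTY"},
            ]
        }
    ]
}

hpo = defaultdict(list)
for node in data["graphs"][0]["nodes"]:
    if node.get("type") != "CLASS":
        continue
    concept_id = node["id"].replace(prefix, "")
    names = [node["lbl"]] + [s["val"] for s in node.get("synonym", [])]
    for name in names:
        hpo[name].append(concept_id)

# A name shared by several concepts collects all of their identifiers
print(dict(hpo))
```

In this toy data, "Shared name" ends up with two identifiers, which is the case the `additional_ids` field of a candidate is meant to cover.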

Then we can convert this mapping into an [`InMemoryEntityLinkingDictionary`](#flair.datasets.entity_linking.InMemoryEntityLinkingDictionary) that can be used by our linking model:

```python
from flair.datasets.entity_linking import (
    InMemoryEntityLinkingDictionary,
    EntityCandidate,
)

database_name = "HPO"

# One candidate per name: the first ID is the primary one, the rest are kept as additional IDs
candidates = [
    EntityCandidate(
        concept_id=ids[0],
        concept_name=name,
        additional_ids=ids[1:],
        database_name=database_name,
    )
    for name, ids in hpo.items()
]

dictionary = InMemoryEntityLinkingDictionary(
    candidates=candidates, dataset_name=database_name
)
```
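
The comprehension above keeps the first identifier of each name as `concept_id` and moves any remaining ones to `additional_ids`. The sketch below shows that split in plain Python so it runs without flair installed. The mapping is hypothetical, and plain dicts stand in for `EntityCandidate` objects.

```python
# Hypothetical name -> identifiers mapping (illustrative values only)
hpo = {
    "Example phenotype": ["HP_0000001"],
    "Shared name": ["HP_0000001", "HP_0000002"],
}

# Plain dicts stand in for EntityCandidate objects in this sketch
rows = [
    {"concept_name": name, "concept_id": ids[0], "additional_ids": ids[1:]}
    for name, ids in hpo.items()
]

for row in rows:
    print(row)

# Names that map to more than one concept are worth reviewing by hand
ambiguous = [row["concept_name"] for row in rows if row["additional_ids"]]
print(ambiguous)
```

Which identifier ends up first depends only on the order in which the nodes were read, so it may be worth checking ambiguous names before relying on the primary ID.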

To use this dictionary you need to initialize a new linker model with it.
See the section below for that.

## Custom pre-trained model

You can initialize a new [`EntityMentionLinker`](#flair.models.EntityMentionLinker) with both a custom model and custom dictionary (see section above) like this:

```python
from flair.models import EntityMentionLinker

pretrained_model = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
linker = EntityMentionLinker.build(
    pretrained_model,
    dictionary=dictionary,
    hybrid_search=False,
    entity_type="disease",
)
```

Omitting the `dictionary` parameter will load the default dictionary for the specified `entity_type`.

## Customizing Prediction Labels

In the default setup all linker models output their prediction into the same annotation category *link*.
To record the NEN annotation in separate categories, you can use the `pred_label_type` parameter of the
[`predict()`](#flair.models.EntityMentionLinker.predict) method:

```python
gene_linker.predict(sentence, pred_label_type="my-genes")
disease_linker.predict(sentence, pred_label_type="my-diseases")

print("Diseases:")
for disease_tag in sentence.get_labels("my-diseases"):
    print(disease_tag)

print("\nGenes:")
for gene_tag in sentence.get_labels("my-genes"):
    print(gene_tag)
```

This will output:

```
Diseases:
Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)
Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203)
Genes:
Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062)
```

Moreover, each linker has a pre-defined configuration specifying for which NER annotations it should compute
entity links:

```python
print(gene_linker.entity_label_types)
print(disease_linker.entity_label_types)
```

By default all models will use the *ner* annotation category and apply the linking algorithm for annotations
of the respective entity type:

```python
{'ner': {'gene'}}
{'ner': {'disease'}}
```

You can customize this by using the `entity_label_types` parameter of the [`predict()`](#flair.models.EntityMentionLinker.predict) method:

```python
sentence = Sentence(
    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, "
    "a neurodegenerative disease, which is exacerbated by exposure to high "
    "levels of mercury in mouse populations."
)

from flair.models import SequenceTagger

# Use disease ner tagger from HunFlair v1
hunflair1_tagger = SequenceTagger.load("hunflair-disease")
hunflair1_tagger.predict(sentence, label_name="my-diseases")

# Use the entity_label_types parameter in predict() to specify the annotation category
disease_linker.predict(sentence, entity_label_types="my-diseases")
```

If you are using annotated texts with more fine-grained NER annotations, you can specify both the
annotation category and the accepted tag types using a dictionary. For instance:

```python
gene_linker.predict(sentence, entity_label_types={"ner": {"gene", "protein"}})
```
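
The filtering described above can be pictured with a small plain-Python sketch. This is not flair's implementation: `normalize_label_types` and `select_annotations` are hypothetical helpers, the annotations are made-up tuples, and treating a bare string as "accept every tag in that category" is an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple, Union

LabelTypes = Union[str, Iterable[str], Dict[str, Set[str]]]
Annotation = Tuple[str, str, str]  # (annotation category, tag, mention text)


def normalize_label_types(label_types: LabelTypes) -> Dict[str, Set[str]]:
    """Bring the accepted spellings into one dict form (assumed semantics)."""
    if isinstance(label_types, str):
        return {label_types: set()}  # empty set: accept every tag (assumption)
    if isinstance(label_types, dict):
        return {cat: {tag.lower() for tag in tags} for cat, tags in label_types.items()}
    return {cat: set() for cat in label_types}


def select_annotations(annotations: List[Annotation], label_types: LabelTypes) -> List[Annotation]:
    """Keep annotations whose category, and tag if restricted, is requested."""
    wanted = normalize_label_types(label_types)
    return [
        (cat, tag, text)
        for cat, tag, text in annotations
        if cat in wanted and (not wanted[cat] or tag.lower() in wanted[cat])
    ]


# Made-up annotations for illustration
annotations = [
    ("ner", "Gene", "ABCD1"),
    ("ner", "Disease", "X-linked adrenoleukodystrophy"),
    ("ner", "Protein", "p53"),
    ("my-diseases", "Disease", "neurodegenerative disease"),
]

gene_like = select_annotations(annotations, {"ner": {"gene", "protein"}})
custom = select_annotations(annotations, "my-diseases")
print(gene_like)
print(custom)
```

The dictionary form restricts both the annotation category and the tag, while the string form only names the category.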
17 changes: 17 additions & 0 deletions docs/tutorial/tutorial-hunflair2/index.rst
@@ -0,0 +1,17 @@
Tutorial: HunFlair2
===================

*HunFlair2* is a state-of-the-art named entity tagger and linker for biomedical texts. It comes with
models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair2*
builds on pretrained domain-specific language models and outperforms other biomedical
NER tools on unseen corpora.

.. toctree::
   :glob:
   :maxdepth: 1

   overview
   tagging
   linking
   training-ner-models
customize-linking
90 changes: 90 additions & 0 deletions docs/tutorial/tutorial-hunflair2/linking.md
@@ -0,0 +1,90 @@
# HunFlair2 - Tutorial 2: Entity Linking

[Part 1](project:./tagging.md) of the tutorial showed how to use our pre-trained *HunFlair2* models to
tag biomedical entities in your text. However, documents from different biomedical (sub-)fields may use different
terms to refer to the exact same concept, e.g., “_tumor protein p53_”, “_tumor suppressor p53_” and “_TRP53_” are all
valid names for the gene “TP53” ([NCBI Gene:7157](https://www.ncbi.nlm.nih.gov/gene/7157)).
To integrate and aggregate entity mentions from multiple documents, the entities must be linked (normalized)
to standardized ontologies or knowledge bases.

## Linking with pre-trained HunFlair2 Models

After adding named entity recognition tags to your sentence, you can link the entities to standard ontologies
using distinct, type-specific linking models:

```python
from flair.models import EntityMentionLinker
from flair.nn import Classifier
from flair.data import Sentence

sentence = Sentence(
    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, "
    "a neurodegenerative disease, which is exacerbated by exposure to high "
    "levels of mercury in mouse populations."
)

# Tag named entities in the text
ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)

# Load disease linker and perform disease linking
disease_linker = EntityMentionLinker.load("disease-linker")
disease_linker.predict(sentence)

# Load gene linker and perform gene linking
gene_linker = EntityMentionLinker.load("gene-linker")
gene_linker.predict(sentence)

# Load chemical linker and perform chemical linking
chemical_linker = EntityMentionLinker.load("chemical-linker")
chemical_linker.predict(sentence)

# Load species linker and perform species linking
species_linker = EntityMentionLinker.load("species-linker")
species_linker.predict(sentence)
```

```{note}
The ontologies and knowledge bases used are pre-processed the first time the normalisation is executed,
which might take some time. All further calls reuse this pre-processing and run
much faster.
```

After running the code we can inspect and output the linked entities via:

```python
for tag in sentence.get_labels("link"):
    print(tag)
```

This should print:

```
Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062)
Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)
Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203)
Span[23:24]: "mercury" → MESH:D008628/name=Mercury (220.39199829101562)
Span[25:26]: "mouse" → 10090/name=Mus musculus (213.6201934814453)
```

For each entity, the output shows the mention found by the NER tagger and the ontology identifier to which
it was mapped. It also gives the official name of the concept in the ontology and the similarity score
between the mention and that concept. For instance, the official name for the disease
"_X-linked adrenoleukodystrophy_" is "Adrenoleukodystrophy". The similarity scores are specific to the entity type,
ontology and linking model used, so they can only be compared with scores obtained from the exact same
setup.
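
If you only have the printed lines at hand, the pieces described above can be pulled apart with a regular expression. The sketch below is an illustration that assumes the exact string format shown in the output above, which may differ between flair versions; `LINK_PATTERN` and the `links` structure are hypothetical names.

```python
import re

# Lines copied from the example output above
printed = '''Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062)
Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)
Span[25:26]: "mouse" → 10090/name=Mus musculus (213.6201934814453)'''

# Token span, quoted mention, concept ID, official name and similarity score
LINK_PATTERN = re.compile(r'^Span\[(\d+):(\d+)\]: "(.+)" → (\S+?)/name=(.+) \(([-\d.]+)\)$')

links = []
for line in printed.splitlines():
    match = LINK_PATTERN.match(line)
    if match is None:
        continue  # skip lines that do not follow the assumed format
    start, end, mention, concept_id, concept_name, score = match.groups()
    links.append(
        {
            "span": (int(start), int(end)),
            "mention": mention,
            "concept_id": concept_id,
            "concept_name": concept_name,
            "score": float(score),
        }
    )

for link in links:
    print(link["mention"], "->", link["concept_id"], link["concept_name"])
```

Since the scores depend on the entity type and model, it is safer to compare them only among links produced by the same linker.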

## Overview of pre-trained Entity Linking Models

HunFlair2 comes with the following pre-trained linking models:

| Entity Type | Model Name | Ontology / Dictionary | Linking Algorithm / Base Model (Data Set) |
| ----------- | ----------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| Chemical | `chemical-linker` | [CTD Chemicals](https://ctdbase.org/downloads/#allchems) | [SapBERT (BC5CDR)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-chemical) |
| Disease | `disease-linker` | [CTD Diseases](https://ctdbase.org/downloads/#alldiseases) | [SapBERT (NCBI Disease)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease) |
| Gene | `gene-linker` | [NCBI Gene (Human)](https://www.ncbi.nlm.nih.gov/gene) | [SapBERT (BC2GN)](https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn) |
| Species     | `species-linker`  | [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy)     | [SapBERT (UMLS)](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext)  |

For detailed information concerning the different models and their integration please refer to [our paper](https://arxiv.org/abs/2402.12372).

If you wish to customize the models and dictionaries please refer to the [dedicated tutorial](project:./customize-linking.md).