Update documentation for Hunflair2 release #3410
Merged
+779
−5
Changes from 7 commits
24 commits
a0ffa82 Add support of loading of hunflair2 model via Classifier.load()
f3db229 Fix model loading and update main documentation pages
e5b99e9 Add platform check while loading entity mention linking models and sh…
f1d5b03 Update documentation
c8ba30a Add hint for model import order in __init__ file
cf07676 Revised link on main page
ec08032 Remove sync file
7b3b79d Fix HunFlair (v1) warnings
bfb5747 Create HUNFLAIR2_TUTORIAL_3_TRAINING_NER.md (WIP) [WangXII]
c1d7912 Update HUNFLAIR2_TUTORIAL_3_TRAINING_NER.md [WangXII]
1f55a7a Update HUNFLAIR2_TUTORIAL_3_TRAINING_NER.md [WangXII]
1fde791 Update HUNFLAIR2_TUTORIAL_3_TRAINING_NER.md [WangXII]
67960cd feat: linking tutorial w/ customizations
7200b10 fix: single entity vs multi-entity tagger
6e00ca9 feat: add new tutorials links
132ed81 chore: fix formatting on file not related to PR
d35fd5d chore: make mypy happy
587f53c fix: try remove circular import
76ed90f chore: revert changes in non-pr related files
09e0bfe Merge branch 'master' into hunflair2-release
3a9aaa1 Merge branch 'master' into hunflair2-release [alanakbik]
848ac61 Merge branch 'master' into hunflair2-release [sg-wbi]
e3634ea Add conversion of string to torch device for convenience [alanakbik]
189c6e2 Ruff fixes [alanakbik]
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -781,6 +781,14 @@ def _fetch_model(model_name) -> str: | |
elif model_name in hu_model_map: | ||
model_path = cached_path(hu_model_map[model_name], cache_dir=cache_dir) | ||
|
||
if model_name.startswith("hunflair"): | ||
Review comment: Same here. Reply: True. However, HunFlair2 will not be loaded as SequenceTaggerModel. I fix it anyway. |
||
log.warning( | ||
"HunFlair (version 1) is deprecated. Consider using HunFlair2 for improved extraction performance: " | ||
"Classifier.load('hunflair2')." | ||
"See https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR2.md for further " | ||
"information." | ||
) | ||
|
||
# special handling for the taggers by the @redewiegergabe project (TODO: move to model hub) | ||
elif model_name == "de-historic-indirect": | ||
model_file = flair.cache_root / cache_dir / "indirect" / "final-model.pt" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,135 @@ | ||
# HunFlair2 | ||
|
||
*HunFlair2* is a state-of-the-art named entity tagger and linker for biomedical texts. It comes with | ||
models for genes/proteins, chemicals, diseases, species and cell lines. *HunFlair2* | ||
builds on pretrained domain-specific language models and outperforms other biomedical | ||
NER tools on unseen corpora. | ||
|
||
<b>Content:</b> | ||
[Quick Start](#quick-start) | | ||
[Tool Comparison](#comparison-to-other-biomedical-entity-extraction-tools) | | ||
[Tutorials](#tutorials) | | ||
[Citing HunFlair2](#citing-hunflair2) | |
|
||
## Quick Start | ||
|
||
#### Requirements and Installation | ||
*HunFlair2* is based on Flair 0.13+ and Python 3.8+. If you do not have Python 3.8 or later, install it first. | |
Then, in your favorite virtual environment, simply do: | ||
``` | ||
pip install flair | ||
``` | ||
|
||
#### Example 1: Biomedical NER | ||
Let's run named entity recognition (NER) over an example sentence. All you need to do is | ||
make a Sentence, load a pre-trained model and use it to predict tags for the sentence: | ||
```python | ||
from flair.data import Sentence | ||
from flair.nn import Classifier | ||
|
||
# make a sentence | ||
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome") | ||
|
||
# load biomedical NER tagger | ||
tagger = Classifier.load("hunflair2") | ||
|
||
# tag sentence | ||
tagger.predict(sentence) | ||
``` | ||
Done! The Sentence now has entity annotations. Let's print the entities found by the tagger: | ||
```python | ||
for entity in sentence.get_labels(): | ||
    print(entity) | |
``` | ||
This should print: | ||
```console | ||
Span[0:2]: "Behavioral abnormalities" → Disease (1.0) | ||
Span[4:5]: "Fmr1" → Gene (1.0) | ||
Span[6:7]: "Mouse" → Species (1.0) | ||
Span[9:12]: "Fragile X Syndrome" → Disease (1.0) | ||
``` | ||
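The `Span[i:j]` prefixes in this output appear to be token indices with an exclusive end. As a rough illustration, the sketch below reproduces the four mentions with plain whitespace splitting. That is an assumption: it happens to agree with the tokenizer for this sentence, but it will not in general (for example around punctuation).

```python
text = "Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome"

# Assumption: whitespace splitting stands in for the real tokenizer here.
tokens = text.split()

# (start, end) token indices copied from the printed output above.
spans = [(0, 2), (4, 5), (6, 7), (9, 12)]

# Slice the token list with an exclusive end index and re-join the tokens.
mentions = [" ".join(tokens[start:end]) for start, end in spans]

for (start, end), mention in zip(spans, mentions):
    print(f"Span[{start}:{end}]: {mention!r}")
```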
|
||
#### Example 2: Biomedical NEN | ||
To integrate and aggregate information from multiple documents, the entities must be linked / normalized to | |
standardized ontologies or knowledge bases. Let's perform entity normalization by using | |
specialized models per entity type: | |
```python | ||
from flair.data import Sentence | ||
from flair.models import EntityMentionLinker | ||
from flair.nn import Classifier | ||
|
||
# make a sentence | ||
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome") | ||
|
||
# load biomedical NER tagger + predict entities | ||
tagger = Classifier.load("hunflair2") | ||
tagger.predict(sentence) | ||
|
||
# load gene linker and perform normalization | ||
gene_linker = EntityMentionLinker.load("gene-linker") | ||
gene_linker.predict(sentence) | ||
|
||
# load disease linker and perform normalization | ||
disease_linker = EntityMentionLinker.load("disease-linker") | ||
disease_linker.predict(sentence) | ||
|
||
# load species linker and perform normalization | ||
species_linker = EntityMentionLinker.load("species-linker") | ||
species_linker.predict(sentence) | ||
``` | ||
**Note**: the ontologies and knowledge bases used are pre-processed the first time the normalization is executed, | |
which might take some time. All further calls reuse this pre-processing and run | |
much faster. | |
|
||
Done! The Sentence now has entity normalizations. Let's print the entity identifiers found by the linkers: | ||
```python | ||
for entity in sentence.get_labels("link"): | ||
    print(entity) | |
``` | ||
This should print: | ||
```console | ||
Span[0:2]: "Behavioral abnormalities" → MESH:D001523/name=Mental Disorders (197.9467010498047) | ||
Span[4:5]: "Fmr1" → 108684022/name=FRAXA (219.9510040283203) | ||
Span[6:7]: "Mouse" → 10090/name=Mus musculus (213.6201934814453) | ||
Span[9:12]: "Fragile X Syndrome" → MESH:D005600/name=Fragile X Syndrome (193.7115020751953) | ||
``` | ||
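Each printed link value combines a knowledge-base identifier with a canonical name, separated by `/name=`. If you want the two parts separately, a minimal sketch could look like the following. The format is inferred only from the output above, and `LinkedEntity` and `parse_link_value` are hypothetical helpers, not part of Flair.

```python
from typing import NamedTuple


class LinkedEntity(NamedTuple):
    """Hypothetical container for a parsed link value (not a Flair class)."""

    identifier: str
    name: str


def parse_link_value(value: str) -> LinkedEntity:
    """Split 'ID/name=NAME' into its parts; the name is empty if the marker is missing."""
    identifier, _, name = value.partition("/name=")
    return LinkedEntity(identifier, name)


# Link values copied from the printed output above.
values = [
    "MESH:D001523/name=Mental Disorders",
    "108684022/name=FRAXA",
    "10090/name=Mus musculus",
    "MESH:D005600/name=Fragile X Syndrome",
]

for value in values:
    print(parse_link_value(value))
```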
|
||
## Comparison to other biomedical entity extraction tools | ||
Tools for biomedical entity extraction are typically trained and evaluated on single, rather small gold standard | ||
data sets. However, they are applied "in the wild" to a much larger collection of texts, often varying in | ||
topic, entity distribution, genre (e.g. patents vs. scientific articles) and text type (e.g. abstract | ||
vs. full text), which can lead to severe drops in performance. | ||
|
||
*HunFlair2* outperforms other biomedical entity extraction tools on average on corpora that were used to train neither | |
*HunFlair2* nor any of the competitor tools. | |
|
||
| Corpus | Entity Type | BENT | BERN2 | PubTator Central | SciSpacy | HunFlair2 | |
|----------------------------------------------------------------------------------------------|-------------|-------|-------|------------------|----------|-------------| | ||
| [MedMentions](https://github.com/chanzuckerberg/MedMentions) | Chemical | 40.90 | 41.79 | 31.28 | 34.95 | *__51.17__* | | ||
| | Disease | 45.94 | 47.33 | 41.11 | 40.78 | *__57.27__* | | ||
| [tmVar (v3)](https://github.com/ncbi/tmVar3?tab=readme-ov-file) | Gene | 0.54 | 43.96 | *__86.02__* | - | 76.75 | | ||
| [BioID](https://biocreative.bioinformatics.udel.edu/media/store/files/2018/BC6_track1_1.pdf) | Species | 10.35 | 14.35 | *__58.90__* | 37.14 | 49.66 | | ||
||||| | ||
| Average | All | 24.43 | 36.86 | 54.33 | 37.61 | *__58.79__* | | ||
|
||
<sub>All results are F1 scores highlighting end-to-end performance, i.e., named entity recognition and normalization, | ||
using partial matching of predicted text offsets with the original char offsets of the gold standard data. | ||
We allow a shift by max one character.</sub> | ||
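To make the matching criterion more concrete, here is a minimal sketch of one possible reading of it: a prediction counts as correct if its identifier equals the gold identifier and both of its character offsets lie within one character of the gold offsets. This reading, the greedy one-to-one matching, and the toy identifiers are all assumptions for illustration. The exact evaluation protocol is the one described in the paper, not this code.

```python
from typing import List, Tuple

# (start offset, end offset, normalized identifier)
Mention = Tuple[int, int, str]


def matches(pred: Mention, gold: Mention, max_shift: int = 1) -> bool:
    """Assumed criterion: same identifier, both boundaries off by at most max_shift."""
    return (
        abs(pred[0] - gold[0]) <= max_shift
        and abs(pred[1] - gold[1]) <= max_shift
        and pred[2] == gold[2]
    )


def f1_score(preds: List[Mention], golds: List[Mention], max_shift: int = 1) -> float:
    """Greedily match each prediction to at most one unused gold mention."""
    unmatched = list(golds)
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if matches(pred, gold, max_shift):
                unmatched.remove(gold)  # safe: we stop iterating right after
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(preds)
    recall = true_positives / len(golds)
    return 2 * precision * recall / (precision + recall)


# Toy data with made-up identifiers.
golds = [(0, 24, "ID-A"), (32, 36, "ID-B"), (56, 74, "ID-C")]
preds = [(0, 24, "ID-A"), (33, 36, "ID-B"), (41, 46, "ID-D")]
score = f1_score(preds, golds)
print(f"F1 = {score:.4f}")
```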
|
||
You can find detailed evaluations and discussions in [our paper](https://arxiv.org/abs/2402.12372). | ||
|
||
## Tutorials | ||
We provide a set of quick tutorials to get you started with *HunFlair2*: | ||
* [Tutorial 1: Tagging biomedical named entities](HUNFLAIR2_TUTORIAL_1_TAGGING.md) | ||
* [Tutorial 2: Linking biomedical named entities](HUNFLAIR2_TUTORIAL_2_LINKING.md) | ||
|
||
## Citing HunFlair2 | ||
Please cite the following paper when using *HunFlair2*: | ||
~~~ | ||
@article{sanger2024hunflair2, | ||
title={HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools}, | ||
author={S{\"a}nger, Mario and Garda, Samuele and Wang, Xing David and Weber-Genzel, Leon and Droop, Pia and Fuchs, Benedikt and Akbik, Alan and Leser, Ulf}, | ||
journal={arXiv preprint arXiv:2402.12372}, | ||
year={2024} | ||
} | ||
~~~ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
# HunFlair2 - Tutorial 1: Tagging | ||
|
||
This is part 1 of the tutorial, in which we show how to use our pre-trained *HunFlair2* models to tag your text. | ||
|
||
### Tagging with Pre-trained HunFlair2-Models | ||
Let's use the pre-trained *HunFlair2* model for biomedical named entity recognition (NER). | ||
This model was trained over multiple biomedical NER data sets and can recognize 5 different entity types, | ||
i.e. cell lines, chemicals, diseases, genes / proteins and species. | |
```python | ||
from flair.nn import Classifier | ||
|
||
tagger = Classifier.load("hunflair2") | ||
``` | ||
All you need to do is use the predict() method of the tagger on a sentence. | ||
This will add predicted tags to the tokens in the sentence. | ||
Let's use a sentence with four named entities: | |
```python | ||
from flair.data import Sentence | ||
|
||
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome") | ||
|
||
# predict NER tags | ||
tagger.predict(sentence) | ||
|
||
# print the predicted tags | ||
for entity in sentence.get_labels(): | ||
    print(entity) | |
``` | ||
This should print: | ||
```console | ||
Span[0:2]: "Behavioral abnormalities" → Disease (1.0) | ||
Span[4:5]: "Fmr1" → Gene (1.0) | ||
Span[6:7]: "Mouse" → Species (1.0) | ||
Span[9:12]: "Fragile X Syndrome" → Disease (1.0) | ||
``` | ||
The output indicates that there are two diseases mentioned in the text ("_Behavioral abnormalities_" and | |
"_Fragile X Syndrome_") as well as one gene ("_Fmr1_") and one species ("_Mouse_"). For each entity, the | |
output gives the text span of the mention in the sentence and a label with a value and a score (confidence in the | |
prediction). You can also get additional information, such as the position offsets of each entity | |
in the sentence, in a structured way by calling the `to_dict()` method: | |
|
||
```python | ||
print(sentence.to_dict()) | ||
``` | ||
This should print: | ||
```python | ||
{ | ||
'text': 'Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome', | ||
'labels': [], | ||
'entities': [ | ||
{'text': 'Behavioral abnormalities', 'start_pos': 0, 'end_pos': 24, 'labels': [{'value': 'Disease', 'confidence': 0.9999860525131226}]}, | ||
{'text': 'Fmr1', 'start_pos': 32, 'end_pos': 36, 'labels': [{'value': 'Gene', 'confidence': 0.9999895095825195}]}, | ||
{'text': 'Mouse', 'start_pos': 41, 'end_pos': 46, 'labels': [{'value': 'Species', 'confidence': 0.9999873638153076}]}, | ||
{'text': 'Fragile X Syndrome', 'start_pos': 56, 'end_pos': 74, 'labels': [{'value': 'Disease', 'confidence': 0.9999928871790568}]} | ||
], | ||
# further sentence information | ||
} | ||
``` | ||
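The `start_pos` and `end_pos` values are character offsets into `text`, with an exclusive end. The sketch below checks this on a hand-copied subset of the dictionary printed above, so it runs without loading a model. The variable names are illustrative only.

```python
# Subset of the dictionary printed above, copied by hand.
annotated = {
    "text": "Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
    "entities": [
        {"text": "Behavioral abnormalities", "start_pos": 0, "end_pos": 24,
         "labels": [{"value": "Disease", "confidence": 0.9999860525131226}]},
        {"text": "Fmr1", "start_pos": 32, "end_pos": 36,
         "labels": [{"value": "Gene", "confidence": 0.9999895095825195}]},
        {"text": "Mouse", "start_pos": 41, "end_pos": 46,
         "labels": [{"value": "Species", "confidence": 0.9999873638153076}]},
        {"text": "Fragile X Syndrome", "start_pos": 56, "end_pos": 74,
         "labels": [{"value": "Disease", "confidence": 0.9999928871790568}]},
    ],
}

surfaces = []
for entity in annotated["entities"]:
    # Slice the original text with the character offsets of the entity.
    surface = annotated["text"][entity["start_pos"]:entity["end_pos"]]
    surfaces.append(surface)
    # Pick the label with the highest confidence for this entity.
    best = max(entity["labels"], key=lambda label: label["confidence"])
    print(f"{surface!r} [{entity['start_pos']}:{entity['end_pos']}] -> {best['value']}")
```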
|
||
### Using a Biomedical Tokenizer | ||
Tokenization, i.e. separating a text into tokens / words, is an important issue in natural language processing | ||
in general and biomedical text mining in particular. So far, we used a tokenizer for general domain text. | ||
This can be unfavourable if applied to biomedical texts. | ||
|
||
*HunFlair2* integrates [SciSpaCy](https://allenai.github.io/scispacy/), a library specially designed to work with scientific text. | ||
To use the library, we first have to install it and download one of its models: | |
~~~ | ||
pip install scispacy==0.5.1 | ||
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz | ||
~~~ | ||
|
||
To use the tokenizer, we just have to pass it as a parameter when instantiating a sentence: | |
```python | ||
from flair.tokenization import SciSpacyTokenizer | ||
|
||
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome", | ||
use_tokenizer=SciSpacyTokenizer()) | ||
``` | ||
|
||
### Working with longer Texts | ||
Often, we are concerned with complete scientific abstracts or full-texts when performing biomedical text mining, e.g. | ||
```python | ||
abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, " \ | ||
"coding for the FMRP protein which is largely involved in synaptic function. FXS patients present several " \ | ||
"behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive " \ | ||
"deficits. Autistic symptoms, e.g., altered social interaction and communication, are also often observed: " \ | ||
"FXS is indeed the most common monogenic cause of autism." | ||
``` | ||
|
||
To work with complete abstracts or full-text, we first have to split them into separate sentences. | ||
Again we can apply the integration of the [SciSpaCy](https://allenai.github.io/scispacy/) library: | ||
```python | ||
from flair.splitter import SciSpacySentenceSplitter | ||
|
||
# initialize the sentence splitter | ||
splitter = SciSpacySentenceSplitter() | ||
|
||
# split text into a list of Sentence objects | ||
sentences = splitter.split(abstract) | ||
|
||
# you can apply the HunFlair2 tagger directly to this list | |
tagger.predict(sentences) | ||
``` | ||
We can access the annotations of the individual sentences by iterating over the list: | |
```python | ||
for sentence in sentences: | ||
    print(sentence.to_tagged_string()) | |
``` | ||
This should print: | ||
~~~ | ||
Sentence[35]: "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, coding for the FMRP protein which is largely involved in synaptic function." \ | ||
→ ["Fragile X syndrome"/Disease, "FXS"/Disease, "developmental disorder"/Disease, "X-linked"/Gene, "FMR1"/Gene, "FMRP"/Gene] | ||
Sentence[23]: "FXS patients present several behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive deficits." \ | ||
→ ["FXS"/Disease, "patients"/Species, "behavioral abnormalities"/Disease, "hyperactivity"/Disease, "anxiety"/Disease, "sensory hyper-responsiveness"/Disease, "cognitive deficits"/Disease] | ||
Sentence[27]: "Autistic symptoms, e.g., altered social interaction and communication, are also often observed: FXS is indeed the most common monogenic cause of autism." \ | ||
→ ["Autistic symptoms"/Disease, "altered social interaction and communication"/Disease, "FXS"/Disease, "autism"/Disease] | ||
~~~ | ||
|
||
### Next | ||
Now, let us look at how to [link / normalize the entities to standard ontologies](HUNFLAIR2_TUTORIAL_2_LINKING.md) | ||
in the second tutorial. |
"hunflair2" also starts with "hunflair", so I think this warning would always be printed.
True. However, HunFlair2 will not be loaded as MultitaskModel. I fix it anyway.
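The point raised in this thread can be reproduced in isolation: `str.startswith("hunflair")` is also true for `"hunflair2"`. The sketch below shows one possible guard that excludes HunFlair2 names. It is only an illustration: how the PR actually resolved the issue is not shown in this excerpt, `is_hunflair_v1` is a hypothetical helper, and `"hunflair-example"` is a made-up model name.

```python
def is_hunflair_v1(model_name: str) -> bool:
    """Return True for HunFlair (version 1) names but not for HunFlair2 names."""
    return model_name.startswith("hunflair") and not model_name.startswith("hunflair2")


# The original check alone cannot tell the two versions apart.
print("hunflair2".startswith("hunflair"))

# The stricter guard only fires for version 1 names.
for name in ["hunflair", "hunflair-example", "hunflair2"]:
    print(name, is_hunflair_v1(name))
```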