-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
7 changed files
with
57 additions
and
34 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,38 +1,61 @@ | ||
# Simple ELMo | ||
Minimal Python code to get vectors from pre-trained ELMo models in TensorFlow. | ||
_simple_elmo_ is a Python library to work with pre-trained [ELMo embeddings](https://allennlp.org/elmo) in TensorFlow. | ||
|
||
Heavily based on https://github.com/allenai/bilm-tf. | ||
Requires Python >= 3.6 | ||
This is a significantly updated wrapper to the [original ELMo implementation](https://github.com/allenai/bilm-tf). | ||
The main changes are: | ||
- more convenient and transparent data loading (including from compressed files) | ||
- code adapted to modern TensorFlow versions (including TensorFlow 2). | ||
|
||
The main changes: | ||
- more convenient data loading (including from compressed files) | ||
- code adapted to recent TensorFlow versions (including TF 2.0). | ||
# Usage | ||
|
||
# Usage example | ||
`pip install simple_elmo` | ||
|
||
`python3 get_elmo_vectors.py -i test.txt -e ~/PATH_TO_ELMO/` | ||
`model = ElmoModel()` | ||
|
||
`model.load(PATH_TO_ELMO)` | ||
|
||
`elmo_vectors = model.get_elmo_vectors(SENTENCES)` | ||
|
||
`averaged_vectors = model.get_elmo_vector_average(SENTENCES)` | ||
|
||
`PATH_TO_ELMO` is a ZIP archive downloaded from the [NLPL vector repository](http://vectors.nlpl.eu/repository/), | ||
or a directory containing 3 files extracted from such an archive: | ||
OR a directory containing 3 files extracted from such an archive: | ||
- `model.hdf5`, pre-trained ELMo weights in HDF5 format; | ||
- `options.json`, description of the model architecture in JSON; | ||
- `vocab.txt`/`vocab.txt.gz`, one-word-per-line vocabulary of the most frequent words you would like to cache during inference | ||
(not really necessary, the model will infer embeddings for OOV words from their characters). | ||
|
||
Use the `elmo_vectors` tensor for your downstream tasks. | ||
Its dimensions are: (number of sentences, the length of the longest sentence, ELMo dimensionality). | ||
`SENTENCES` is a list of sentences (lists of words). | ||
|
||
Use the `elmo_vectors` and `averaged_vectors` tensors for your downstream tasks. | ||
|
||
# Text classification | ||
`elmo_vectors` contains contextualized word embeddings. Its shape is: (number of sentences, the length of the longest sentence, ELMo dimensionality). | ||
|
||
Use this code to perform document pair classification (like in text entailment or paraphrase detection). | ||
`averaged_vectors` contains one vector per each input sentence, | ||
constructed by averaging individual contextualized word embeddings. | ||
It is a list of vectors (the shape is (ELMo dimensionality)). | ||
|
||
|
||
# Example scripts | ||
|
||
We provide two example scripts to make it easier to start using _simple_elmo_ right away: | ||
- [Token embeddings](https://github.com/ltgoslo/simple_elmo/blob/master/simple_elmo/get_elmo_vectors.py) | ||
|
||
`python3 get_elmo_vectors.py -i test.txt -e ~/PATH_TO_ELMO/` | ||
|
||
- [Text classification](https://github.com/ltgoslo/simple_elmo/blob/master/simple_elmo/text_classification.py) | ||
|
||
`python3 text_classification.py -i paraphrases_lemm.tsv.gz -e ~/PATH_TO_ELMO/` | ||
|
||
The second script can be used to perform document pair classification (like in text entailment or paraphrase detection). | ||
|
||
Simple average of ELMo embeddings for all words in a document is used; | ||
then, the cosine similarity between two documents is calculated and used as a classifier feature. | ||
|
||
Example datasets for Russian (adapted from http://paraphraser.ru/): | ||
Example paraphrase datasets for Russian (adapted from http://paraphraser.ru/): | ||
- https://rusvectores.org/static/testsets/paraphrases.tsv.gz | ||
- https://rusvectores.org/static/testsets/paraphrases_lemm.tsv.gz (lemmatized) | ||
|
||
`python3 text_classification.py -i paraphrases_lemm.tsv.gz -e ~/PATH_TO_ELMO/` | ||
|
||
The library requires Python >= 3.6. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,19 +5,20 @@ | |
|
||
setuptools.setup( | ||
name="simple_elmo", | ||
version="0.1.0", | ||
version="0.2.0", | ||
author="Andrey Kutuzov", | ||
author_email="[email protected]", | ||
description="Useful library to work with pre-trained ELMo embeddings in TensorFlow ", | ||
description="Handy library to work with pre-trained ELMo embeddings in TensorFlow", | ||
long_description=long_description, | ||
long_description_content_type="text/markdown", | ||
url="https://github.com/ltgoslo/simple_elmo", | ||
packages=setuptools.find_packages(), | ||
python_requires='>=3.6', | ||
install_requires=["tensorflow>1.15", "h5py", "numpy", "smart_open>1.8.1", "pandas", "scikit-learn"], | ||
install_requires=["tensorflow>1.15", "h5py", "numpy", "smart_open>1.8.1", "pandas", | ||
"scikit-learn"], | ||
classifiers=[ | ||
"Programming Language :: Python :: 3", | ||
"License :: OSI Approved :: MIT License", | ||
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)", | ||
"Operating System :: OS Independent", | ||
"Topic :: Scientific/Engineering :: Artificial Intelligence", | ||
"Topic :: Utilities" | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
name = "simple_elmo" | ||
from simple_elmo.elmo_helpers import ElmoModel, divide_chunks | ||
from data import Batcher | ||
from model import BidirectionalLanguageModel | ||
from elmo import weight_layers | ||
from simple_elmo.data import Batcher | ||
from simple_elmo.model import BidirectionalLanguageModel | ||
from simple_elmo.elmo import weight_layers |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters