Skip to content

Latest commit

 

History

History
51 lines (35 loc) · 1.72 KB

README.md

File metadata and controls

51 lines (35 loc) · 1.72 KB

A bi-directional LSTM for sequence tagging. This model was developed for Named Entity Recognition (NER) applied to materials science. Details can be found in the following publication: Weston at al., submitted to J. Chem. Inf. Model: https://doi.org/10.26434/chemrxiv.8226068.v1

Usage

Data

The materials-science specific training data included in this repository is heavily truncated; to access the full data, contact Leigh Weston at [email protected]. To use your own data, replace the training/test sets, and embeddings file with your own data in the same format.

Load the data as follows:

from ner_tagging.model.utils import get_data, get_embedding_matrix, get_metrics
word_embedding_dim = 200
training, development, test, word_cache, char_cache = get_data()
embedding_matrix = get_embedding_matrix(word_cache["word_to_integer"], word_embedding_dim)
Training

To train the model first extract the required data:

max_sequence_length = word_cache["max_sequence_length"]
n_words = word_cache["n_words"]
max_char_sequence_length = char_cache["max_word_length"]
n_chars = char_cache["n_chars"]
n_tags = word_cache["n_tags"]

The model has to be built before fitting:

from ner_tagging.model.model import NERTagger
model = NERTagger()
model.build(embedding_matrix, max_sequence_length, n_words, max_char_sequence_length, n_chars, n_tags)
model.fit(X_train, X_train_char, y_train, num_epochs=15)
Predictions and assessment

To assess the model after training, do the following:

from ner_tagging.model.utils import get_metrics
predicted = model.predict(X_dev, X_dev_char)
actual = y_dev.argmax(axis=-1)
print(get_metrics(actual, predicted, word_cache["integer_to_label"]))