Skip to content

General training framework for NER models on MatBERT

Notifications You must be signed in to change notification settings

walkernr/MatBERT_NER

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MatBERT NER

A framework for materials science NER using the HuggingFace Transformers NLP Toolkit.

Installation

git clone https://github.com/walkernr/MatBERT_NER.git MatBERT_NER
cd MatBERT_NER
pip install -r requirements.txt .

Download the model weights here

Example Usage

The folowing command will train the MatBERT model on the solid state dataset using default parameters

python train.py -dv gpu:0 -ds solid_state -ml matbert

Additional parameters can be specified.

usage: train.py [-h] [-dv DEVICE] [-sd SEEDS] [-ts TAG_SCHEMES] [-st SPLITS] [-ds DATASETS] [-ml MODELS] [-sl] [-bs BATCH_SIZE] [-on OPTIMIZER_NAME] [-wd WEIGHT_DECAY] [-ne N_EPOCH]
                [-eu EMBEDDING_UNFREEZE] [-tu TRANSFORMER_UNFREEZE] [-el EMBEDDING_LEARNING_RATE] [-tl TRANSFORMER_LEARNING_RATE] [-cl CLASSIFIER_LEARNING_RATE] [-sf SCHEDULING_FUNCTION]   
                [-km]

optional arguments:
  -h, --help            show this help message and exit
  -dv DEVICE, --device DEVICE
                        computation device for model (e.g. cpu, gpu:0, gpu:1)
  -sd SEEDS, --seeds SEEDS
                        comma-separated seeds for data shuffling and model initialization (e.g. 1,2,3 or 2,4,8)
  -ts TAG_SCHEMES, --tag_schemes TAG_SCHEMES
                        comma-separated tagging schemes to be considered (e.g. iob1,iob2,iobes)
  -st SPLITS, --splits SPLITS
                        comma-separated training splits to be considered, in percent (e.g. 80). test split will always be 10% and the validation split will be 1/8 of the training split   
                        unless the training split is 100%
  -ds DATASETS, --datasets DATASETS
                        comma-separated datasets to be considered (e.g. solid_state,doping)
  -ml MODELS, --models MODELS
                        comma-separated models to be considered (e.g. matbert,scibert,bert)
  -sl, --sentence_level
                        switch for sentence-level learning instead of paragraph-level
  -bs BATCH_SIZE, --batch_size BATCH_SIZE
                        number of samples in each batch
  -on OPTIMIZER_NAME, --optimizer_name OPTIMIZER_NAME
                        name of optimizer, add "_lookahead" to implement lookahead on top of optimizer (not recommended for ranger or rangerlars)
  -wd WEIGHT_DECAY, --weight_decay WEIGHT_DECAY
                        weight decay for optimizer (excluding bias, gamma, and beta)
  -ne N_EPOCH, --n_epoch N_EPOCH
                        number of training epochs
  -eu EMBEDDING_UNFREEZE, --embedding_unfreeze EMBEDDING_UNFREEZE
                        epoch (index) at which bert embeddings are unfrozen
  -tu TRANSFORMER_UNFREEZE, --transformer_unfreeze TRANSFORMER_UNFREEZE
                        comma-separated number of transformers (encoders) to unfreeze at each epoch
  -el EMBEDDING_LEARNING_RATE, --embedding_learning_rate EMBEDDING_LEARNING_RATE
                        embedding learning rate
  -tl TRANSFORMER_LEARNING_RATE, --transformer_learning_rate TRANSFORMER_LEARNING_RATE
                        transformer learning rate
  -cl CLASSIFIER_LEARNING_RATE, --classifier_learning_rate CLASSIFIER_LEARNING_RATE
                        pooler/classifier learning rate
  -sf SCHEDULING_FUNCTION, --scheduling_function SCHEDULING_FUNCTION
                        function for learning rate scheduler (linear, exponential, or cosine)
  -km, --keep_model     switch for saving the best model parameters to disk

To train on custom annotated datasets, the train.py script has a dictionary data_files where additional datasets can be specified. Similarly, alternative pre-trained models can be used by modifying the model_files dictionary.

For prediction, the predict function contained within predict.py can be used. An example that was used internally can be found in the predict_script.py file. Furthermore, an example utilizing MongoDB can be found in the predict_mongo.py script. Note that these two examples will need to be edited for your specific needs to be usable.

License

About

General training framework for NER models on MatBERT

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%