Skip to content

Code that implements efficient knowledge graph extraction from the textual descriptions

License

Notifications You must be signed in to change notification settings

kleines-gespenst/Grapher

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Knowledge Graph Generation From Text

Paper Conference

Description

Grapher is an end-to-end multi-stage Knowledge Graph (KG) construction system, that separates the overall generation process into two stages.

The graph nodes are generated first using pretrained language model, such as T5.The input text is transformed into a sequence of text entities. The features corresponding to each entity (node) is extracted and then sent to the edge generation module.

Edge construction, using generation (e.g.,GRU) or a classifier head. Blue circles represent the features corresponding to the actual graph edges (solid lines) and the white circles are the features that are decoded into ⟨NO_EDGE⟩ (dashed line).

Environment

To run this code, please install PyTorch and Pytorch Lightning (we tested the code on Pytorch 1.13 and Pytorch Lightning 1.8.1)

Setup

Install dependencies

# clone project   
git clone [email protected]:IBM/Grapher.git

# navigate to the directory
cd Grapher

# clone an external repository for reading the data
git submodule add https://gitlab.com/webnlg/corpus-reader.git corpusreader

# clone another external repositories for scoring the results
git submodule add https://github.com/WebNLG/WebNLG-Text-to-triples.git WebNLG_Text_to_triples

install requirements

pip install -r requirements.txt

Data

WebNLG 3.0 dataset

# download the dataset   
git clone https://gitlab.com/shimorina/webnlg-dataset.git

How to train

There are two scripts to run two versions of the algorithm

# naviagate to scripts directory
cd scripts

# run Grapher with the edge generation head
./train_gen.sh

# run Grapher with the classifier edge head
./train_class.sh

How to test

# run the test on experiment "webnlg_version_1" using latest checkpoint last.ckpt
python3 main.py --run test --version 1 --default_root_dir output --data_path webnlg-dataset/release_v3.0/en

# run the test on experiment "webnlg_version_1" using checkpoint at iteration 5000
python3 main.py --run test --version 1 --default_root_dir output --data_path webnlg-dataset/release_v3.0/en --checkpoint_model_id 5000

How to run inference

# run inference on experiment "webnlg_version_1" using latest checkpoint last.ckpt
python main.py --run inference --version 1 --default_root_dir output --inference_input_text "Danielle Harris had a main role in Super Capers, a 98 minute long movie."

Results

Results can be visualized in Tensorboard

tensorboard --logdir output

Citation

@inproceedings{grapher2022,
  title={Knowledge Graph Generation From Text},
  author={Igor Melnyk, Pierre Dognin, Payel Das},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP)},
  year={2022}
}

Reproducibility study

  • The scripts ./train_gen and ./train_class.sh both produce an error during execution due to several non-existing arguments passed during the call of main.py (see below).
main.py: error: unrecognized arguments: --max_epochs 100 --accelerator gpu --num_nodes 1 --num_sanity_val_steps 0 --fast_dev_run 0 --overfit_batches 0 --limit_train_batches 1.0 --limit_val_batches 1.0 --limit_test_batches 1.0 --accumulate_grad_batches 10 --detect_anomaly True --log_every_n_steps 100 --val_check_interval 1000
  • Some additional libraries are required, such as SentencePie(See below).
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones that match your environment. Please note that you may need to restart your runtime after installation.
  • Instead of just cloning the two separately needed git repositories corpusreader and WebNLG-Text-to-triples, I added them as submodules (See Setup section) and included them in gitignore in order to not track the pycache changes.

  • I added the parameter cache_dir to point to a directory with enough space, because the model training requires 2950.74 MB for T5.

  • I updated two function names and interfaces in litgrapher.py due to incompatibilities with pytorch_lightning >= 2.0.0:

validation_epoch_end -> on_validation_epoch_end
test_epoch_end -> on_test_epoch_end

About

Code that implements efficient knowledge graph extraction from the textual descriptions

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 93.4%
  • Shell 6.6%