Global Entity Disambiguation with BERT

This is the source code for our paper Global Entity Disambiguation with BERT.

The model is based on LUKE and addresses entity disambiguation using local (word-based) and global (entity-based) contextual information. It is trained by predicting randomly masked entities in Wikipedia and achieves state-of-the-art results on five standard entity disambiguation datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI.
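
In global mode the model resolves mentions iteratively, committing at each step to the prediction it is most confident about and feeding the already-resolved entities back as context for the remaining mentions. The plain-Python sketch below only illustrates that decoding loop; score_candidates is a hypothetical placeholder for the model's scoring function and is not part of this repository's API.

# Sketch of confidence-ordered global inference (illustration only).
# `score_candidates` stands in for the model scoring each candidate entity
# given the mention and the entities resolved so far; it assumes every
# mention has at least one candidate.
from typing import Callable

def global_inference(
    mentions: list[str],
    candidates: dict[str, list[str]],
    score_candidates: Callable[[str, list[str], dict[str, str]], list[float]],
) -> dict[str, str]:
    resolved: dict[str, str] = {}  # mention -> predicted entity title
    unresolved = list(mentions)
    while unresolved:
        best = None  # (confidence, mention, entity)
        for mention in unresolved:
            cands = candidates[mention]
            scores = score_candidates(mention, cands, resolved)
            top = max(range(len(cands)), key=lambda i: scores[i])
            if best is None or scores[top] > best[0]:
                best = (scores[top], mention, cands[top])
        _, mention, entity = best
        # Commit the most confident prediction and use it as context
        # for the mentions that are still unresolved.
        resolved[mention] = entity
        unresolved.remove(mention)
    return resolved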

Reproducing Experiments

  • Model checkpoint file: Link
  • Dataset file: Link

Zero-shot evaluation

python examples/entity_disambiguation/evaluate.py \
  --model-dir=<MODEL_DIR> \
  --dataset-dir=<DATASET_DIR> \
  --titles-file=<DATASET_DIR>/enwiki_20181220_titles.txt \
  --redirects-file=<DATASET_DIR>/enwiki_20181220_redirects.tsv \
  --inference-mode=global \
  --document-split-mode=per_mention

Please decompress the checkpoint and dataset files, and replace <MODEL_DIR> and <DATASET_DIR> with the corresponding paths.

Fine-tuning on the CoNLL dataset

Training:

python examples/entity_disambiguation/train.py \
  --model-dir=<MODEL_DIR> \
  --dataset-dir=<DATASET_DIR> \
  --titles-file=<DATASET_DIR>/enwiki_20181220_titles.txt \
  --redirects-file=<DATASET_DIR>/enwiki_20181220_redirects.tsv \
  --output-dir=<OUTPUT_DIR>

Evaluation:

python examples/entity_disambiguation/evaluate.py \
  --model-dir=<OUTPUT_DIR> \
  --dataset-dir=<DATASET_DIR> \
  --titles-file=<DATASET_DIR>/enwiki_20181220_titles.txt \
  --redirects-file=<DATASET_DIR>/enwiki_20181220_redirects.tsv \
  --inference-mode=global \
  --document-split-mode=per_mention

Fast Inference

If you need faster inference, set --inference-mode=local and --document-split-mode=simple. This slightly degrades accuracy, but the code runs much faster.

python examples/entity_disambiguation/evaluate.py \
  --model-dir=<MODEL_DIR> \
  --dataset-dir=<DATASET_DIR> \
  --titles-file=<DATASET_DIR>/enwiki_20181220_titles.txt \
  --redirects-file=<DATASET_DIR>/enwiki_20181220_redirects.tsv \
  --inference-mode=local \
  --document-split-mode=simple
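
The two document split modes differ in how documents longer than the model's maximum input length are handled. The snippet below is only a rough illustration of the idea, not the repository's implementation: "simple" cuts a document into consecutive chunks, whereas "per_mention" builds one window per mention so that each mention is scored with as much surrounding context as possible. Window sizes and data structures are illustrative assumptions.

# Rough illustration (not the actual implementation) of the two split modes.

def simple_splits(tokens: list[str], max_len: int) -> list[list[str]]:
    # "simple": cut the document into consecutive, non-overlapping chunks.
    return [tokens[i : i + max_len] for i in range(0, len(tokens), max_len)]

def per_mention_splits(
    tokens: list[str], mention_positions: list[int], max_len: int
) -> list[list[str]]:
    # "per_mention": one window per mention, roughly centered on it, so every
    # mention is processed with as much surrounding context as possible.
    windows = []
    half = max_len // 2
    for pos in mention_positions:
        start = max(0, min(pos - half, len(tokens) - max_len))
        windows.append(tokens[start : start + max_len])
    return windows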

Training from Scratch

1. Install required packages

poetry install --extras "pretraining opennlp"

2. Build database from Wikipedia dump

python luke/cli.py build-dump-db \
  enwiki-latest-pages-articles.xml.bz2 \
  enwiki.db

The dump file can be downloaded from Wikimedia Downloads.
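For reference, the dump can also be fetched directly from the standard Wikimedia Downloads location for the latest English Wikipedia article dump; the snippet below simply streams it to disk (adjust the dump date or mirror as needed).

# Stream the latest English Wikipedia article dump to disk.
import shutil
import urllib.request

URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

with urllib.request.urlopen(URL) as response, open("enwiki-latest-pages-articles.xml.bz2", "wb") as f:
    shutil.copyfileobj(response, f)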

3. Create entity vocabulary

export PYTHONPATH=examples/entity_disambiguation
python examples/entity_disambiguation/scripts/create_candidate_data.py \
  --db-file=enwiki.db \
  --dataset-dir=<DATASET_DIR> \
  --output-file=candidates.txt

python luke/cli.py build-entity-vocab \
  enwiki.db \
  entity_vocab.jsonl \
  --white-list=candidates.txt \
  --white-list-only
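
The whitelist restricts the entity vocabulary to the candidate entities that appear in the evaluation datasets. As a sanity check, something like the following could verify that the candidate titles made it into the generated vocabulary; it assumes candidates.txt contains one title per line and that each line of entity_vocab.jsonl is a JSON object whose "entities" field lists (title, language) pairs, so check the actual output format of build-entity-vocab before relying on it.

# Hypothetical sanity check: how many whitelisted candidates ended up in the
# entity vocabulary. Assumes one title per line in candidates.txt and a JSONL
# vocabulary whose "entities" field lists (title, language) pairs.
import json

with open("candidates.txt", encoding="utf-8") as f:
    candidates = {line.strip() for line in f if line.strip()}

vocab_titles = set()
with open("entity_vocab.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for title, _language in record["entities"]:
            vocab_titles.add(title)

covered = candidates & vocab_titles
print(f"{len(covered)} / {len(candidates)} candidate titles are in the vocabulary")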

4. Create training dataset

python luke/cli.py build-wikipedia-pretraining-dataset \
  enwiki.db \
  bert-base-uncased \
  entity_vocab.jsonl \
  pretraining_dataset_for_ed \
  --sentence-splitter=opennlp \
  --include-unk-entities

5. Train model

Please see here for details on the pretraining of LUKE.

The DeepSpeed configuration files corresponding to our published checkpoints are available here. Training proceeds in two stages: Stage 1 freezes the BERT weights (--fix-bert-weights) and trains only the entity embeddings for one epoch, and Stage 2 resumes from that checkpoint and updates all parameters up to epoch 6.

Stage 1:

python luke/cli.py \
    compute-total-training-steps \
    --dataset-dir=pretraining_dataset_for_ed \
    --train-batch-size=2048 \
    --num-epochs=1
deepspeed \
    --num_gpus=<NUM_GPUS> \
    luke/pretraining/train.py \
    --output-dir=<OUTPUT_DIR> \
    --deepspeed-config-file=<DEEPSPEED_CONFIG_STAGE1_JSON_FILE> \
    --dataset-dir=pretraining_dataset_for_ed/ \
    --bert-model-name=bert-base-uncased \
    --num-epochs=1 \
    --masked-lm-prob=0.0 \
    --masked-entity-prob=0.3 \
    --fix-bert-weights

Stage 2:

python luke/cli.py \
    compute-total-training-steps \
    --dataset-dir=pretraining_dataset_for_ed \
    --train-batch-size=2048 \
    --num-epochs=6
deepspeed \
    --num_gpus=<NUM_GPUS> \
    luke/pretraining/train.py \
    --output-dir=<OUTPUT_DIR> \
    --deepspeed-config-file=<DEEPSPEED_CONFIG_STAGE2_JSON_FILE> \
    --dataset-dir=pretraining_dataset_for_ed/ \
    --bert-model-name=bert-base-uncased \
    --num-epochs=6 \
    --masked-lm-prob=0.0 \
    --masked-entity-prob=0.3 \
    --reset-optimization-states \
    --resume-checkpoint-id=<OUTPUT_DIR>/checkpoints/epoch1

6. Create Wikipedia data files

export PYTHONPATH=examples/entity_disambiguation
python examples/entity_disambiguation/scripts/create_title_data.py \
  --db-file=enwiki.db \
  --output-file=titles.txt

python examples/entity_disambiguation/scripts/create_redirect_data.py \
  --db-file=enwiki.db \
  --output-file=redirects.tsv
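
These files drive title normalization at evaluation time. The sketch below shows how they could be loaded, under the assumption that titles.txt lists one page title per line and that redirects.tsv maps a redirect source to its target title as two tab-separated columns; verify against the scripts' actual output before relying on it.

# Hypothetical loader for the Wikipedia data files, assuming one title per
# line in titles.txt and "source<TAB>target" rows in redirects.tsv.
with open("titles.txt", encoding="utf-8") as f:
    titles = {line.rstrip("\n") for line in f if line.strip()}

redirects = {}
with open("redirects.tsv", encoding="utf-8") as f:
    for line in f:
        source, target = line.rstrip("\n").split("\t")
        redirects[source] = target

def resolve(title: str) -> str:
    # Follow a redirect (if any) to obtain the canonical page title.
    return redirects.get(title, title)

print(len(titles), "titles,", len(redirects), "redirects")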

7. Convert checkpoint file

export PYTHONPATH=examples/entity_disambiguation
python examples/entity_disambiguation/scripts/convert_checkpoint.py \
    --checkpoint-file=<OUTPUT_DIR>/checkpoints/epoch6/mp_rank_00_model_states.pt \
    --metadata-file=<OUTPUT_DIR>/metadata.json \
    --entity-vocab-file=<OUTPUT_DIR>/entity_vocab.jsonl \
    --output-dir=<MODEL_DIR>

8. Evaluate model

python examples/entity_disambiguation/evaluate.py \
  --model-dir=<MODEL_DIR> \
  --dataset-dir=<DATASET_DIR> \
  --titles-file=titles.txt \
  --redirects-file=redirects.tsv \
  --inference-mode=global \
  --document-split-mode=per_mention

Citation

If you find this work useful, please cite our paper:

@inproceedings{yamada-etal-2022-global-ed,
    title = "Global Entity Disambiguation with BERT",
    author = "Yamada, Ikuya  and
      Washio, Koki  and
      Shindo, Hiroyuki  and
      Matsumoto, Yuji",
    booktitle = "NAACL",
    year = "2022",
    publisher = "Association for Computational Linguistics"
}