This is the repository of BERT ParsCit, under active development at the National University of Singapore (NUS). The project was built upon a template by ashleve. BERT ParsCit is a BERT-based version of Neural ParsCit, built by researchers at WING@NUS.
# clone project
git clone https://github.com/ljhgabe/BERT-ParsCit
cd BERT-ParsCit
# [OPTIONAL] create conda environment
conda create -n myenv python=3.8
conda activate myenv
# install pytorch according to instructions
# https://pytorch.org/get-started/
# install requirements
pip install -r requirements.txt
The current doc2json tool is used to convert PDFs to JSON. It uses Grobid to first process each PDF into XML, then extracts paper components from the XML.
To set up Doc2Json, run:
sh bin/doc2json/scripts/run.sh
This will set up Doc2Json and Grobid. After installation, the Grobid server is started in the background by default.
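Before converting PDFs, you may want to confirm the Grobid server is actually reachable. A minimal sketch, assuming Grobid runs on its default port 8070 (the helper name `grobid_is_alive` is ours, not part of the project):

```python
from urllib.request import urlopen
from urllib.error import URLError

def grobid_is_alive(base_url: str = "http://localhost:8070") -> bool:
    """Return True if the Grobid service answers its isalive endpoint."""
    try:
        # Grobid exposes a lightweight health-check endpoint at /api/isalive.
        with urlopen(f"{base_url}/api/isalive", timeout=5) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Server not running, wrong port, or network error.
        return False

if __name__ == "__main__":
    print("Grobid up:", grobid_is_alive())
```

If this returns False, re-run the setup script or check the Grobid logs before calling the PDF pipeline.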
from src.pipelines.bert_parscit import predict_for_string, predict_for_text, predict_for_pdf
str_result = predict_for_string(
"Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")
text_result = predict_for_text("test.txt")
pdf_result = predict_for_pdf("test.pdf")
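`predict_for_text` takes a path to a plain-text file. As a hedged sketch, assuming one reference string per line (an assumption, not documented behavior), you could prepare such a file from a list of strings before calling the pipeline; the helper `write_reference_file` and the file name are placeholders of ours:

```python
from pathlib import Path

def write_reference_file(references: list[str], path: str = "refs.txt") -> str:
    """Write one reference string per line to a UTF-8 text file.

    The resulting path can then be passed to predict_for_text(path).
    """
    Path(path).write_text("\n".join(references) + "\n", encoding="utf-8")
    return path

refs = [
    "Calzolari, N. (1982) Towards the organization of lexical definitions "
    "on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, "
    "Charles University, Prague, pp.61-64.",
]
path = write_reference_file(refs)
```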
Train the model with the default configuration:
# train on CPU
python train.py trainer=cpu
# train on GPU
python train.py trainer=gpu
Train the model with an experiment configuration chosen from configs/experiment/:
python train.py experiment=experiment_name.yaml
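As a hedged sketch of what a file under configs/experiment/ might look like, based on the ashleve template layout this project was built upon (every key and value below is an assumption for illustration, not the project's actual config):

```yaml
# @package _global_
# Hypothetical experiment config: configs/experiment/example.yaml

defaults:
  # Swap in alternative component configs from the main config groups.
  - override /trainer: gpu

# Values below override the defaults loaded above.
trainer:
  max_epochs: 20

datamodule:
  batch_size: 64
```

Running `python train.py experiment=example.yaml` would then apply these overrides on top of the default configuration.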
You can override any parameter from the command line like this:
python train.py trainer.max_epochs=20 datamodule.batch_size=64
To show the full stack trace for errors that occur during training or testing:
HYDRA_FULL_ERROR=1 python train.py