This repository contains the implementation of the C2E2 contrastive learning method for a Korean fact-checking dataset.
The Korean fact-checking dataset can be obtained from this repository.
data/wiki_claims.json
: Human-annotated dataset for fact-checking
data/train_val_test_ids.json
: Lists of claim IDs for the train/validation/test splits
data/wiki/wiki_docs.json
: Wikipedia documents corresponding to the claims in wiki_claims.json
dr/dr_results.json
pretrain/data/c2e2_data.csv
pretrain/data/simcse_data.csv
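The JSON files listed above can be inspected before running anything else. The following is a minimal sketch that makes no assumption about their internal schema: it only reports the top-level type and the number of entries in each file. The helper name `summarize_json` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path

# Paths as listed in this README, relative to the repository root.
DATA_FILES = [
    "data/wiki_claims.json",
    "data/train_val_test_ids.json",
    "data/wiki/wiki_docs.json",
]


def summarize_json(path):
    """Load a JSON file and return (top-level type name, number of entries).

    Hypothetical helper: the schema of the files is not assumed here.
    """
    with open(path, encoding="utf-8") as f:
        obj = json.load(f)
    # Dicts and lists have a length; any other JSON value counts as one entry.
    size = len(obj) if isinstance(obj, (dict, list)) else 1
    return type(obj).__name__, size


if __name__ == "__main__":
    for name in DATA_FILES:
        if Path(name).exists():
            kind, size = summarize_json(name)
            print(f"{name}: {kind} with {size} entries")
        else:
            print(f"{name}: not found (run from the repository root)")
```

Opening the files with `encoding="utf-8"` matters here because the claims and documents are in Korean.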
- C2E2
cd pretrain
python ./train.py --input_df="c2e2_data.csv" --pos_neg="c2e2"
- SimCSE
cd pretrain
python ./train.py --input_df="simcse_data.csv" --pos_neg="simcse"
- The backbone model is fixed to KPFBERT in our implementation.
- You can obtain the KPFBERT-C2E2 pretrained checkpoint here.
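For readers unfamiliar with contrastive pretraining, the sketch below illustrates the general InfoNCE-style objective that SimCSE-like methods optimize: the anchor is pulled toward its positive and pushed away from its negatives. This is not the code in `train.py`. How C2E2 builds its positive and negative pairs is described in the paper and is not reproduced here. The function names and the temperature value of 0.05 are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for a single anchor (illustrative, not the repo's code).

    The loss is the negative log-probability of picking the positive among
    the positive plus all negatives, with similarities scaled by temperature.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    # A well-separated pair gives a loss near zero.
    print(info_nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0]]))
    # A confused pair (positive and negative swapped) gives a large loss.
    print(info_nce_loss(anchor, [0.0, 1.0], [[1.0, 0.0]]))
```

In an actual training loop the same quantity is normally computed in batches on the GPU; the pure-Python version here is only meant to make the objective concrete.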
python sentence_selection/embedding_based_similarity.py --split="test" --gpu_number=0 --checkpoints_dir="./pretrain/checkpoints/" --max_length=512 --model="kosimcse_kpfbert_c2e2" --model_name="kpfbert_c2e2_checkpoint.pt"
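As a rough illustration of embedding-based sentence selection, the sketch below ranks candidate sentences by cosine similarity to a claim and keeps the top `k`. It assumes the claim and sentence embeddings have already been computed by an encoder. The function `select_top_k` and its parameters are hypothetical and are not taken from `embedding_based_similarity.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(claim_vec, sentence_vecs, k=5):
    """Return up to k (sentence index, similarity) pairs, most similar first.

    Hypothetical helper: embeddings are assumed to be precomputed.
    """
    scored = [(idx, cosine(claim_vec, vec)) for idx, vec in enumerate(sentence_vecs)]
    # Sort by similarity in descending order; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    claim = [1.0, 0.0]
    sentences = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    for idx, score in select_top_k(claim, sentences, k=2):
        print(idx, round(score, 3))
```

The quality of this ranking depends entirely on the encoder that produces the embeddings, which is where a contrastively pretrained checkpoint comes in.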
For more details on the task and method, please take a look at the paper published in the Journal of KIISE (in Korean).
@article{송선영2023팩트체킹,
title={자동화 팩트체킹을 위한 대조학습 방법},
author={송선영 and 안제준 and 박건우},
journal={정보과학회논문지},
year={2023}
}