This repository provides the code for 🌟HerO🌟, the runner-up 🏃 for the AVeriTeC shared task.
The system description paper is published in the proceedings of the 7th FEVER workshop (co-located with EMNLP 2024) [[paper](https://aclanthology.org/2024.fever-1.15)].
- The AVeriTeC task is to verify a real-world claim by retrieving evidence from the web. Given a claim and its metadata, a system needs to retrieve evidence that supports and/or refutes the claim, either from the Web or from the document collection provided along with the dataset.
- This code is for our fact-checking pipeline that utilizes open large language models for the shared task hosted by the 7th FEVER workshop (co-located with EMNLP). For more details about the task and dataset, please refer to the shared task paper.
- HerO, the herd of open large language models for real-world claims, is our pipelined system for verifying real-world claims.
- 🎉 Our system achieved 2nd place in the shared task! Since the winning system relies on GPT-4o in its pipeline, HerO is the best-performing system among those built solely on open LLMs.
- The above figure illustrates our system's inference pipeline. We configure three modules using only open LLMs: evidence retrieval, question generation, and veracity prediction.
- Evidence retrieval: We implement a two-stage retrieval pipeline using BM25 and SFR-Embedding-2_R. We expand the query by prompting an LLM to generate hypothetical fact-checking documents (a sketch of this idea follows the list).
- Question generation: We use an LLM to generate a verifying question for an answer candidate. We improve the baseline prompt by using the claim as an additional context.
- Veracity prediction: We fully fine-tune an LLM to generate justifications and verdicts.
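To make the query-expansion idea concrete, here is a minimal sketch of HyDE-style expansion. The model choice, prompt wording, and sampling settings are illustrative assumptions, not the exact configuration of `hyde_fc_generation.py`:

```python
# Sketch of HyDE-style query expansion for evidence retrieval.
# Model, prompt, and sampling settings are placeholders; see
# hyde_fc_generation.py for the actual configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=512)

claim = "Example claim to verify."
prompt = (
    "Write a fact-checking article passage that supports or refutes the claim.\n"
    f"Claim: {claim}\n"
    "Passage:"
)

# The generated hypothetical document is concatenated with the claim to
# form an expanded query for BM25 retrieval and dense reranking.
hypothetical_doc = llm.generate([prompt], params)[0].outputs[0].text
expanded_query = f"{claim} {hypothetical_doc}"
```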
The model checkpoints and instruction datasets are available on the Hugging Face Hub 🤗
We fine-tune 8B and 70B models for veracity prediction.
- [humane-lab/Meta-Llama-3.1-8B-HerO](https://huggingface.co/humane-lab/Meta-Llama-3.1-8B-HerO) is our fine-tuned 8B model for veracity prediction and justification generation. We use Meta-Llama-3.1-8B as the base model.
- [humane-lab/Meta-Llama-3.1-70B-HerO](https://huggingface.co/humane-lab/Meta-Llama-3.1-70B-HerO) is our fine-tuned 70B model for veracity prediction and justification generation. We use Meta-Llama-3.1-70B as the base model.
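Both checkpoints load with the standard 🤗 Transformers API; a minimal sketch (device placement is an illustrative choice):

```python
# Load a HerO checkpoint from the Hugging Face Hub with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "humane-lab/Meta-Llama-3.1-8B-HerO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```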
We created our fine-tuning dataset using our own prompts, combined with the AVeriTeC justifications and verdicts, to train the veracity prediction model.
- [humane-lab/AVeriTeC-HerO](https://huggingface.co/datasets/humane-lab/AVeriTeC-HerO) is our training dataset for the veracity prediction and justification generation model. We modify the AVeriTeC dataset so it can be used for instruction training.
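Assuming the dataset is stored in a 🤗 Datasets-compatible format (check the dataset card for the actual splits and columns), it can be pulled directly from the Hub:

```python
# Fetch the HerO instruction dataset from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("humane-lab/AVeriTeC-HerO")
print(dataset)  # inspect the available splits and columns
```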
```bash
git clone https://github.com/ssu-humane/HerO.git
cd HerO
pip install -r requirements.txt
```
Download the AVeriTeC dataset and place it in the `data_store/averitec` directory. More details can be found in `data_store/averitec/README.md`.
```bash
python hyde_fc_generation.py --target_data "data_store/averitec/dev.json" --json_output "data_store/dev_hyde_fc.json"
python retrieval.py --knowledge_store_dir "knowledge_store/dev" --target_data "data_store/dev_hyde_fc.json" --json_output "data_store/dev_retrieval_top_k.json"
python reranking.py --target_data "data_store/dev_retrieval_top_k.json" --json_output "data_store/dev_reranking_top_k.json"
```
The evidence retrieval pipeline takes about 6 hours on two H100 GPUs.
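Conceptually, the reranking stage embeds the expanded query and the BM25 candidates with SFR-Embedding-2_R and keeps the closest passages. A hedged sketch; loading the model via sentence-transformers and plain cosine scoring are assumptions, and the actual logic lives in `reranking.py`:

```python
# Sketch of the dense reranking stage: embed the expanded query and the
# BM25 candidates, then keep the most similar passages.
# Loading SFR-Embedding-2_R through sentence-transformers is an assumption;
# the repository script may use the raw transformers API instead.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("Salesforce/SFR-Embedding-2_R")

query = "expanded claim text ..."
candidates = ["passage 1 ...", "passage 2 ...", "passage 3 ..."]

q_emb = encoder.encode(query, convert_to_tensor=True)
c_embs = encoder.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_embs)[0]

top_k = scores.argsort(descending=True)[:2]  # keep the 2 best candidates
reranked = [candidates[i] for i in top_k]
```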
```bash
python question_generation.py --reference_corpus "data_store/averitec/train.json" --top_k_target_knowledge "data_store/dev_reranking_top_k.json" --output_questions "data_store/dev_top_k_qa.json" --model "meta-llama/Meta-Llama-3-8B-Instruct"
```
Question generation for the dev set (8B LLM) takes about 25 minutes on two H100 GPUs.
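The improvement over the baseline prompt is to condition question generation on the claim in addition to the answer candidate. An illustrative layout follows; the prompt wording is a placeholder, and the real template and any few-shot examples are in `question_generation.py`:

```python
# Illustrative question-generation prompt: the claim is added as context
# so the generated question targets the retrieved answer candidate.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

claim = "Example claim to verify."
answer_candidate = "A retrieved evidence passage."

prompt = (
    "Generate a question that the answer addresses, "
    "in the context of fact-checking the claim.\n"
    f"Claim: {claim}\n"
    f"Answer: {answer_candidate}\n"
    "Question:"
)
question = generator(prompt, max_new_tokens=64)[0]["generated_text"]
```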
```bash
python veracity_prediction.py --target_data "data_store/dev_top_k_qa.json" --output_file "data_store/dev_veracity_prediction.json" --model "humane-lab/Meta-Llama-3.1-70B-HerO"
```
Veracity prediction for the dev set (70B fine-tuned LLM) takes about 12 minutes on two H100 GPUs.
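At inference time the fine-tuned model reads the claim together with the generated question-answer pairs and emits a justification followed by a verdict. A minimal sketch; the prompt layout below is a hypothetical placeholder, and `veracity_prediction.py` defines the actual input format:

```python
# Minimal inference sketch for veracity prediction with the fine-tuned
# 70B model. The prompt layout is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "humane-lab/Meta-Llama-3.1-70B-HerO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Claim: Example claim to verify.\n"
    "Q1: ...? A1: ...\n"
    "Provide a justification and predict the verdict.\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```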
```bash
python averitec_evaluate.py --prediction_file "data_store/dev_veracity_prediction.json" --label_file "data_store/averitec/dev.json"
```
You can also evaluate on the hidden test set at https://eval.ai/web/challenges/challenge-page/2285/overview
The code and dataset are shared under CC BY-NC 4.0.
```bibtex
@inproceedings{yoon-etal-2024-hero,
    title = "{H}er{O} at {AV}eri{T}e{C}: The Herd of Open Large Language Models for Verifying Real-World Claims",
    author = "Yoon, Yejun and
      Jung, Jaeyoon and
      Yoon, Seunghyun and
      Park, Kunwoo",
    booktitle = "Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)",
    month = nov,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.fever-1.15",
    pages = "130--136",
}
```