Please take the following steps to get the benchmark dataset.
Please log in to BioASQ, then do the following:
- In Datasets for task a, download allMeSH_2022.zip through the Training v.2022 (txt) entry in the table. Unzip it to raw_data/bioasq/allMeSH_2022.json. Note that this JSON file is 27 GB in size, so please make sure you have enough disk space.
- In Datasets for task b, download the files through the links in the Test data column of the table, for the years 2014 to 2023. Unzip the files and put the JSON files into the folder raw_data/bioasq:
  - {2~9}B{1~5}_golden.json
  - 10B{1~6}_golden.json
  - 11B{1~4}_golden.json
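Before moving on, it can help to confirm that every expected file is in place. The following is a minimal sketch, not part of the original instructions: it expands the file name patterns listed above and reports which files are missing from raw_data/bioasq, along with the free disk space. The function names (expected_bioasq_files, missing_files, free_gib) are hypothetical helpers introduced for illustration.

```python
from pathlib import Path
import shutil


def expected_bioasq_files():
    """Expand the file name patterns listed above into concrete names."""
    names = ["allMeSH_2022.json"]
    # Number of batches per BioASQ edition, as listed above:
    # editions 2-9 have 5 batches, edition 10 has 6, edition 11 has 4.
    batches = {edition: 5 for edition in range(2, 10)}
    batches[10] = 6
    batches[11] = 4
    for edition, count in sorted(batches.items()):
        names += [f"{edition}B{batch}_golden.json" for batch in range(1, count + 1)]
    return names


def missing_files(folder):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected_bioasq_files() if not (folder / name).is_file()]


def free_gib(path="."):
    """Free space, in GiB, on the file system that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    print(f"Free disk space: {free_gib():.1f} GiB")
    missing = missing_files("raw_data/bioasq")
    print(f"{len(missing)} expected file(s) missing")
    for name in missing:
        print("  ", name)
```

Run it from the repository root, so that the relative path raw_data/bioasq resolves correctly.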
Download the LoTTE corpus here: https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz and extract it into the folder raw_data.
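If you prefer to script this step, the sketch below downloads the archive and extracts it with the Python standard library only. It is an illustration and not part of the repository: extract_archive and download_and_extract are hypothetical helper names, and the layout inside the archive is not assumed here, so check the contents of raw_data afterwards.

```python
import tarfile
import urllib.request
from pathlib import Path

LOTTE_URL = "https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz"


def extract_archive(archive_path, dest="raw_data"):
    """Extract a .tar.gz archive into `dest` and list the top-level entries of `dest`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


def download_and_extract(url=LOTTE_URL, dest="raw_data"):
    """Download the archive into `dest`, then extract it there."""
    archive_path = Path(dest) / "lotte.tar.gz"
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    # The archive is large, so this call can take a while.
    urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, dest)
```

Calling download_and_extract() from the repository root places the extracted files under raw_data. The downloaded lotte.tar.gz is left in the same folder and can be deleted afterwards.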
Get access to the NovelQA dataset: https://huggingface.co/datasets/NovelQA/NovelQA . Then log in to your Hugging Face account:
pip install huggingface_hub
huggingface-cli login
Run the following script to process the benchmark dataset into the folder processed_data:
sh data_process.sh