- Source code and data used in the papers:
- ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities (Lerner et al., SIGIR’22)
- Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering (Lerner et al., ECIR'23)
- Cross-modal Retrieval for Knowledge-based Visual Question Answering (Lerner et al., ECIR'24)
See also the MEERQAT project.
The data is provided in two formats: HF’s datasets (based on Apache Arrow) and plain-text JSONL files (one JSON object per line). Both formats can be used in the same way, since datasets parses objects into Python dicts (see below); however, our code only supports (and is heavily based upon) datasets. Images are distributed separately, in standard formats (e.g. jpg).
Here’s how to get the images grounding the questions of the dataset:
# get the images. TODO integrate this in a single dataset
git clone https://huggingface.co/datasets/PaulLerner/viquae_images
# to get ALL images (dataset+KB) use https://huggingface.co/datasets/PaulLerner/viquae_all_images instead
cd viquae_images
# in viquae_all_images, the archive is split into parts of 5GB
# cat parts/* > images.tar.gz
tar -xzvf images.tar.gz
export VIQUAE_IMAGES_PATH=$PWD/images
Alternatively, you can download images from Wikimedia Commons using meerqat.data.kilt2vqa download (see below).
If you don’t want to use datasets, you can get the data directly from https://huggingface.co/datasets/PaulLerner/viquae_dataset (e.g. git clone https://huggingface.co/datasets/PaulLerner/viquae_dataset).
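For instance, once the repository is cloned, each line of a JSONL file parses into the same kind of dict as a datasets item. A minimal sketch, assuming Python's standard json module and a file named test.jsonl (the actual file names in the repository may differ, adjust accordingly):
>>> import json
>>> # one JSON object (i.e. one question) per line
>>> with open('viquae_dataset/test.jsonl') as f:
...     items = [json.loads(line) for line in f]
>>> type(items[0])
<class 'dict'>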
The dataset format largely follows KILT. Here I’ll describe the dataset without pre-computed features; pre-computed features are essentially the output of each step described in EXPERIMENTS.rst.
In [1]: from datasets import load_dataset
...: dataset = load_dataset('PaulLerner/viquae_dataset')
In [2]: dataset
Out[2]:
DatasetDict({
train: Dataset({
features: ['image', 'input', 'kilt_id', 'id', 'meta', 'original_question', 'output', 'url', 'wikidata_id'],
num_rows: 1190
})
validation: Dataset({
features: ['image', 'input', 'kilt_id', 'id', 'meta', 'original_question', 'output', 'url', 'wikidata_id'],
num_rows: 1250
})
test: Dataset({
features: ['image', 'input', 'kilt_id', 'id', 'meta', 'original_question', 'output', 'url', 'wikidata_id'],
num_rows: 1257
})
})
In [3]: item = dataset['test'][0]
# this is now a dict, like the JSON object loaded from the JSONL files
In [4]: type(item)
Out[4]: dict
# url of the grounding image
In [5]: item['url']
Out[5]: 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Jackie_Wilson.png/512px-Jackie_Wilson.png'
# file name of the grounding image as stored in $VIQUAE_IMAGES_PATH
In [6]: item['image']
Out[6]: '512px-Jackie_Wilson.png'
# you can thus load the image from $VIQUAE_IMAGES_PATH/item['image']
# meerqat.data.loading.load_image_batch does that for you
In [7]: from meerqat.data.loading import load_image_batch
# fake batch of size 1
In [8]: image = load_image_batch([item['image']])[0]
# it returns a PIL Image; all images have been resized to a width of 512 pixels
In [9]: type(image), image.size
Out[9]: (PIL.Image.Image, (512, 526))
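# if you prefer not to depend on meerqat for this step, the same image can be
# opened with plain PIL (a sketch, assuming $VIQUAE_IMAGES_PATH is set as above)
>>> import os
>>> from PIL import Image
>>> image = Image.open(os.path.join(os.environ['VIQUAE_IMAGES_PATH'], item['image']))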
# question string
In [10]: item['input']
Out[10]: "this singer's re-issued song became the UK Christmas number one after helping to advertise what brand?"
# answer string
In [11]: item['output']['original_answer']
Out[11]: "Levi's"
# processing the data:
In [12]: dataset = dataset.map(my_function)
# for a given split, this is almost the same as the plain loop below
# (which shows how to adapt the code if you don’t want to use the `datasets` library)
In [13]: for item in dataset['test']:
    ...:     my_function(item)
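For illustration, a hypothetical my_function (not part of the library) could add a lower-cased copy of the question; the fields returned by the function are added to the dataset by map, and the same function works on plain dicts loaded from the JSONL files:
>>> def my_function(item):
...     # add a lower-cased copy of the question string
...     item['lower_input'] = item['input'].lower()
...     return item
>>> dataset = dataset.map(my_function)  # as in In [12] above
>>> dataset['test'][0]['lower_input']
"this singer's re-issued song became the uk christmas number one after helping to advertise what brand?"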
Again, the format of the KB is very similar to KILT’s Wikipedia so I will not describe all fields exhaustively.
# again you can also clone directly from https://huggingface.co/datasets/PaulLerner/viquae_wikipedia to get the raw data
>>> data_files = dict(
humans_with_faces='humans_with_faces.jsonl.gz',
humans_without_faces='humans_without_faces.jsonl.gz',
non_humans='non_humans.jsonl.gz'
)
>>> kb = load_dataset('PaulLerner/viquae_wikipedia', data_files=data_files)
>>> kb
DatasetDict({
humans_with_faces: Dataset({
features: ['anchors', 'categories', 'image', 'kilt_id', 'text', 'url', 'wikidata_info', 'wikipedia_id', 'wikipedia_title'],
num_rows: 506237
})
humans_without_faces: Dataset({
features: ['anchors', 'categories', 'image', 'kilt_id', 'text', 'url', 'wikidata_info', 'wikipedia_id', 'wikipedia_title'],
num_rows: 35736
})
non_humans: Dataset({
features: ['anchors', 'categories', 'image', 'kilt_id', 'text', 'url', 'wikidata_info', 'wikipedia_id', 'wikipedia_title'],
num_rows: 953379
})
})
>>> item = kb['humans_with_faces'][0]
>>> item['wikidata_info']['wikidata_id'], item['wikidata_info']['wikipedia_title']
('Q313590', 'Alain Connes')
# file name of the reference image as stored in $VIQUAE_IMAGES_PATH
# you can use meerqat.data.loading.load_image_batch like above
>>> item['image']
'512px-Alain_Connes.jpg'
# the text is stored in a list of strings, one per paragraph
>>> type(item['text']['paragraph']), len(item['text']['paragraph'])
(list, 25)
>>> item['text']['paragraph'][1]
"Alain Connes (; born 1 April 1947) is a French mathematician, \
currently Professor at the Collège de France, IHÉS, Ohio State University and Vanderbilt University. \
He was an Invited Professor at the Conservatoire national des arts et métiers (2000).\n"
# you might want to concatenate these three datasets to get a single dataset (e.g. to split the articles into passages)
>>> from datasets import concatenate_datasets
>>> kb['humans_with_faces'] = kb['humans_with_faces'].map(lambda item: {'is_human': True})
>>> kb['humans_without_faces'] = kb['humans_without_faces'].map(lambda item: {'is_human': True})
>>> kb['non_humans'] = kb['non_humans'].map(lambda item: {'is_human': False})
>>> kb_recat = concatenate_datasets([kb['non_humans'], kb['humans_with_faces'], kb['humans_without_faces']])
>>> kb_recat.save_to_disk('data/viquae_wikipedia_recat')
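The saved KB can later be reloaded with the standard datasets API, e.g.:
>>> from datasets import load_from_disk
>>> kb_recat = load_from_disk('data/viquae_wikipedia_recat')
>>> kb_recat.num_rows  # 953379 + 506237 + 35736
1495352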
To format the articles into text passages, follow the instructions in EXPERIMENTS.rst (Preprocessing passages section). Alternatively, get them from https://huggingface.co/datasets/PaulLerner/viquae_v4-alpha_passages (load_dataset('PaulLerner/viquae_v4-alpha_passages')).
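If you just want a rough idea of what such passages look like, here is a naive sketch that greedily packs consecutive paragraphs into chunks of roughly 100 words; it is only an illustration, not the actual preprocessing described in EXPERIMENTS.rst:
>>> def naive_passages(article, max_words=100):
...     # greedily pack consecutive paragraphs; a chunk is closed once it reaches max_words
...     passages, current, n_words = [], [], 0
...     for paragraph in article['text']['paragraph']:
...         current.append(paragraph)
...         n_words += len(paragraph.split())
...         if n_words >= max_words:
...             passages.append(''.join(current))
...             current, n_words = [], 0
...     if current:
...         passages.append(''.join(current))
...     return passages
>>> passages = naive_passages(kb['humans_with_faces'][0])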
WIT (Srinivasan et al., http://arxiv.org/abs/2103.01913) is available at https://github.com/google-research-datasets/wit. (If you happen to have access to Jean Zay, it is also available at $DSDIR/WIT.) Follow the instructions at meerqat.data.wit (see meerqat.data.wit.html) or get it from https://huggingface.co/datasets/PaulLerner/wit_for_mict (load_dataset('PaulLerner/wit_for_mict')).
Please refer to ANNOTATION.md for the annotation instructions.
Please refer to EXPERIMENTS.rst for instructions to reproduce our experiments.
If you use the ViQuAE dataset or KB, please cite:
@inproceedings{lerner2022viquae,
    author = {Paul Lerner and Olivier Ferret and Camille Guinaudeau and Le Borgne, Hervé and Romaric Besançon and Moreno, Jose G and Lovón Melgarejo, Jesús},
    year = {2022},
    title = {{ViQuAE}, a Dataset for Knowledge-based Visual Question Answering about Named Entities},
    booktitle = {Proceedings of The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    series = {SIGIR’22},
    URL = {https://hal.archives-ouvertes.fr/hal-03650618},
    DOI = {10.1145/3477495.3531753},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA}
}
If you use this code for multimodal information retrieval, early fusion, or Inverse Cloze Task pre-training, please cite:
@inproceedings{lerner2023ict,
    TITLE = {{Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering}},
    AUTHOR = {Lerner, Paul and Ferret, Olivier and Guinaudeau, Camille},
    URL = {https://hal.science/hal-03933089},
    BOOKTITLE = {{European Conference on Information Retrieval (ECIR 2023)}},
    ADDRESS = {Dublin, Ireland},
    YEAR = {2023},
    MONTH = Apr,
    KEYWORDS = {Visual Question Answering ; Pre-training ; Multimodal Fusion},
    PDF = {https://hal.science/hal-03933089v2/file/ecir-2023-vf-authors.pdf},
    HAL_ID = {hal-03933089},
    HAL_VERSION = {v2},
}
If you use this code for mono- or cross-modal information retrieval with CLIP or fine-tuning CLIP, please cite:
@unpublished{lerner2024cross,
    TITLE = {{Cross-modal Retrieval for Knowledge-based Visual Question Answering}},
    AUTHOR = {Lerner, Paul and Ferret, Olivier and Guinaudeau, Camille},
    URL = {https://hal.science/hal-04384431},
    NOTE = {Accepted at ECIR 2024},
    YEAR = {2024},
    MONTH = Jan,
    KEYWORDS = {Visual Question Answering ; Multimodal ; Cross-modal Retrieval ; Named Entities},
    PDF = {https://hal.science/hal-04384431/file/camera_ecir_2024_cross_modal_arXiv.pdf},
    HAL_ID = {hal-04384431},
    HAL_VERSION = {v1},
}
Install PyTorch 1.9.0 following the official documentation for your distribution (preferably in a virtual environment).
Also install ElasticSearch (and run it) or pyserini if you want to do sparse retrieval.
The rest should be installed using pip:
$ git clone https://github.com/PaulLerner/ViQuAE.git
$ pip install -e ViQuAE
$ python
>>> import meerqat
To build the docs locally, run:
$ sphinx-apidoc -o source_docs/ -f -e -M meerqat
$ sphinx-build -b html source_docs/ docs/