Create dataset loader for MaXM #425

Closed
SamuelCahyawijaya opened this issue Feb 13, 2024 · 4 comments · Fixed by #554
Labels
pr-ready A PR that closes this issue is Ready to be reviewed

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: maxm/maxm.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?maxm

Dataset maxm
Description MaXM is a test-only VQA benchmark in 7 diverse languages, including Thai. The dataset is generated by first applying a translation-based framework to mVQA and then applying this framework to the multilingual captions in the Crossmodal-3600 dataset.
Subsets MaXM v1 -th
Languages tha
Tasks Question Answering
License Other (other)
Homepage https://github.com/google-research-datasets/maxm
HF URL -
Paper URL https://aclanthology.org/2023.findings-emnlp.176
@akhdanfadh
Collaborator

Hi, the dataset is organized as follows:

dataset                 str: dataset name
version                 str: dataset version
split                   str: language ID
annotations             List of image-question-answers triplets, each of which is
-- image_id             str: image ID
-- image_url            str: image URL
-- qa_pairs             List of question-answer pairs, each of which is
---- question_id        str: question ID
---- question           str: raw question
---- answers            List of str: ground-truth answers
---- processed_answers  List of str: processed ground-truth answers. 16 tokenized answers.
---- is_collection      bool: "true" if the question is of the "Collection" type; "false" otherwise.
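
As a rough illustration of the layout above, here is a minimal Python sketch that flattens the nested annotations into one row per question. The sample record, its values, and the helper name `iter_qa_rows` are made up for illustration; only the field names come from the description above.

```python
import json

# Hypothetical sample that follows the field layout described above.
# All values (IDs, URL, questions, answers) are invented for illustration.
SAMPLE_JSON = """
{
  "dataset": "maxm",
  "version": "v1",
  "split": "th",
  "annotations": [
    {
      "image_id": "img_0001",
      "image_url": "https://example.com/img_0001.jpg",
      "qa_pairs": [
        {
          "question_id": "q_0001",
          "question": "What color is the car?",
          "answers": ["red"],
          "processed_answers": ["red"],
          "is_collection": false
        },
        {
          "question_id": "q_0002",
          "question": "What objects are on the table?",
          "answers": ["cup", "plate"],
          "processed_answers": ["cup", "plate"],
          "is_collection": true
        }
      ]
    }
  ]
}
"""


def iter_qa_rows(data):
    """Yield one flat dict per question-answer pair."""
    for annotation in data["annotations"]:
        for qa in annotation["qa_pairs"]:
            yield {
                "split": data["split"],
                "image_id": annotation["image_id"],
                "image_url": annotation["image_url"],
                "question_id": qa["question_id"],
                "question": qa["question"],
                "answers": qa["answers"],
                "processed_answers": qa["processed_answers"],
                "is_collection": qa["is_collection"],
            }


rows = list(iter_qa_rows(json.loads(SAMPLE_JSON)))
for row in rows:
    print(row["question_id"], row["image_id"], row["is_collection"])
```

Each image can carry several questions, so the image-level fields are repeated on every row.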

In the question answering schema, the features are:

id             (str)
question_id    (str)
document_id    (str)
question       (str)
type           (str)
choices        (list[str])
context        (str)
answer         (list[str])
meta           (dict[Any])
  1. Should I assign is_collection to type, context, or inside meta?
  2. Also, should I put image_id or image_url for the document_id?

github-actions bot commented Mar 1, 2024

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@akhdanfadh
Collaborator

Hmm, I think I need to mention you for a faster response: @sabilmakbar @holylovenia

@holylovenia
Contributor

I didn't realize I missed so many mentions from you. 😭 Sorry!!

Could you please use Tasks.VISUAL_QUESTION_ANSWERING? It employs the imqa schema.

  1. Should I assign is_collection to type, context, or inside meta?

Inside meta would be perfect. type is typically open-ended, multiple-choice, extractive, abstractive, etc.

  2. Also, should I put image_id or image_url for the document_id?

document_id is related to the context (if there is one).
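
To make the suggestion concrete, here is a minimal sketch of one way a single MaXM question-answer pair could be mapped, with is_collection stored inside meta. The exact imqa field names are not shown in this thread, so the sketch reuses the question answering features listed earlier. The helper name `to_qa_example`, the "open-ended" type value, the empty context, the use of image_id as document_id, and the sample values are all assumptions for illustration, not what the merged loader necessarily does.

```python
def to_qa_example(idx, image_id, image_url, qa):
    """Map one MaXM question-answer pair to a flat QA-style example."""
    return {
        "id": str(idx),
        "question_id": qa["question_id"],
        # Assumption: MaXM has no text context, so image_id is only a placeholder.
        "document_id": image_id,
        "question": qa["question"],
        # Assumption: the thread does not say which type value applies to MaXM.
        "type": "open-ended",
        "choices": [],
        "context": "",
        "answer": qa["answers"],
        # As suggested above, is_collection goes inside meta.
        "meta": {
            "is_collection": qa["is_collection"],
            "image_url": image_url,
            "processed_answers": qa["processed_answers"],
        },
    }


# Hypothetical question-answer pair with invented values.
sample_qa = {
    "question_id": "q_0002",
    "question": "What objects are on the table?",
    "answers": ["cup", "plate"],
    "processed_answers": ["cup", "plate"],
    "is_collection": True,
}

example = to_qa_example(0, "img_0001", "https://example.com/img_0001.jpg", sample_qa)
print(example["meta"]["is_collection"])
```

Keeping dataset-specific flags in meta leaves the shared fields comparable across loaders.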

@akhdanfadh akhdanfadh added the in-progress Assignee has given confirmation on progress and ETA label Mar 29, 2024
@akhdanfadh akhdanfadh added pr-ready A PR that closes this issue is Ready to be reviewed and removed in-progress Assignee has given confirmation on progress and ETA labels Mar 29, 2024
Projects
Status: Done
3 participants