Create dataset loader for Belebele #7

Closed
SamuelCahyawijaya opened this issue Oct 29, 2023 · 3 comments
Labels: good first issue, help wanted

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Oct 29, 2023

Dataloader name: belebele/belebele.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?belebele

Dataset: belebele
Description: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
Subsets: ceb_Latn, ilo_Latn, ind_Latn, jav_Latn, kac_Latn, khm_Khmr, lao_Laoo, mya_Mymr, shn_Mymr, sun_Latn, tgl_Latn, tha_Thai, vie_Latn, war_Latn, zsm_Latn
Languages: ceb, ilo, ind, jav, kac, khm, lao, mya, shn, sun, tgl, tha, vie, war, zsm
Tasks: Question Answering
License: Creative Commons Attribution Non Commercial Share Alike 4.0 (cc-by-nc-sa-4.0)
Homepage: https://github.com/facebookresearch/belebele
HF URL: https://huggingface.co/datasets/facebook/belebele
Paper URL: https://arxiv.org/pdf/2308.16884v1.pdf
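
For anyone picking this up, here is a minimal sketch of how the subset names above could be split into language and script codes inside a loader. The subset list is copied from this issue; the helper names `split_subset` and `languages_from_subsets` are hypothetical and not part of any existing SEACrowd or Belebele API.

```python
from typing import List, Tuple

# Subset names copied from the "Subsets" field of this issue.
SEA_SUBSETS = [
    "ceb_Latn", "ilo_Latn", "ind_Latn", "jav_Latn", "kac_Latn",
    "khm_Khmr", "lao_Laoo", "mya_Mymr", "shn_Mymr", "sun_Latn",
    "tgl_Latn", "tha_Thai", "vie_Latn", "war_Latn", "zsm_Latn",
]


def split_subset(name: str) -> Tuple[str, str]:
    """Split a FLORES-200-style name such as 'khm_Khmr' into (language, script)."""
    lang, script = name.split("_", 1)
    return lang, script


def languages_from_subsets(subsets: List[str]) -> List[str]:
    """Return the distinct language codes, keeping the order of first appearance."""
    seen: List[str] = []
    for name in subsets:
        lang, _ = split_subset(name)
        if lang not in seen:
            seen.append(lang)
    return seen


if __name__ == "__main__":
    print(languages_from_subsets(SEA_SUBSETS))
```

Deriving the language codes from the subset names, rather than keeping a second hand-written list, is one way to keep the two fields from drifting apart.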
SamuelCahyawijaya converted this from a draft issue Oct 29, 2023
SamuelCahyawijaya added the "good first issue" and "help wanted" labels Oct 29, 2023
fajri91 moved this to Done in SEACrowd Data Hub Oct 30, 2023
fajri91 removed the status in SEACrowd Data Hub Oct 30, 2023
@gagan3012

Can I take this?

@holylovenia
Contributor

@gagan3012 Yes! You can assign yourself in the project and follow the guide here. 😄

@mnjkhtri
Contributor

mnjkhtri commented Nov 3, 2023

#self-assign

Projects
Status: Done
Development

No branches or pull requests

5 participants