Create dataset loader for Belebele #7

SamuelCahyawijaya · 2023-10-29T11:42:43Z

Dataset	belebele
Description	Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
Subsets	ceb_Latn, ilo_Latn, ind_Latn, jav_Latn, kac_Latn, khm_Khmr, lao_Laoo, mya_Mymr, shn_Mymr, sun_Latn, tgl_Latn, tha_Thai, vie_Latn, war_Latn, zsm_Latn
Languages	ceb, ilo, ind, jav, kac, khm, lao, mya, shn, sun, tgl, tha, vie, war, zsm
Tasks	Question Answering
License	Creative Commons Attribution Non Commercial Share Alike 4.0 (cc-by-nc-sa-4.0)
Homepage	https://github.com/facebookresearch/belebele
HF URL	https://huggingface.co/datasets/facebook/belebele
Paper URL	https://arxiv.org/pdf/2308.16884v1.pdf
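
As a starting point for whoever picks this up, here is a minimal sketch of the two pieces a loader needs: deriving the language code from a subset name, and flattening one record into a question-answering example. The field names (`flores_passage`, `question`, `mc_answer1` to `mc_answer4`, `correct_answer_num`) are assumed from the Hugging Face dataset card and should be checked against the actual data. The helper names and the output dictionary layout are hypothetical and are not the SEACrowd schema.

```python
from typing import Dict, List

# Subset names copied from the "Subsets" row above.
SUBSETS: List[str] = [
    "ceb_Latn", "ilo_Latn", "ind_Latn", "jav_Latn", "kac_Latn",
    "khm_Khmr", "lao_Laoo", "mya_Mymr", "shn_Mymr", "sun_Latn",
    "tgl_Latn", "tha_Thai", "vie_Latn", "war_Latn", "zsm_Latn",
]


def language_code(subset: str) -> str:
    """Return the language part of a '<lang>_<Script>' subset name."""
    lang, _script = subset.split("_", 1)
    return lang


def to_qa_example(row: Dict[str, str]) -> Dict[str, object]:
    """Flatten one record into a simple multiple-choice QA example.

    Assumption: `correct_answer_num` is a 1-based index stored as a string.
    """
    choices = [row[f"mc_answer{i}"] for i in range(1, 5)]
    answer_index = int(row["correct_answer_num"]) - 1
    return {
        "context": row["flores_passage"],
        "question": row["question"],
        "choices": choices,
        "answer": choices[answer_index],
    }


if __name__ == "__main__":
    # Made-up record for illustration only; not taken from the dataset.
    sample = {
        "flores_passage": "The river floods every spring.",
        "question": "When does the river flood?",
        "mc_answer1": "In winter",
        "mc_answer2": "In autumn",
        "mc_answer3": "In spring",
        "mc_answer4": "Never",
        "correct_answer_num": "3",
    }
    print(sorted({language_code(s) for s in SUBSETS}))
    print(to_qa_example(sample))
```

Keeping the record conversion separate from the download step means it can be unit-tested without network access. The same function can then be applied to every subset, since the dataset is fully parallel.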

gagan3012 · 2023-10-31T16:22:36Z

Can I take this?

holylovenia · 2023-11-01T00:24:21Z

@gagan3012 Yes! You can assign yourself in the project and follow the guide here. 😄

mnjkhtri · 2023-11-03T07:03:47Z

#self-assign

SamuelCahyawijaya added this to SEACrowd Data Hub Oct 29, 2023

SamuelCahyawijaya converted this from a draft issue Oct 29, 2023

SamuelCahyawijaya added the good first issue (Good for newcomers) and help wanted (Extra attention is needed) labels Oct 29, 2023

fajri91 assigned jcblaisecruz02 and holylovenia Oct 30, 2023

fajri91 moved this to Done in SEACrowd Data Hub Oct 30, 2023

fajri91 unassigned jcblaisecruz02 and holylovenia Oct 30, 2023

fajri91 removed the status in SEACrowd Data Hub Oct 30, 2023

github-actions bot assigned mnjkhtri Nov 3, 2023

jamesjaya closed this as completed in 7a4e7cc Nov 16, 2023

github-project-automation bot moved this to Done in SEACrowd Data Hub Nov 16, 2023
