Closes #84 | Implement CulturaX Dataloader #98

akhdanfadh · 2023-11-21T10:25:41Z

Closes #84

I implemented one config per language subset, so config names look like this: culturax_id_source, culturax_jv_seacrowd_ssp, etc. When testing, pass culturax_<subset> to the --subset_id parameter.
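To make the naming scheme concrete, here is a minimal sketch of how such config names could be generated. It is not the code from culturax.py: the language list is a hypothetical two-item sample (only id and jv are named in this PR), and the helper names config_names and subset_id are made up for illustration.

```python
from typing import List

_DATASETNAME = "culturax"

# Hypothetical sample: only "id" and "jv" are mentioned in this PR.
_LANGUAGES = ["id", "jv"]

# Schema suffixes as they appear in the example config names above.
_SCHEMAS = ["source", "seacrowd_ssp"]


def config_names(languages: List[str] = _LANGUAGES) -> List[str]:
    """Build one config name per (language, schema) pair."""
    return [
        f"{_DATASETNAME}_{lang}_{schema}"
        for lang in languages
        for schema in _SCHEMAS
    ]


def subset_id(lang: str) -> str:
    """Build the value passed to --subset_id for one language subset."""
    return f"{_DATASETNAME}_{lang}"


if __name__ == "__main__":
    print(config_names())
    print(subset_id("jv"))
```

Under these assumptions, testing the Javanese subset would presumably mean running the test suite from the checklist below with --subset_id culturax_jv.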

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in the dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm that the dataloader script works with the datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
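Two of the checklist items (the naming rule and the BUILDER_CONFIGS requirement) can be checked mechanically. The sketch below is only an illustration: StandInConfig is a hypothetical stand-in for SEACrowdConfig carrying just the two fields used here, and the regular expression reflects one assumed reading of "lowercase and underscore" that also allows digits after the first letter.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class StandInConfig:
    """Hypothetical stand-in for SEACrowdConfig (only the fields used below)."""

    name: str
    schema: str


def is_valid_dataset_name(name: str) -> bool:
    """Assumed rule: a lowercase letter, then lowercase letters, digits, or underscores."""
    return re.fullmatch(r"[a-z][a-z0-9_]*", name) is not None


def has_required_schemas(configs: List[StandInConfig]) -> bool:
    """True if at least one source config and one seacrowd config are present."""
    has_source = any(c.schema == "source" for c in configs)
    has_seacrowd = any(c.schema.startswith("seacrowd_") for c in configs)
    return has_source and has_seacrowd


if __name__ == "__main__":
    configs = [
        StandInConfig(name="culturax_id_source", schema="source"),
        StandInConfig(name="culturax_id_seacrowd_ssp", schema="seacrowd_ssp"),
    ]
    print(is_valid_dataset_name("culturax"))
    print(has_required_schemas(configs))
```

A check like this does not replace the test suite; it only mirrors the two structural items so they can be verified before opening a PR.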

ljvmiranda921

Thanks for this PR, @akhdanfadh! 😄

I loaded each subset and had no issues with the data loader. Note for other reviewers: downloading the dataset might take a long time because each subset is more than a gigabyte in size.

Just a few small comments and suggestions on formatting, but functionally, the data loader works.

seacrowd/sea_datasets/culturax/culturax.py

- remove parquet article reference
- newline list comprehension

ljvmiranda921

LGTM, awesome work @akhdanfadh :) Let's just wait for the other reviewer's comments :)

seacrowd/sea_datasets/culturax/culturax.py

akhdanfadh · 2023-11-25T17:44:06Z

Done, should be ready to merge @SamuelCahyawijaya

SamuelCahyawijaya

LGTM!

akhdanfadh added 3 commits November 20, 2023 21:54

implement culturax dataloader

b57c834

Merge remote-tracking branch 'origin/master'

8c2924a

remove __main__ | minor naming changes

0379b42

akhdanfadh requested review from holylovenia, SamuelCahyawijaya, fajri91 and afaji as code owners November 21, 2023 10:25

ljvmiranda921 self-assigned this Nov 23, 2023

ljvmiranda921 reviewed Nov 24, 2023

holylovenia requested review from ljvmiranda921 and removed request for afaji, fajri91 and holylovenia November 24, 2023 02:44

holylovenia assigned SamuelCahyawijaya Nov 24, 2023

minor changes

006edc6

- remove parquet article reference
- newline list comprehension

akhdanfadh requested review from sabilmakbar, jamesjaya, ryanignatius, yongzx and gentaiscool as code owners November 24, 2023 12:51

ljvmiranda921 approved these changes Nov 25, 2023

akhdanfadh commented Nov 25, 2023

seacrowd/sea_datasets/culturax/culturax.py (outdated)

add licenses.others.value

adbbe7c

SamuelCahyawijaya approved these changes Nov 27, 2023

SamuelCahyawijaya merged commit 4f68406 into SEACrowd:master Nov 27, 2023
1 check passed

akhdanfadh deleted the culturax branch November 27, 2023 11:10
