Refactor FLORES-200 script #733

SamuelCahyawijaya · 2024-12-07T01:30:15Z

Refactor the FLORES-200 script: change the source schema and update the script to load the new FLORES-200 v2.0 from openlanguagedata/flores_plus.

Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples:

Title: Update FLORES-200

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
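
As a rough aid for the metadata item in the checklist above, the sketch below reports which of the listed variables are absent from a dataloader module's namespace. The helper name `missing_metadata` is hypothetical and not part of the SEACrowd test suite; it only checks that the names exist, not that their values are sensible.

```python
from typing import Any, Dict, List

# Variable names copied from the checklist item above.
REQUIRED_VARIABLES: List[str] = [
    "_CITATION",
    "_DATASETNAME",
    "_DESCRIPTION",
    "_HOMEPAGE",
    "_LICENSE",
    "_LOCAL",
    "_URLs",
    "_SUPPORTED_TASKS",
    "_SOURCE_VERSION",
    "_SEACROWD_VERSION",
]


def missing_metadata(namespace: Dict[str, Any]) -> List[str]:
    """Return the required variable names that are not defined in `namespace`.

    Hypothetical helper for illustration; pass e.g. `vars(module)` for an
    imported dataloader module.
    """
    return [name for name in REQUIRED_VARIABLES if name not in namespace]
```

For an imported dataloader module `mod`, `missing_metadata(vars(mod))` would return an empty list when every variable is present.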

Note: the source and seacrowd_t2t schemas are now completely different. The source schema contains only the sentence for a single language (as implemented in openlanguagedata/flores_plus), while seacrowd_t2t is implemented for the actual MT task.
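
To illustrate the difference between the two schemas, here is a minimal sketch of how monolingual source-style rows could be joined into translation pairs. It is not the code in this PR: the function name `pair_sentences` is hypothetical, and the field names (`id` and `text` for source rows; `text_1`, `text_2`, `text_1_name`, `text_2_name` for the paired rows) are assumptions for illustration that should be checked against the actual dataset and schema definitions.

```python
from typing import Dict, Iterator, List


def pair_sentences(
    src_rows: List[Dict[str, str]],
    tgt_rows: List[Dict[str, str]],
    src_lang: str,
    tgt_lang: str,
) -> Iterator[Dict[str, str]]:
    """Join two monolingual row lists on their sentence id.

    Each input row is assumed to look like {"id": ..., "text": ...}
    (one sentence in one language). Each output row holds a
    source/target sentence pair for an MT-style task.
    """
    # Index target sentences by id so the lookup does not depend on row order.
    tgt_by_id = {row["id"]: row["text"] for row in tgt_rows}
    for row in src_rows:
        if row["id"] not in tgt_by_id:
            # Skip sentences that have no counterpart in the target language.
            continue
        yield {
            "id": str(row["id"]),
            "text_1": row["text"],
            "text_2": tgt_by_id[row["id"]],
            "text_1_name": src_lang,
            "text_2_name": tgt_lang,
        }
```

Joining on the sentence id rather than on row position keeps the pairing correct even if the two language files are ordered differently or one of them is missing a sentence.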

refactor FLORES-200 script change source schema & update script for l…

30e626a

…oading for the new FLORES-200 v2.0 in openlanguagedata/flores_plus

SamuelCahyawijaya self-assigned this Dec 7, 2024

SamuelCahyawijaya requested review from holylovenia, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, danjohnvelasco, MJonibek and tellarin as code owners December 7, 2024 01:30

Update flores200.py

b37a842

Fix Typo
