Closes #5 | Add tatoeba dataset loader #22

ljvmiranda921 · 2023-11-05T01:18:05Z

Closes #5

Some notable things to point out from this dataset:

Instead of implementing one loader per language, I created several configs per language in a single loader. This means that the configs would look like this: tatoeba.ind_seacrowd_t2t, tatoeba.ind_source, and so on. When testing, we should pass a value to the --subset_id parameter like so:

python -m tests.test_seacrowd seacrowd/sea_datasets/tatoeba/tatoeba.py --subset_id tatoeba.ind
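
The per-language config layout described above could be generated roughly as follows. This is a minimal sketch, not the code in this PR: LoaderConfig is a hypothetical stand-in for SEACrowdConfig, build_configs is a made-up helper, and the language list is a small illustrative sample rather than the loader's real _LANGUAGES.

```python
from dataclasses import dataclass

# Illustrative sample only; the real loader defines its own _LANGUAGES.
_LANGUAGES = ["ind", "vie", "tgl"]


@dataclass(frozen=True)
class LoaderConfig:
    """Hypothetical stand-in for SEACrowdConfig (assumption, not the real class)."""

    name: str
    schema: str
    subset_id: str


def build_configs(languages):
    """Create one source config and one seacrowd_t2t config per language."""
    configs = []
    for lang in sorted(languages):
        subset_id = f"tatoeba.{lang}"
        for schema in ("source", "seacrowd_t2t"):
            configs.append(
                LoaderConfig(
                    name=f"{subset_id}_{schema}",
                    schema=schema,
                    subset_id=subset_id,
                )
            )
    return configs


BUILDER_CONFIGS = build_configs(_LANGUAGES)
```

With this layout, the value passed to --subset_id corresponds to the subset_id field, and each subset contributes exactly two config names.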

The source_lang field in the original HF link seems incorrect? I'm not familiar with this dataset so for my implementation I just copied over what's in the source. Let me know if I should "override" this with the actual language code.

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Closes SEACrowd#5

seacrowd/sea_datasets/tatoeba/tatoeba.py

ryanignatius · 2023-11-11T08:03:24Z

The source_lang field in the original HF link seems incorrect? I'm not familiar with this dataset so for my implementation I just copied over what's in the source. Let me know if I should "override" this with the actual language code.

About this, I think it would be better to "override" it with the actual language code.
What do you think, @SamuelCahyawijaya?

holylovenia · 2023-11-16T03:41:08Z

@ryanignatius I agree with you. Could you please override it with the actual language codes? @ljvmiranda921

ljvmiranda921 · 2023-11-16T03:58:50Z

Could you please override it with the actual language codes? @ljvmiranda921

Should be done: aba9c65
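
For readers without access to the commit, the override could look something like the sketch below. It is only an illustration under stated assumptions: override_source_lang is a hypothetical helper, the row is invented sample data, and apart from source_lang the field names are guesses rather than the dataset's confirmed schema.

```python
def override_source_lang(row, subset_lang):
    """Return a copy of `row` with source_lang replaced by the subset's code.

    Hypothetical helper: the actual change in the commit may differ.
    """
    fixed = dict(row)  # copy so the upstream row is left untouched
    fixed["source_lang"] = subset_lang
    return fixed


# Invented example row; only "source_lang" is a field named in this thread.
row = {
    "source_sentence": "Selamat pagi.",
    "target_sentence": "Good morning.",
    "source_lang": "en",
}
fixed_row = override_source_lang(row, "ind")
```

Copying the row rather than mutating it keeps the upstream value available for comparison while the loader emits the corrected code.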

holylovenia

One last thing, could you also please provide tatoeba_source and tatoeba_seacrowd_t2t schemas where they load all data? @ljvmiranda921

holylovenia · 2023-11-17T12:59:18Z

seacrowd/sea_datasets/tatoeba/tatoeba.py

+_SEACROWD_VERSION = "1.0.0"
+
+
+class TatoebaDatset(datasets.GeneratorBasedBuilder):


Could you please change the class name to TatoebaDataset?

Thanks for catching! Fixed d5bef23

holylovenia · 2023-11-17T13:00:51Z

seacrowd/sea_datasets/tatoeba/tatoeba.py

+
+    SEACROWD_SCHEMA_NAME = "t2t"
+
+    dataset_names = sorted([f"tatoeba.{lang}" for lang in _LANGUAGES])


Instead of using . as the delimiter, could you please change it to _?

Sure! e95f83e

ljvmiranda921 · 2023-11-17T14:07:24Z

Converting this to draft for now. Will still work on the schema that loads all datasets at once.

ljvmiranda921 · 2023-11-17T14:53:12Z

could you also please provide tatoeba_source and tatoeba_seacrowd_t2t schemas where they load all data?

Hi @holylovenia ! I implemented it here: a44be2b

The default config now is tatoeba_source (loads all subsets from every SEA language using the source schema).
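
One way such an aggregate config could be resolved is sketched below, using the underscore-delimited names adopted earlier in this review. It is not taken from a44be2b: languages_for_config is a hypothetical helper and the language list is an illustrative sample.

```python
# Illustrative sample only; the real loader defines its own _LANGUAGES.
_LANGUAGES = ["ind", "vie", "tgl"]
_SCHEMA_SUFFIXES = ("_source", "_seacrowd_t2t")


def languages_for_config(config_name, languages=_LANGUAGES):
    """Map a config name to the list of language subsets it should load."""
    for suffix in _SCHEMA_SUFFIXES:
        if config_name.endswith(suffix):
            subset = config_name[: -len(suffix)]
            break
    else:
        raise ValueError(f"Unknown schema in config name: {config_name}")

    # A bare "tatoeba" subset means "load every language".
    if subset == "tatoeba":
        return sorted(languages)

    prefix = "tatoeba_"
    lang = subset[len(prefix):] if subset.startswith(prefix) else None
    if lang not in languages:
        raise ValueError(f"Unknown subset in config name: {config_name}")
    return [lang]
```

Under these assumptions, tatoeba_source and tatoeba_seacrowd_t2t both resolve to every language in the list, while a name such as tatoeba_ind_source resolves to a single subset.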

holylovenia

Thank you so much for the changes, @ljvmiranda921! Looks perfect to me. 👍 Let's wait for @ryanignatius's review then we can merge it.

ryanignatius

it looks good to me too,
thank you for the help @ljvmiranda921

Add tatoeba dataset loader

026d00a

Closes SEACrowd#5

ljvmiranda921 requested review from holylovenia, SamuelCahyawijaya, fajri91 and afaji as code owners November 5, 2023 01:18

SamuelCahyawijaya requested review from RosenZhang, jcblaisecruz02 and ryanignatius and removed request for RosenZhang and jcblaisecruz02 November 6, 2023 14:26

ryanignatius reviewed Nov 11, 2023

ljvmiranda921 added 2 commits November 11, 2023 18:21

Strip each row

0ee8922

Rerun make check_run on the source

c7f2fd7

holylovenia removed request for SamuelCahyawijaya, fajri91 and afaji November 16, 2023 03:41

Override lang with the actual language code

aba9c65

holylovenia requested changes Nov 17, 2023

Fix incorrect spelling in loader class

d5bef23

ljvmiranda921 marked this pull request as draft November 17, 2023 14:06

Change delimiter from period to underscore

e95f83e

ljvmiranda921 force-pushed the add/tatoeba branch from 9317dea to e95f83e Compare November 17, 2023 14:11

Add builder configs that load all instances

a44be2b

ljvmiranda921 force-pushed the add/tatoeba branch from a434bfa to a44be2b Compare November 17, 2023 14:51

ljvmiranda921 marked this pull request as ready for review November 17, 2023 14:53

Run formatter on source file

414b313

holylovenia approved these changes Nov 18, 2023

ryanignatius approved these changes Nov 19, 2023

holylovenia merged commit 0a5fa46 into SEACrowd:master Nov 19, 2023
1 check passed

ljvmiranda921 deleted the add/tatoeba branch November 19, 2023 10:05

ljvmiranda921 mentioned this pull request Nov 24, 2023

Closes #84 | Implement CulturaX Dataloader #98

Merged

8 tasks
