Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #5 | Add tatoeba dataset loader #22

Merged
merged 8 commits into from
Nov 19, 2023

Conversation

ljvmiranda921
Copy link
Collaborator

@ljvmiranda921 ljvmiranda921 commented Nov 5, 2023

Closes #5

Some notable things to point out from this dataset:

  • Instead of implementing one loader per language, I created several configs per language in a single loader. TThis means that the configs would look like this: tatoeba.ind_seacrowd_t2t, tatoeba.ind_source and so on. When testing, we should pass a value to the --subset_id parameter like so:
python -m tests.test_seacrowd seacrwod/sea_datasets/tatoeba/tatoeba.py --subset_id tatoeba.ind
  • The source_lang field in the original HF link seems incorrect? I'm not familiar with this dataset so for my implementation I just copied over what's in the source. Let me know if I should "override" this with the actual language code.

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@ryanignatius
Copy link
Collaborator

The source_lang field in the original HF link seems incorrect? I'm not familiar with this dataset so for my implementation I just copied over what's in the source. Let me know if I should "override" this with the actual language code.

About this, I think it will be better to "override" this with the actual language code,
What do you think @SamuelCahyawijaya ?

@holylovenia
Copy link
Contributor

@ryanignatius I agree with you. Could you please override it with the actual language codes? @ljvmiranda921

@ljvmiranda921
Copy link
Collaborator Author

ljvmiranda921 commented Nov 16, 2023

Could you please override it with the actual language codes? @ljvmiranda921

Should be done: aba9c65

Copy link
Contributor

@holylovenia holylovenia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing, could you also please provide tatoeba_source and tatoeba_seacrowd_t2t schemas where they load all data? @ljvmiranda921

_SEACROWD_VERSION = "1.0.0"


class TatoebaDatset(datasets.GeneratorBasedBuilder):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please change the class name to TatoebaDataset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching! Fixed d5bef23


SEACROWD_SCHEMA_NAME = "t2t"

dataset_names = sorted([f"tatoeba.{lang}" for lang in _LANGUAGES])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using . as the delimiter, could you please change it to _?

Copy link
Collaborator Author

@ljvmiranda921 ljvmiranda921 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! e95f83e

@ljvmiranda921 ljvmiranda921 marked this pull request as draft November 17, 2023 14:06
@ljvmiranda921
Copy link
Collaborator Author

Converting this to draft for now. Will still work on the schema that loads all datasets at once.

@ljvmiranda921
Copy link
Collaborator Author

could you also please provide tatoeba_source and tatoeba_seacrowd_t2t schemas where they load all data?

Hi @holylovenia ! I implemented it here: a44be2b

The default config now is tatoeba_source (loads all subsets from every SEA language using the source schema).

@ljvmiranda921 ljvmiranda921 marked this pull request as ready for review November 17, 2023 14:53
Copy link
Contributor

@holylovenia holylovenia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the changes, @ljvmiranda921! Looks perfect to me. 👍 Let's wait for @ryanignatius's review then we can merge it.

Copy link
Collaborator

@ryanignatius ryanignatius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good to me too,
thank you for the help @ljvmiranda921

@holylovenia holylovenia merged commit 0a5fa46 into SEACrowd:master Nov 19, 2023
1 check passed
@ljvmiranda921 ljvmiranda921 deleted the add/tatoeba branch November 19, 2023 10:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create dataset loader for Tatoeba
3 participants