Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #623 | Add/Update Dataloader MedEV #639

Merged
merged 6 commits into from
May 31, 2024

Conversation

patrickamadeus
Copy link
Collaborator

Closes #623

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
  • [.] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Tests

image

Copy link
Collaborator

@elyanah-aco elyanah-aco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @patrickamadeus! Just suggesting some small changes:

seacrowd/sea_datasets/medev/medev.py Outdated Show resolved Hide resolved
"""Yields examples as (key, example) tuples."""
with open(filepath["en"], "r") as f:
en_lines = f.readlines()
with open(filepath["vie"], "r") as f:
Copy link
Collaborator

@elyanah-aco elyanah-aco May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you determine what encoding your computer uses for the Vietnamese file? Both utf-8 and latin-1 don't give me the intended result, and other encodings don't work at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mine is utf-8. it's working well from my side.

image

seacrowd/sea_datasets/medev/medev.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/medev/medev.py Outdated Show resolved Hide resolved
Comment on lines 93 to 95
{
"text": datasets.Value("string"),
}
Copy link
Collaborator

@elyanah-aco elyanah-aco May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{
"text": datasets.Value("string"),
}
{
"id": datasets.Value("string"),
"text": datasets.Value("string"),
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also adding to this, do we really want to not match the English text and Vietnamese translation together? I know that the dataset viewer in the homepage shows the data in a stack, but I think for a dataloader, we should add them together. Wdyt @elyanah-aco?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that would be more useful. Maybe something like id, vie_text and eng_text fields for source is okay

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I'll address @akhdanfadh 's suggestion after everything concluded from ur guys' side!

Copy link
Collaborator

@akhdanfadh akhdanfadh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the dataloader! Just add some minor edit and additional discussion.

seacrowd/sea_datasets/medev/medev.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/medev/medev.py Outdated Show resolved Hide resolved
Comment on lines 93 to 95
{
"text": datasets.Value("string"),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also adding to this, do we really want to not match the English text and Vietnamese translation together? I know that the dataset viewer in the homepage shows the data in a stack, but I think for a dataloader, we should add them together. Wdyt @elyanah-aco?

@patrickamadeus
Copy link
Collaborator Author

Hi @elyanah-aco ! I've addressed all of the suggestions! Appreciate the detailed review.

I will address suggestion from @akhdanfadh after your second opinion 🙏.

@holylovenia
Copy link
Contributor

Also adding to this, do we really want to not match the English text and Vietnamese translation together? I know that the dataset viewer in the homepage shows the data in a stack, but I think for a dataloader, we should add them together. Wdyt @elyanah-aco?

Hi @elyanah-aco ! I've addressed all of the suggestions! Appreciate the detailed review.

I will address suggestion from @akhdanfadh after your second opinion 🙏.

A friendly reminder for @elyanah-aco in case she missed it.

Copy link
Collaborator

@elyanah-aco elyanah-aco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @patrickamadeus! OK now on my end (even the Vietnamese text)

Comment on lines 93 to 95
{
"text": datasets.Value("string"),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that would be more useful. Maybe something like id, vie_text and eng_text fields for source is okay

@patrickamadeus
Copy link
Collaborator Author

Hi all @akhdanfadh @elyanah-aco ! The minor language expand is done! Thank you for all of the reviews. 🙏

@holylovenia
Copy link
Contributor

Hi @akhdanfadh, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @patrickamadeus

Copy link
Collaborator

@akhdanfadh akhdanfadh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Dataloader tested and worked. Merging in a bit.

@akhdanfadh akhdanfadh merged commit bbc6e55 into SEACrowd:master May 31, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create dataset loader for MedEV
4 participants