Closes #623 | Add/Update Dataloader MedEV #639

patrickamadeus · 2024-04-12T07:38:41Z

Closes #623

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[.] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Tests

elyanah-aco

Thanks for the PR @patrickamadeus! Just suggesting some small changes:

elyanah-aco · 2024-05-02T00:18:35Z

seacrowd/sea_datasets/medev/medev.py

+        """Yields examples as (key, example) tuples."""
+        with open(filepath["en"], "r") as f:
+            en_lines = f.readlines()
+        with open(filepath["vie"], "r") as f:


Can you determine what encoding your computer uses for the Vietnamese file? Both utf-8 and latin-1 don't give me the intended result, and other encodings don't work at all.

patrickamadeus · 2024-05-19

Mine is utf-8. It's working well on my side.
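The snippet under review calls `open(..., "r")` without an `encoding` argument, so Python falls back to a platform-dependent default, which is one plausible reason two machines might decode the same file differently. The following is only a minimal sketch of passing the encoding explicitly; the helper name `read_lines`, the sample sentence, and the temporary file name are hypothetical and are not taken from the PR.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding, dropping the trailing newline of each line."""
    # Passing `encoding` avoids relying on the platform default.
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical sample sentence, written to a temporary file as UTF-8 bytes.
sample = "Bệnh nhân bị sốt cao."
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.vi.txt")
    with open(path, "wb") as f:
        f.write((sample + "\n").encode("utf-8"))

    # Decoding with the encoding the file was written in restores the text.
    utf8_lines = read_lines(path, encoding="utf-8")
    # Decoding the same bytes as latin-1 does not fail, but yields garbled text.
    latin1_lines = read_lines(path, encoding="latin-1")

print(utf8_lines[0] == sample)
print(latin1_lines[0] == sample)
```

If the file on disk is not actually UTF-8, this sketch will not fix it; it only makes the decoding choice explicit and reproducible across machines.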

elyanah-aco · 2024-05-02T00:28:05Z

seacrowd/sea_datasets/medev/medev.py

+                {
+                    "text": datasets.Value("string"),
+                }


Suggested change

-                {
-                    "text": datasets.Value("string"),
-                }
+                {
+                    "id": datasets.Value("string"),
+                    "text": datasets.Value("string"),
+                }

akhdanfadh · 2024-05-09

Also adding to this, do we really want to not match the English text and Vietnamese translation together? I know that the dataset viewer in the homepage shows the data in a stack, but I think for a dataloader, we should add them together. Wdyt @elyanah-aco?

elyanah-aco · 2024-05-09

I agree, that would be more useful. Something like id, vie_text, and eng_text fields for the source schema would be okay.

patrickamadeus · 2024-05-19

Hi! I'll address @akhdanfadh's suggestion once everything is settled on your side!
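The pairing discussed in this thread could look roughly like the sketch below. It assumes the English and Vietnamese files are line-aligned (line i of one is the translation of line i of the other), which the thread implies but does not state outright. The field names `id`, `vie_text`, and `eng_text` come from the reviewer's suggestion; the function body, the length check, and the sample sentences are illustrative and are not the merged implementation.

```python
import os
import tempfile


def generate_examples(en_path, vie_path, encoding="utf-8"):
    """Yield (key, example) tuples that pair line i of the English and Vietnamese files."""
    with open(en_path, "r", encoding=encoding) as f:
        en_lines = [line.rstrip("\n") for line in f]
    with open(vie_path, "r", encoding=encoding) as f:
        vie_lines = [line.rstrip("\n") for line in f]

    # Assumption: the files are parallel, so a length mismatch signals bad input.
    if len(en_lines) != len(vie_lines):
        raise ValueError(
            f"Line count mismatch: {len(en_lines)} English vs {len(vie_lines)} Vietnamese"
        )

    for idx, (en, vie) in enumerate(zip(en_lines, vie_lines)):
        yield idx, {"id": str(idx), "vie_text": vie, "eng_text": en}


# Hypothetical two-sentence corpus used only to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "sample.en.txt")
    vie_path = os.path.join(tmp, "sample.vi.txt")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("The patient has a high fever.\nTake one tablet daily.\n")
    with open(vie_path, "w", encoding="utf-8") as f:
        f.write("Bệnh nhân bị sốt cao.\nUống một viên mỗi ngày.\n")

    examples = list(generate_examples(en_path, vie_path))

for key, example in examples:
    print(key, example)
```

Keeping both sides in one record means a consumer never has to re-align two separate splits, at the cost of failing loudly when the files differ in length.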

akhdanfadh

Thank you for the dataloader! Just adding a minor edit and some additional discussion.

akhdanfadh · 2024-05-09T00:56:41Z

seacrowd/sea_datasets/medev/medev.py

+                {
+                    "text": datasets.Value("string"),
+                }


Also adding to this, do we really want to not match the English text and Vietnamese translation together? I know that the dataset viewer in the homepage shows the data in a stack, but I think for a dataloader, we should add them together. Wdyt @elyanah-aco?

patrickamadeus · 2024-05-19T10:32:38Z

Hi @elyanah-aco! I've addressed all of the suggestions! I appreciate the detailed review.

I will address the suggestion from @akhdanfadh after your second opinion 🙏.

holylovenia · 2024-05-21T08:33:22Z

> Also adding to this, do we really want to not match the English text and Vietnamese translation together? I know that the dataset viewer in the homepage shows the data in a stack, but I think for a dataloader, we should add them together. Wdyt @elyanah-aco?

> Hi @elyanah-aco ! I've addressed all of the suggestions! Appreciate the detailed review.
>
> I will address suggestion from @akhdanfadh after your second opinion 🙏.

A friendly reminder for @elyanah-aco in case she missed it.

elyanah-aco

Thanks for the changes @patrickamadeus! OK now on my end (even the Vietnamese text)

elyanah-aco · 2024-05-09T01:14:07Z

seacrowd/sea_datasets/medev/medev.py

+                {
+                    "text": datasets.Value("string"),
+                }


I agree that would be more useful. Maybe something like id, vie_text and eng_text fields for source is okay

patrickamadeus · 2024-05-22T16:27:31Z

Hi all, @akhdanfadh @elyanah-aco! The minor expansion of the source schema to cover both languages is done! Thank you for all of the reviews. 🙏

holylovenia · 2024-05-30T04:41:09Z

Hi @akhdanfadh, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @patrickamadeus

akhdanfadh

LGTM! Dataloader tested and worked. Merging in a bit.

feat: add MedEV dataloader

3261098

patrickamadeus requested review from holylovenia, SamuelCahyawijaya, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, jensan-1, danjohnvelasco, MJonibek and tellarin as code owners April 12, 2024 07:38

nitpick

87405db

holylovenia requested review from akhdanfadh and elyanah-aco and removed request for tellarin, gentaiscool, jamesjaya, SamuelCahyawijaya, ljvmiranda921, holylovenia, yongzx, MJonibek, danjohnvelasco, jensan-1 and sabilmakbar April 27, 2024 15:15

holylovenia assigned akhdanfadh and elyanah-aco Apr 27, 2024

elyanah-aco reviewed May 2, 2024

akhdanfadh reviewed May 9, 2024

patrickamadeus and others added 3 commits May 19, 2024 16:50

nitpick encoding generate examples

49bf476

Co-authored-by: Elyanah Aco <[email protected]>

nitpick yield id

e5fe6e0

Co-authored-by: Elyanah Aco <[email protected]>

fix: URLs structure, encoding, lang isocode

544d3c5

elyanah-aco approved these changes May 21, 2024

feat: expand vie + eng source schema

93108c6

akhdanfadh approved these changes May 31, 2024

akhdanfadh merged commit bbc6e55 into SEACrowd:master May 31, 2024
1 check passed
