Closes #571 | Add/Update Dataloader UP2.0 #660

fhudi · 2024-05-01T13:36:50Z

Closes #571

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Currently implemented as source-only, as there seems to be no matching task.
(Based on the information provided in the card)

Issue was initially raised in the issue thread #571.

SamuelCahyawijaya

@fhudi : LGTM!

ljvmiranda921

Ran the data loader and worked! I guess this one doesn't have a SEACrowd schema, no? Some commented blocks that can be deleted (optionally). But good to merge as is 👍

seacrowd/sea_datasets/up2/up2.py

ljvmiranda921 · 2024-05-16T03:24:17Z

seacrowd/sea_datasets/up2/up2.py

+        # *[SEACrowdConfig(
+        #     name=f"{_DATASETNAME}{'_' if _LANG else ''}{_LANG}_seacrowd_[seacrowd_schema_name]",
+        #     version=datasets.Version(_SEACROWD_VERSION),
+        #     description=f"{_DATASETNAME} SEACrowd schema",
+        #     schema="seacrowd_[seacrowd_schema_name]",
+        #     subset_id=f"{_DATASETNAME}{'_' if _LANG else ''}{_LANG}",
+        # ) for _LANG in ['', *_LANGUAGES]],


I assume this one doesn't have a SEACrowd schema. I guess we can delete this commented block?

seacrowd/sea_datasets/up2/up2.py

khelli07 · 2024-05-16T15:27:24Z

Unfortunately the test did not pass on my end.
Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

khelli07 · 2024-05-16T15:37:10Z

Changed utils/common_parser.py > load_ud_data

dataset_raw = parse(open(filepath).read())

to

with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())

But please confirm if these result are correct or not

Ran through loader if __name__ == "__main__":
A test sample:

holylovenia · 2024-05-21T08:50:28Z

Hi @SamuelCahyawijaya @ljvmiranda921 @khelli07 and @fhudi, I won't merge this dataloader as-is due to the error reported by @khelli07. Let's wait until @fhudi addresses @ljvmiranda921 and @khelli07's feedback.

PS: Sorry I messed up by assigning this dataloader to 3 people. 🫠 I will adjust the reviewing scores later.

fhudi · 2024-05-21T11:01:40Z

Unfortunately the test did not pass on my end. Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

@khelli07
Note that this is source-only dataloader.

As mentioned in the reviewing checklist point 3, you need to execute test script using the one that is appropriate under the test folder.
Please use the one for source_only.

You can rever to the example of executable test script on how to use it.

P.S.:
@holylovenia
I can't find any readme file referring to how to test source-only dataset. Maybe lost when migrating from NusaCrowd 🥲.

Co-authored-by: Lj Miranda <[email protected]>

fhudi · 2024-05-21T12:57:00Z

Changed utils/common_parser.py > load_ud_data
dataset_raw = parse(open(filepath).read())
to
with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())
But please confirm if these result are correct or not

@khelli07
Thanks for the suggestion. LGTM.
But may I know the concise reasoning on these changes?

khelli07 · 2024-05-21T13:31:01Z

Hi, the change is to avoid this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

It appears when I ran datasets.load_dataset too.

After I include that change, I successfully ran datasets.load_dataset and python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/up2/up2.py --subset_id up2.

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

fhudi · 2024-05-21T15:08:07Z

Hi, the change is to avoid this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

Could you try with the following combination and see whether it is solved with any of the followings:

open(filepath, encoding='utf-8')
open(filepath, errors='replace')

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

Are you using WSL?

khelli07 · 2024-05-26T03:15:24Z

Nope, I use Windows without WSL. Let me try it.

khelli07 · 2024-05-26T04:10:29Z

Ok just finished testing, both of them works. So you might be able to choose either one.

holylovenia · 2024-05-30T04:45:45Z

Hi @fhudi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Has everything been addressed now? Can we merge this PR?

ljvmiranda921 · 2024-05-30T05:21:13Z

I think this is just waiting for @khelli07 's review. But from the last comment it seems that it has been resolved 🤔 . I was able to run this so I think it's mostly a platform-specific issue. Dataloader works though so I think it's safe to merge!

khelli07

The load_ud_data in the common_parser.py hasn't been changed yet. Without that change, I don't think I can run successfully in my Windows.

And yes, as @ljvmiranda921 said, I agree that this is a platform-specific issue. If that can be tolerated, then can merge.

fhudi · 2024-05-30T16:45:39Z

@khelli07
Sorry for the late reply.
Honestly, adding the parameter errors='replace' might change the overall behaviour.
As this function is used by other DataLoader as well, I am a bit skeptical in including this change.
Also, based on your reply, the root cause does not seem to be from the additional parameters.
However, I have made the adjustment as seen from the commit.
cc: @holylovenia @ljvmiranda921

khelli07

LGTM!

fhudi · 2024-05-31T08:54:19Z

@khelli07 @ljvmiranda921
I believe this is save for merging?

sabilmakbar · 2024-05-31T09:11:27Z

open(filepath, encoding='utf-8')
open(filepath, errors='replace')

Sorry for jumping it late. I think this is a platform-based issue where, for non-Linux/MacOS, the encoding doesn't default to utf-8. Hence, the first one is preferable.

sabilmakbar · 2024-05-31T09:12:31Z

Since both reviewers have given the approver (and the dataloader assignee best not to merge it themselves even though it has been approved), I'll proceed with the merging.

fyi @khelli07 @ljvmiranda921 @SamuelCahyawijaya

Create dataset loader for UP2.0 (SEACrowd#571)

f3d1b9a

fhudi requested review from holylovenia, SamuelCahyawijaya, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, jensan-1, danjohnvelasco, MJonibek and tellarin as code owners May 1, 2024 13:36

holylovenia requested review from khelli07 and removed request for tellarin, gentaiscool, jamesjaya, ljvmiranda921, holylovenia, yongzx, MJonibek, danjohnvelasco, jensan-1 and sabilmakbar May 7, 2024 09:16

holylovenia assigned SamuelCahyawijaya and khelli07 May 7, 2024

holylovenia requested review from elyanah-aco and removed request for SamuelCahyawijaya May 13, 2024 08:53

holylovenia assigned elyanah-aco and unassigned SamuelCahyawijaya May 13, 2024

holylovenia removed the request for review from elyanah-aco May 13, 2024 08:53

holylovenia unassigned elyanah-aco May 13, 2024

SamuelCahyawijaya approved these changes May 13, 2024

View reviewed changes

ljvmiranda921 approved these changes May 16, 2024

View reviewed changes

fhudi and others added 4 commits May 21, 2024 20:02

Update seacrowd/sea_datasets/up2/up2.py

a3281a3

Co-authored-by: Lj Miranda <[email protected]>

Update seacrowd/sea_datasets/up2/up2.py

fea5abc

Co-authored-by: Lj Miranda <[email protected]>

Update seacrowd/sea_datasets/up2/up2.py

cfd49c3

Co-authored-by: Lj Miranda <[email protected]>

Update up2.py

9d55ba2

khelli07 requested changes May 30, 2024

View reviewed changes

fhudi added 2 commits May 31, 2024 01:40

Update up2.py, reformat from makefile.

8ba35b5

Update common_parser.py for a safer IO process.

91d389c

khelli07 self-requested a review May 31, 2024 05:28

khelli07 approved these changes May 31, 2024

View reviewed changes

fhudi mentioned this pull request May 31, 2024

Closes #4 | Extend & Merge carryover ud dataloaders #247

Merged

8 tasks

sabilmakbar merged commit d873404 into SEACrowd:master May 31, 2024
1 check passed

fhudi deleted the dl-up2 branch May 31, 2024 15:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Closes #571 | Add/Update Dataloader UP2.0 #660

Closes #571 | Add/Update Dataloader UP2.0 #660

fhudi commented May 1, 2024

SamuelCahyawijaya left a comment

ljvmiranda921 left a comment

ljvmiranda921 May 16, 2024

fhudi May 21, 2024

khelli07 commented May 16, 2024 •

edited

Loading

khelli07 commented May 16, 2024

holylovenia commented May 21, 2024 •

edited

Loading

fhudi commented May 21, 2024

fhudi commented May 21, 2024

khelli07 commented May 21, 2024 •

edited

Loading

fhudi commented May 21, 2024

khelli07 commented May 26, 2024

khelli07 commented May 26, 2024

holylovenia commented May 30, 2024

ljvmiranda921 commented May 30, 2024

khelli07 left a comment •

edited

Loading

fhudi commented May 30, 2024

khelli07 left a comment

fhudi commented May 31, 2024

sabilmakbar commented May 31, 2024

sabilmakbar commented May 31, 2024 •

edited

Loading

Closes #571 | Add/Update Dataloader UP2.0 #660

Closes #571 | Add/Update Dataloader UP2.0 #660

Conversation

fhudi commented May 1, 2024

Checkbox

SamuelCahyawijaya left a comment

Choose a reason for hiding this comment

ljvmiranda921 left a comment

Choose a reason for hiding this comment

ljvmiranda921 May 16, 2024

Choose a reason for hiding this comment

fhudi May 21, 2024

Choose a reason for hiding this comment

khelli07 commented May 16, 2024 • edited Loading

khelli07 commented May 16, 2024

holylovenia commented May 21, 2024 • edited Loading

fhudi commented May 21, 2024

fhudi commented May 21, 2024

khelli07 commented May 21, 2024 • edited Loading

fhudi commented May 21, 2024

khelli07 commented May 26, 2024

khelli07 commented May 26, 2024

holylovenia commented May 30, 2024

ljvmiranda921 commented May 30, 2024

khelli07 left a comment • edited Loading

Choose a reason for hiding this comment

fhudi commented May 30, 2024

khelli07 left a comment

Choose a reason for hiding this comment

fhudi commented May 31, 2024

sabilmakbar commented May 31, 2024

sabilmakbar commented May 31, 2024 • edited Loading

khelli07 commented May 16, 2024 •

edited

Loading

holylovenia commented May 21, 2024 •

edited

Loading

khelli07 commented May 21, 2024 •

edited

Loading

khelli07 left a comment •

edited

Loading

sabilmakbar commented May 31, 2024 •

edited

Loading