Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #571 | Add/Update Dataloader UP2.0 #660

Merged
merged 7 commits into from
May 31, 2024
Merged

Conversation

fhudi
Copy link
Collaborator

@fhudi fhudi commented May 1, 2024

Closes #571

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Currently implemented as source-only, as there seems to be no matching task.
(Based on the information provided in the card)

Issue was initially raised in the issue thread #571.

Copy link
Collaborator

@SamuelCahyawijaya SamuelCahyawijaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fhudi : LGTM!

Copy link
Collaborator

@ljvmiranda921 ljvmiranda921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran the data loader and worked! I guess this one doesn't have a SEACrowd schema, no? Some commented blocks that can be deleted (optionally). But good to merge as is 👍

seacrowd/sea_datasets/up2/up2.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/up2/up2.py Outdated Show resolved Hide resolved
Comment on lines 110 to 116
# *[SEACrowdConfig(
# name=f"{_DATASETNAME}{'_' if _LANG else ''}{_LANG}_seacrowd_[seacrowd_schema_name]",
# version=datasets.Version(_SEACROWD_VERSION),
# description=f"{_DATASETNAME} SEACrowd schema",
# schema="seacrowd_[seacrowd_schema_name]",
# subset_id=f"{_DATASETNAME}{'_' if _LANG else ''}{_LANG}",
# ) for _LANG in ['', *_LANGUAGES]],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this one doesn't have a SEACrowd schema. I guess we can delete this commented block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. 👍

seacrowd/sea_datasets/up2/up2.py Outdated Show resolved Hide resolved
@khelli07
Copy link
Collaborator

khelli07 commented May 16, 2024

Unfortunately the test did not pass on my end.
Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

image

@khelli07
Copy link
Collaborator

Changed utils/common_parser.py > load_ud_data

dataset_raw = parse(open(filepath).read())

to

with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())

But please confirm if these result are correct or not

  1. Ran through loader if __name__ == "__main__":
    image

  2. A test sample:
    image

@holylovenia
Copy link
Contributor

holylovenia commented May 21, 2024

Hi @SamuelCahyawijaya @ljvmiranda921 @khelli07 and @fhudi, I won't merge this dataloader as-is due to the error reported by @khelli07. Let's wait until @fhudi addresses @ljvmiranda921 and @khelli07's feedback.

PS: Sorry I messed up by assigning this dataloader to 3 people. 🫠 I will adjust the reviewing scores later.

@fhudi
Copy link
Collaborator Author

fhudi commented May 21, 2024

Unfortunately the test did not pass on my end. Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

@khelli07
Note that this is source-only dataloader.

As mentioned in the reviewing checklist point 3, you need to execute test script using the one that is appropriate under the test folder.
Please use the one for source_only.

You can rever to the example of executable test script on how to use it.


P.S.:
@holylovenia
I can't find any readme file referring to how to test source-only dataset. Maybe lost when migrating from NusaCrowd 🥲.

@fhudi
Copy link
Collaborator Author

fhudi commented May 21, 2024

Changed utils/common_parser.py > load_ud_data

dataset_raw = parse(open(filepath).read())

to

with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())

But please confirm if these result are correct or not

@khelli07
Thanks for the suggestion. LGTM.
But may I know the concise reasoning on these changes?

@khelli07
Copy link
Collaborator

khelli07 commented May 21, 2024

Hi, the change is to avoid this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

It appears when I ran datasets.load_dataset too.

After I include that change, I successfully ran datasets.load_dataset and python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/up2/up2.py --subset_id up2.

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

@fhudi
Copy link
Collaborator Author

fhudi commented May 21, 2024

Hi, the change is to avoid this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

Could you try with the following combination and see whether it is solved with any of the followings:

  1. open(filepath, encoding='utf-8')
  2. open(filepath, errors='replace')

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

Are you using WSL?

@khelli07
Copy link
Collaborator

Nope, I use Windows without WSL. Let me try it.

@khelli07
Copy link
Collaborator

Ok just finished testing, both of them works. So you might be able to choose either one.

@holylovenia
Copy link
Contributor

Hi @fhudi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Has everything been addressed now? Can we merge this PR?

@ljvmiranda921
Copy link
Collaborator

I think this is just waiting for @khelli07 's review. But from the last comment it seems that it has been resolved 🤔 . I was able to run this so I think it's mostly a platform-specific issue. Dataloader works though so I think it's safe to merge!

Copy link
Collaborator

@khelli07 khelli07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load_ud_data in the common_parser.py hasn't been changed yet. Without that change, I don't think I can run successfully in my Windows.

And yes, as @ljvmiranda921 said, I agree that this is a platform-specific issue. If that can be tolerated, then can merge.

@fhudi
Copy link
Collaborator Author

fhudi commented May 30, 2024

@khelli07
Sorry for the late reply.
Honestly, adding the parameter errors='replace' might change the overall behaviour.
As this function is used by other DataLoader as well, I am a bit skeptical in including this change.
Also, based on your reply, the root cause does not seem to be from the additional parameters.
However, I have made the adjustment as seen from the commit.
cc: @holylovenia @ljvmiranda921

@khelli07 khelli07 self-requested a review May 31, 2024 05:28
Copy link
Collaborator

@khelli07 khelli07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@fhudi
Copy link
Collaborator Author

fhudi commented May 31, 2024

@khelli07 @ljvmiranda921
I believe this is save for merging?

@sabilmakbar
Copy link
Collaborator

open(filepath, encoding='utf-8')
open(filepath, errors='replace')

Sorry for jumping it late. I think this is a platform-based issue where, for non-Linux/MacOS, the encoding doesn't default to utf-8. Hence, the first one is preferable.

@sabilmakbar
Copy link
Collaborator

sabilmakbar commented May 31, 2024

Since both reviewers have given the approver (and the dataloader assignee best not to merge it themselves even though it has been approved), I'll proceed with the merging.

fyi @khelli07 @ljvmiranda921 @SamuelCahyawijaya

@sabilmakbar sabilmakbar merged commit d873404 into SEACrowd:master May 31, 2024
1 check passed
@fhudi fhudi deleted the dl-up2 branch May 31, 2024 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create dataset loader for UP2.0
7 participants