Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #339 | Update dataloader for Leipzig #483

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

TysonYu
Copy link
Collaborator

@TysonYu TysonYu commented Mar 3, 2024

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@holylovenia holylovenia changed the title Closes #339 update dataloader for Leipzig Closes #339 | Update dataloader for Leipzig Mar 11, 2024
@holylovenia holylovenia linked an issue Mar 11, 2024 that may be closed by this pull request
@TysonYu TysonYu requested a review from tellarin as a code owner March 25, 2024 08:14
Copy link
Collaborator

@SamuelCahyawijaya SamuelCahyawijaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @TysonYu, thank you for your contribution! The dataset looks good, nonetheless, there are 2 things that need to be updated:

  1. We make a small typo in the issue name, the data loader name should be leipzig_corpora instead of leipzig_copora, could you change the folder and file name accordingly?
  2. Could you please add subset for different language, so that the dataloader can be use to download only specific-language data?

Thank you!

SOURCE_VERSION = datasets.Version(_SOURCE_VERSION)
SEACROWD_VERSION = datasets.Version(_SEACROWD_VERSION)

BUILDER_CONFIGS = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add per language subset so that It can be useful as a source of monolingual pertaining data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to add subset? Can you help give an example?

Copy link
Collaborator

@SamuelCahyawijaya SamuelCahyawijaya Apr 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @TysonYu , sorry for the late reply. I think it should be similar to how we define the monolingual subsets in the cc100.py where we have the combined source and seacrowd_ssp subsets and the per language subsets:

def seacrowd_config_constructor(lang, schema, version):
"""Construct SEACrowdConfig with cc100_{lang}_{schema} as the name format."""
if schema != "source" and schema != f"seacrowd_{_SEACROWD_SCHEMA_NAME}":
raise ValueError(f"Invalid schema: {schema}")
if lang == "":
return SEACrowdConfig(
name=f"cc100_{schema}",
version=datasets.Version(version),
description=f"CC100 with {schema} schema for all languages",
schema=schema,
subset_id="cc100",
)
elif lang in _LANGUAGES:
return SEACrowdConfig(
name=f"cc100_{lang}_{schema}",
version=datasets.Version(version),
description=f"CC100 with {schema} schema for {lang} language",
schema=schema,
subset_id="cc100",
)
else:
raise ValueError(f"Invalid language: {lang}. Choose one of these languages: {_LANGUAGES}.")
class CC100(datasets.GeneratorBasedBuilder):
"""Monolingual Datasets from Web Crawl Data."""
BUILDER_CONFIGS = (
[seacrowd_config_constructor(lang, "source", _SOURCE_VERSION) for lang in _LANGUAGES_MAP]
+ [seacrowd_config_constructor(lang, f"seacrowd_{_SEACROWD_SCHEMA_NAME}", _SEACROWD_VERSION) for lang in _LANGUAGES_MAP]
+ [
seacrowd_config_constructor("", "source", _SOURCE_VERSION),
seacrowd_config_constructor("", f"seacrowd_{_SEACROWD_SCHEMA_NAME}", _SOURCE_VERSION),
]
)

@holylovenia holylovenia removed the request for review from tellarin March 25, 2024 09:05
@holylovenia
Copy link
Contributor

A friendly reminder to follow up, @TysonYu @raileymontalan.

Copy link
Collaborator

@raileymontalan raileymontalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py? And provide per-language subsets.
Other than that, the code LGTM. Thanks!

@TysonYu
Copy link
Collaborator Author

TysonYu commented Apr 15, 2024

Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py? And provide per-language subsets. Other than that, the code LGTM. Thanks!
Done~

Copy link
Collaborator

@raileymontalan raileymontalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _DATASETNAME and DEFAULT_CONFIG_NAME variables ware reverted back to copora again. Please change again to corpora. Thanks.

@holylovenia
Copy link
Contributor

Hi @raileymontalan and @SamuelCahyawijaya, I changed the "copora" to "corpora". Please feel free to let @TysonYu know if other changes are required.

@raileymontalan
Copy link
Collaborator

Hi @TysonYu, are you working on creating subsets per language, as per @SamuelCahyawijaya's request?

@holylovenia
Copy link
Contributor

holylovenia commented May 13, 2024

Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@holylovenia
Copy link
Contributor

Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@holylovenia
Copy link
Contributor

Hi @TysonYu, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @SamuelCahyawijaya @raileymontalan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create dataset loader for Leipzig Corpora Collection
4 participants