-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #339 | Update dataloader for Leipzig #483
base: master
Are you sure you want to change the base?
Conversation
…donesian_madurese_bible_translation.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @TysonYu, thank you for your contribution! The dataset looks good, nonetheless, there are 2 things that need to be updated:
- We make a small typo in the issue name, the data loader name should be
leipzig_corpora
instead ofleipzig_copora
, could you change the folder and file name accordingly? - Could you please add subset for different language, so that the dataloader can be use to download only specific-language data?
Thank you!
SOURCE_VERSION = datasets.Version(_SOURCE_VERSION) | ||
SEACROWD_VERSION = datasets.Version(_SEACROWD_VERSION) | ||
|
||
BUILDER_CONFIGS = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add per language subset so that It can be useful as a source of monolingual pertaining data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to add subset? Can you help give an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @TysonYu , sorry for the late reply. I think it should be similar to how we define the monolingual subsets in the cc100.py
where we have the combined source
and seacrowd_ssp
subsets and the per language subsets:
seacrowd-datahub/seacrowd/sea_datasets/cc100/cc100.py
Lines 164 to 199 in a19097e
def seacrowd_config_constructor(lang, schema, version): | |
"""Construct SEACrowdConfig with cc100_{lang}_{schema} as the name format.""" | |
if schema != "source" and schema != f"seacrowd_{_SEACROWD_SCHEMA_NAME}": | |
raise ValueError(f"Invalid schema: {schema}") | |
if lang == "": | |
return SEACrowdConfig( | |
name=f"cc100_{schema}", | |
version=datasets.Version(version), | |
description=f"CC100 with {schema} schema for all languages", | |
schema=schema, | |
subset_id="cc100", | |
) | |
elif lang in _LANGUAGES: | |
return SEACrowdConfig( | |
name=f"cc100_{lang}_{schema}", | |
version=datasets.Version(version), | |
description=f"CC100 with {schema} schema for {lang} language", | |
schema=schema, | |
subset_id="cc100", | |
) | |
else: | |
raise ValueError(f"Invalid language: {lang}. Choose one of these languages: {_LANGUAGES}.") | |
class CC100(datasets.GeneratorBasedBuilder): | |
"""Monolingual Datasets from Web Crawl Data.""" | |
BUILDER_CONFIGS = ( | |
[seacrowd_config_constructor(lang, "source", _SOURCE_VERSION) for lang in _LANGUAGES_MAP] | |
+ [seacrowd_config_constructor(lang, f"seacrowd_{_SEACROWD_SCHEMA_NAME}", _SEACROWD_VERSION) for lang in _LANGUAGES_MAP] | |
+ [ | |
seacrowd_config_constructor("", "source", _SOURCE_VERSION), | |
seacrowd_config_constructor("", f"seacrowd_{_SEACROWD_SCHEMA_NAME}", _SOURCE_VERSION), | |
] | |
) |
A friendly reminder to follow up, @TysonYu @raileymontalan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py
? And provide per-language subsets.
Other than that, the code LGTM. Thanks!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _DATASETNAME
and DEFAULT_CONFIG_NAME
variables ware reverted back to copora
again. Please change again to corpora
. Thanks.
Hi @raileymontalan and @SamuelCahyawijaya, I changed the "copora" to "corpora". Please feel free to let @TysonYu know if other changes are required. |
Hi @TysonYu, are you working on creating subsets per language, as per @SamuelCahyawijaya's request? |
Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
Hi @TysonYu, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again! |
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py
(please use only lowercase and underscore for dataset naming)._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
.