Closes #339 | Update dataloader for Leipzig #483

TysonYu · 2024-03-03T12:39:15Z

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.


SamuelCahyawijaya

Hi @TysonYu, thank you for your contribution! The dataset looks good. Nonetheless, there are 2 things that need to be updated:

We made a small typo in the issue name: the data loader name should be leipzig_corpora instead of leipzig_copora. Could you change the folder and file name accordingly?
Could you please add a subset for each language, so that the dataloader can be used to download only language-specific data?

Thank you!

SamuelCahyawijaya · 2024-03-25T08:22:29Z

seacrowd/sea_datasets/leipzig_copora/leipzig_corpora.py

+    SOURCE_VERSION = datasets.Version(_SOURCE_VERSION)
+    SEACROWD_VERSION = datasets.Version(_SEACROWD_VERSION)
+
+    BUILDER_CONFIGS = [


Can you add per-language subsets so that it can be useful as a source of monolingual pretraining data?

How do I add a subset? Could you give an example?

Hi @TysonYu, sorry for the late reply. I think it should be similar to how we define the monolingual subsets in cc100.py, where we have the combined source and seacrowd_ssp subsets and the per-language subsets:

seacrowd-datahub/seacrowd/sea_datasets/cc100/cc100.py

Lines 164 to 199 in a19097e

def seacrowd_config_constructor(lang, schema, version):
    """Construct SEACrowdConfig with cc100_{lang}_{schema} as the name format."""
    if schema != "source" and schema != f"seacrowd_{_SEACROWD_SCHEMA_NAME}":
        raise ValueError(f"Invalid schema: {schema}")

    if lang == "":
        return SEACrowdConfig(
            name=f"cc100_{schema}",
            version=datasets.Version(version),
            description=f"CC100 with {schema} schema for all languages",
            schema=schema,
            subset_id="cc100",
        )
    elif lang in _LANGUAGES:
        return SEACrowdConfig(
            name=f"cc100_{lang}_{schema}",
            version=datasets.Version(version),
            description=f"CC100 with {schema} schema for {lang} language",
            schema=schema,
            subset_id="cc100",
        )
    else:
        raise ValueError(f"Invalid language: {lang}. Choose one of these languages: {_LANGUAGES}.")


class CC100(datasets.GeneratorBasedBuilder):
    """Monolingual Datasets from Web Crawl Data."""

    BUILDER_CONFIGS = (
        [seacrowd_config_constructor(lang, "source", _SOURCE_VERSION) for lang in _LANGUAGES_MAP]
        + [seacrowd_config_constructor(lang, f"seacrowd_{_SEACROWD_SCHEMA_NAME}", _SEACROWD_VERSION) for lang in _LANGUAGES_MAP]
        + [
            seacrowd_config_constructor("", "source", _SOURCE_VERSION),
            seacrowd_config_constructor("", f"seacrowd_{_SEACROWD_SCHEMA_NAME}", _SOURCE_VERSION),
        ]
    )

holylovenia · 2024-04-08T08:25:25Z

A friendly reminder to follow up, @TysonYu @raileymontalan.

raileymontalan

Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py) and provide per-language subsets?
Other than that, the code LGTM. Thanks!

TysonYu · 2024-04-15T15:44:27Z

Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py) and provide per-language subsets? Other than that, the code LGTM. Thanks!
Done~

raileymontalan

The _DATASETNAME and DEFAULT_CONFIG_NAME variables were reverted to copora again. Please change them back to corpora. Thanks.

holylovenia · 2024-04-27T13:22:39Z

Hi @raileymontalan and @SamuelCahyawijaya, I changed the "copora" to "corpora". Please feel free to let @TysonYu know if other changes are required.

raileymontalan · 2024-05-06T02:22:00Z

Hi @TysonYu, are you working on creating subsets per language, as per @SamuelCahyawijaya's request?

holylovenia · 2024-05-13T07:27:03Z

Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

holylovenia · 2024-05-30T04:39:00Z

Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

holylovenia · 2024-07-08T06:11:44Z

Hi @TysonYu, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @SamuelCahyawijaya @raileymontalan

TysonYu added 6 commits January 22, 2024 09:13

add dataloader for indonesian_madurese_bible_translation

c6acc8f

update the license of indonesian_madurese_bible_translation

7d29494

Update indonesian_madurese_bible_translation.py

6593318

modify based on comments from holylovenia

99a5bc6

[indonesian_madurese_bible_translation]

a6b903b

update dateloader for leipzig_copora

717e8be

TysonYu requested review from holylovenia, SamuelCahyawijaya, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, jensan-1, danjohnvelasco and MJonibek as code owners March 3, 2024 12:39

TysonYu added 2 commits March 3, 2024 20:39

Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/__…

732c51a

…init__.py

Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/in…

462ac4c

…donesian_madurese_bible_translation.py

holylovenia changed the title ~~Closes #339 update dataloader for Leipzig~~ Closes #339 | Update dataloader for Leipzig Mar 11, 2024

holylovenia linked an issue Mar 11, 2024 that may be closed by this pull request

Create dataset loader for Leipzig Corpora Collection #339

Open

holylovenia requested review from raileymontalan and removed request for gentaiscool, jamesjaya, ljvmiranda921, holylovenia, yongzx, MJonibek, danjohnvelasco, jensan-1 and sabilmakbar March 18, 2024 09:20

holylovenia assigned SamuelCahyawijaya and raileymontalan Mar 18, 2024

Update and rename leipzig_copora.py to leipzig_corpora.py

0999fe1

TysonYu requested a review from tellarin as a code owner March 25, 2024 08:14

SamuelCahyawijaya requested changes Mar 25, 2024


holylovenia removed the request for review from tellarin March 25, 2024 09:05

raileymontalan requested changes Apr 9, 2024


rename leipzig_corpora to leipzig_copora

38e5e89

rename leipzig_copora to leipzig_corpora

0512991

raileymontalan requested changes Apr 22, 2024


Change "copora" to "corpora"

d665862

Update leipzig_corpora.py

d87fadb

github-actions bot added the need-fu-pr label Jun 14, 2024

github-actions bot removed the need-fu-pr label Jul 9, 2024

github-actions bot added the need-fu-pr label Jul 23, 2024


