Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add config helper (WIP) and missed constants in existing dataloaders #605

Merged
merged 30 commits into from
Jun 19, 2024

Conversation

holylovenia
Copy link
Contributor

  • Derive SEACrowdConfigHelper from NusantaraConfigHelper. Basic function works but haven't modified SEACrowdMetadataHelper, constants, etc.
  • Add some missed constants (e.g., _LANGUAGES and _LOCAL) found during the config helper creation.

Copy link
Collaborator

@SamuelCahyawijaya SamuelCahyawijaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@sabilmakbar sabilmakbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init review, for the constant missed I'll cross-check it with my checker & fix (prob will add the check script here too if needed)

from .utils.constants import Tasks, SCHEMA_TO_TASKS
import pandas as pd

_LARGE_CONFIG_NAMES = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to update it later? or we just write it as-is from NusaCrowd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update later! Might also deprecate this function depending on its use case. Do you have any suggestion regarding this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob we can change the value according to the SEACrowd Dataset Size Info. But since NusaCrowd v1.0 is a subset of SEACrowd, we can keep it as-is for now and update it later


]

BENCHMARK_DICT = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we add SEACrowd-based Benchmark to this later? or could it be done on this review since the dataset picking has been done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sabilmakbar Yes yes. I'll add it on the next PR since we're still reviewing some of the speech dataloaders.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and for benchmark, we can use the datasets used for SEACrowd Experiment, hence the value should be overriden from NC

@sabilmakbar
Copy link
Collaborator

btw @holylovenia is the config still on WIP?

@holylovenia holylovenia merged commit 9db6d22 into master Jun 19, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request need-fu-pr
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants