-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #536 | Add/Update Dataloader Onto4All #635
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:
- Somehow the conversation data is empty on both
_source
and_seacrowd_*
schemas
- For the
_seacrowd_*
schema, this is the first time we use QA for chat data. It seems to fit well with the data, perhaps @holylovenia have any feedback on this?
Hi @SamuelCahyawijaya ! Thank you for the review 😄 . Please kindly check the latest commit for the fix. |
Sorry for the late reply, I missed this mention. Is this supposedly to accommodate a multi-turn chat template with the While this What do you think, @patrickamadeus @SamuelCahyawijaya @yongzx? cc: @sabilmakbar |
@holylovenia @patrickamadeus @yongzx @sabilmakbar @patrickamadeus : I kinda agree with the chat format as it is more standardized and also supported in the HuggingFace. In this case, should we propose the new schema and adjust the score accordingly? the schema would be basically consists of
One question though, should we also normalize the |
I'm of this opinion.
Let's normalize it for the |
@patrickamadeus : would it be ok for you to create the new schema, and adjust the dataloader accordingly? |
With pleasure @SamuelCahyawijaya !
I will refer the schema from here for now. |
Hi @SamuelCahyawijaya @holylovenia ! Could you please review the new schema and implementation? I named it |
The schema looks great to me! Let us know if you've separated the schema and new task so we can approve it. |
It's done @SamuelCahyawijaya @holylovenia . |
Could you please link the PR for the new schema and task here, @patrickamadeus? cc: @sabilmakbar because I'll put more focus on the experiments going forward. |
Oops, sorry! I put them altogether here in the last commit 😬 @holylovenia . Should I create a separate PR for it? Sorry for my ignorance. |
Yes, it'd be great if we could have a separate PR. Thanks in advance, @patrickamadeus!! |
Hi! here is the |
Quick question: What's the difference between using this new |
TOD relies on belief state and system act apart from the utterances. In practice, most TOD works are a derivative of or follow the WOZ dataset's style, so it would be better to keep that schema for TOD. |
Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
{ | ||
"id": datasets.Value("string"), | ||
"input": datasets.Sequence({ | ||
"role": datasets.ClassLabel(names=["system", "user", "assistant"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @SamuelCahyawijaya @yongzx just letting you know the changes on schema has been merged to master, but with this role
field being changed to string
(datasets.Value("string")
) due to possibilities of additional/custom roles and HF mechanics that return an indices of the label for their examples had it been set as ClassLabel (which is less intuitive than string)
Hi @patrickamadeus, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again! |
Closes #536
Checkbox
seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py
(please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its__init__.py
within{my_dataset}
folder._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_LOCAL
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
orpython -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}
.TESTS
NOTES
Please use
huggingface-cli
or insert your API_KEY totoken=
parameter inload_dataset
method since this is a gated dataset 😄