-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #4 | Extend & Merge carryover ud dataloaders #247
Conversation
Hi @ijindal, thanks for the contribution in constructing a dataloader for the SEACrowd Project! At a glance, the implementation looks okay. However, it's not aligned with SEACrowd Dataloaders. In SEACrowd, one dataloader script is intended for one dataset, not a subset of a dataset. Hence, the expected code is a Dataset Loading Script where all subsets mentioned in the issue ticket are implemented. In that case, would you like to revise the implementation? Many thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the contribution. I suggest to separate the change for the global util in a new PR. @SamuelCahyawijaya @holylovenia fyi
I agree with @sabilmakbar's suggestion. Let's make them separate. |
Hi, I am working on fixing these issue. However, I have one comment on UD parse. Do we really want separate folder for each dataset such as ud_vi_vtb? If we go by this it will blow up the number of folders in the data repo (Assuming more addition of SEA languages in UD) |
Thank for this pointer. I have updated the reader as per the pointer. |
@sabilmakbar please review it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reviews after trying to run the scripts
idk why the tagalog subsets can't be loaded (prob due to different .conll format). I'll take a look on it.
seacrowd/sea_datasets/ud/ud.py
Outdated
|
||
|
||
|
||
_SUPPORTED_TASKS = [Tasks.POS_TAGGING] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously we have DEPENDENCY_PARSING
and MACHINE_TRANSLATION
tasks for this dataloader, which u can refer to the following dataloaders:
- https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/ud_jv_csui/ud_jv_csui.py
- https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/ud_id_csui/ud_id_csui.py
while the first two have all 3 tasks, the following dataloaders are also included in this dataset, but only DEPENDENCY_PARSING
has been implemented previously. do you have any take whether it shd be extended to 2 other tasks, @holylovenia?
3. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/indolem_ud_id_gsd/indolem_ud_id_gsd.py
4. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/indolem_ud_id_pud/indolem_ud_id_pud.py
Hi @ijindal, do you think we will be able to finish it by the end of the month? I can help modify the dataloaders accordingly if you can't (and the point will still be fully attributed to yours regardless). Also, I think @elyanah-aco will be away next week, right? (So we can pass it on to another reviewer if she hasn't provided any approval by then.) |
changing it to +3 with the consent from @holylovenia after discussing refactor & expanding subset efforts on this |
Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Co-authored-by: Salsabil Maulana Akbar <[email protected]>
@sabilmakbar Yeah, I have provided the fix as per the comments. Let me know what tasks are remaining I can finish by end of this month. |
Also, I find that #592 issue is very similar to this issue. I do not think that it requires a separate loader. If you feel the same I can add the conll17 loader in this loader as well |
Hi @ijindal, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
Hi @ijindal, I've created an extended version that works for the other two tasks (but some subsets are yet to be implemented in all three tasks). If you don't mind, I'll code-diff it and push it to this branch (to speed up the process). I'll also add the test output log for each task (with subsets adjusted accordingly due to coverage). Also, for one of the subsets
Prob we can think abt it later (not on this issue, tho). |
extend `ud` dataloader to multiple tasks, adjust data loading methods based on existing dataloaders, and add custom citations per subsets
my test-run result (in log file): how to run ( SUBSET_CONFIG=(id_csui id_pud id_gsd vi_vtb tl_trg tl_ugnayan)
schema="SEQ_LABEL"
for val in ${!SUBSET_CONFIG[@]}; do
subset=${SUBSET_CONFIG[$val]}
echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done
SUBSET_CONFIG=(id_csui id_gsd vi_vtb tl_trg)
schema="KB"
for val in ${!SUBSET_CONFIG[@]}; do
subset=${SUBSET_CONFIG[$val]}
echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done
SUBSET_CONFIG=(id_csui id_pud tl_trg tl_ugnayan)
schema="SEQ_LABEL"
for val in ${!SUBSET_CONFIG[@]}; do
subset=${SUBSET_CONFIG[$val]}
echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done |
for anyone wondering why the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Each subset has passed the tests.
The log output following previous review is attached: test_ud.log.
Passed all the checklist.
As per discussed previously, this PR won't include Javanese UD (and probably other subsets) yet, and it is left for another PR.
I think this PR is good to merge.
cc: @holylovenia
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Closes #4
Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py
(please use only lowercase and underscore for dataset naming)._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
.