Closes #629 | Add/Update Dataloader VLSP2020 MT #642
Conversation
As of now, the dataloader and its metadata work fine. However, there might be an issue with the data split definition, so I'll revisit this once that has been addressed.
It's done! @sabilmakbar, thank you for the review.
By the way, @patrickamadeus, did you happen to observe a different number of samples generated by your dataloader implementation compared to the counts reported in their GH? I noticed two subsets have minor differences in sample counts:
Hi @sabilmakbar! That's interesting, given the counts they mentioned. Since they don't give any clear definition of how each "sentence" is defined, do you think it's reasonable for now to assume one sentence per line? I have reviewed each dataset, and the number of generated examples does match each file's line count.
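The line-count check described above can be sketched as follows. This is a minimal sketch, not the PR's actual code; the sample sentences are invented stand-ins for the corpus files:

```python
from io import StringIO

def count_examples(fileobj):
    # One-sentence-per-line assumption: every non-empty line is one example.
    return sum(1 for line in fileobj if line.strip())

# Invented stand-ins for the parallel vi/en files (real data not reproduced here).
vi = StringIO("xin chao\ncam on\ntam biet\n")
en = StringIO("hello\nthank you\ngoodbye\n")

n_vi, n_en = count_examples(vi), count_examples(en)
assert n_vi == n_en == 3  # parallel files should align line-for-line
```

Under this assumption, the generated example count should equal the file's non-empty line count, which is what was observed.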
Okay then, since everything else matches the reported numbers, we can acknowledge it (probably adding an inline comment would be better).
Thanks for the work, @patrickamadeus! Let's wait for @raileymontalan's review.
Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the review and merge this PR before then. cc: @patrickamadeus
Sorry for directly changing the files to follow the black formatter. Normally I would just request changes, but the code had already passed the tests without formatting, so I changed it directly.
LGTM.
All subsets passed the test.
Passed the reviewing checklist.
Thanks for the hard work.
Merging this soon.
cc: @sabilmakbar
Closes #629
Checkbox
- [x] `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
- [x] `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [x] `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- [x] `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- [x] `datasets.load_dataset` function.
- [x] `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
Tests
- `indomain-news` (3 splits)
- `basic` (1 split)
- `VLSP20-official` (1 split)
NB: skipping `openSub` & `mono-vi` for future development (large Drive file download bottleneck); already opened a thread in the Discord discussion.
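For illustration, the per-line pairing discussed in this PR (a `_generate_examples()`-style generator over parallel MT files) can be sketched as below. This is a hedged sketch, not the merged dataloader: the field names follow the seacrowd text-to-text schema, and the sample sentences are invented.

```python
from io import StringIO

def generate_examples(src_file, tgt_file):
    # Sketch of a _generate_examples-style generator for a parallel corpus,
    # under the one-sentence-per-line assumption discussed above.
    for idx, (src, tgt) in enumerate(zip(src_file, tgt_file)):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip blank lines so counts match non-empty line counts
        yield idx, {"id": str(idx), "text_1": src, "text_2": tgt}

# Invented two-line vi-en snippet for demonstration:
vi = StringIO("xin chao\ncam on\n")
en = StringIO("hello\nthank you\n")
examples = list(generate_examples(vi, en))
assert len(examples) == 2
```

`zip` stops at the shorter file, so a mismatch in line counts between the source and target files would silently drop trailing sentences; the line-count comparison discussed earlier in the thread is the guard against that.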