
Closes #629 | Add/Update Dataloader VLSP2020 MT #642

Merged (5 commits) on May 31, 2024

Conversation

patrickamadeus (Collaborator)

Closes #629

Checklist

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
  • [.] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
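For orientation, the `_generate_examples()` method in the checklist typically looks something like the sketch below for a parallel-text MT dataset. This is a hedged, simplified illustration, not the actual VLSP2020 MT dataloader: the pairing logic, field names (`text_1`, `text_2`), and the one-sentence-per-line assumption are illustrative only.

```python
# Hypothetical sketch of _generate_examples() logic for a machine-translation
# dataset with aligned source/target files, one sentence per line.
# Not the actual SEACrowd implementation; names are assumptions.

def generate_examples(src_path: str, tgt_path: str):
    """Yield (key, example) pairs, one per aligned line pair."""
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        for idx, (src, tgt) in enumerate(zip(f_src, f_tgt)):
            yield idx, {
                "id": str(idx),
                "text_1": src.strip(),  # source-language side
                "text_2": tgt.strip(),  # target-language side
            }
```

In the real dataloader this generator sits inside a `datasets.GeneratorBasedBuilder` subclass, with `_split_generators()` supplying the downloaded file paths and `_info()` declaring the features for both the source and seacrowd schemas.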

Tests

indomain-news (3 splits)

[screenshot: passing test output]

basic (1 split)

[screenshot: passing test output]

VLSP20-official (1 split)

[screenshot: passing test output]

NB

  • Skipping openSub & mono-vi for future development (large Drive file downloads are a bottleneck); a thread has already been opened in the Discord discussion.
  • Tested 3 out of 7 subsets.

@sabilmakbar (Collaborator) left a comment

As of now, the dataloader and its metadata work fine. However, since there might be some issue with the data split definition, I'll revisit it once it has been addressed.

@patrickamadeus (Collaborator, Author)

It's done! @sabilmakbar, thank you for the review ☺️

@sabilmakbar (Collaborator)

By the way, @patrickamadeus, did you happen to observe a different number of samples generated by your dataloader implementation compared to the numbers reported in their GitHub repo?

I noticed two subsets have a minor difference in sample counts:

  1. VLSP20-official: 789 (reported on the source GitHub) vs 790 (generated examples)
  2. wiki-alt: 20000 (reported on the source GitHub) vs 20106 (generated examples)

@patrickamadeus (Collaborator, Author)

Hi @sabilmakbar! That's interesting; they mention having N sentences, but I actually split the dataset on every \n rather than assuming . marks the end of a sentence (which would inflate the counts even further).

Since they don't give a clear definition of what counts as a "sentence", do you think it's reasonable for now to treat each line as one sentence? I have reviewed each dataset, and each subset's generated example count matches its file's line count.
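To make the line-vs-period distinction above concrete, here is a tiny illustration with made-up Vietnamese text (not from the dataset): splitting on newlines yields one example per line, while splitting on periods would fragment multi-sentence lines and inflate the count.

```python
# Made-up sample: two lines, where the second line contains two sentences.
text = "Câu thứ nhất.\nCâu thứ hai. Câu thứ ba.\n"

# One example per line (the approach taken in the dataloader):
per_line = [s for s in text.split("\n") if s.strip()]

# One example per period would split mid-line and inflate the count:
per_period = [s for s in text.split(".") if s.strip()]

print(len(per_line))    # 2
print(len(per_period))  # 3
```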

@sabilmakbar (Collaborator)

Okay then. Since everything else matches the reported numbers, we can acknowledge the discrepancy (adding an inline comment about it would probably be better).

@sabilmakbar (Collaborator)

Thanks for the work, @patrickamadeus! Let's wait for @raileymontalan's review.

@holylovenia (Contributor)

Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @patrickamadeus

@fhudi fhudi assigned fhudi and unassigned raileymontalan May 31, 2024
@fhudi fhudi requested review from fhudi and removed request for raileymontalan May 31, 2024 13:02
@fhudi (Collaborator) left a comment

Sorry for directly changing the files to follow the Black formatter. Normally I would just request changes, but the PR had already passed the tests without the formatting, so I made the change directly.

LGTM.
All subsets passed the test.
Passed the reviewing checklist.
Thanks for the hard work.
Merging this soon.


cc: @sabilmakbar

@fhudi fhudi merged commit 18926d9 into SEACrowd:master May 31, 2024
1 check passed
Successfully merging this pull request may close these issues.

Create dataset loader for VLSP2020 MT
5 participants