-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #16 | Create dataset loader for ALT Burmese Treebank #297
Closes #16 | Create dataset loader for ALT Burmese Treebank #297
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @MJonibek, when trying with this command python -m tests.test_seacrowd seacrowd/sea_datasets/alt_burmese_treebank/alt_burmese_treebank.py
, the script throws an error:
======================================================================
FAIL: runTest (__main__.TestDataLoader) [IDs globally unique]
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/holylovenia/Projects/seacrowd-datahub/tests/test_seacrowd.py", line 71, in runTest
self.test_are_ids_globally_unique(dataset_seacrowd)
File "/Users/holylovenia/Projects/seacrowd-datahub/tests/test_seacrowd.py", line 234, in test_are_ids_globally_unique
self._assert_ids_globally_unique(example, ids_seen=ids_seen)
File "/Users/holylovenia/Projects/seacrowd-datahub/tests/test_seacrowd.py", line 211, in _assert_ids_globally_unique
self._assert_ids_globally_unique(v, ids_seen)
File "/Users/holylovenia/Projects/seacrowd-datahub/tests/test_seacrowd.py", line 219, in _assert_ids_globally_unique
self.assertNotIn(v, ids_seen)
AssertionError: 'SNT.80188.1' unexpectedly found in {'SNT.80188.1'}
Everything else looks good to me based on manual check. 👍
Hi, I see, problem with unique id, I will solve this issue this week |
@holylovenia Done. Found the reason of this bug, during creating subnodes they have their ids, and for every new sentence id for subnodes was starting from 0 again, as the result subnodes from different sentences had same ids. Made some changes for subnodes ids: instead of id 0 it will be sent_id_0. Checked manually and with test.py, everything worked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your hard work and the bug explanation, @MJonibek. 👍 It makes a lot of sense. All is well during my test and review process. I'm upping the dataloader points with a +2 bonus considering the high difficulty. Thank you for tackling this dataloader.
For now, let's wait for @sabilmakbar's review.
Re-assigning the reviewer from @sabilmakbar to @SamuelCahyawijaya to give more breathing room to @sabilmakbar. 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @MJonibek, thank you for implementing the dataloader.
I found one issue where the subnodes of the root node do not correspond to any node's id (they have a completely different format). I think there is a minor formatting error on L49
of alt_burmese_treebank.py
sub_node_ids.extend([i + 1 for i in range(len(root_subnodes))]) | ||
root_text = extract_sentence(root_sent) | ||
|
||
nodes.append({"id": f"{sentence_id+'.'+str(0)}", "type": "ROOT", "text": root_text, "offsets": [0, len(root_text) - 1], "subnodes": [f"{len(nodes)+i+1}" for i in range(len(sub_nodes))]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subnodes naming is different to the one on L69
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it seems I forgot to change the format on that line. Thank you, will ping you when it is fixed.
@SamuelCahyawijaya , fixed this bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Closes #16
Note:
PR 295 is related to this PR and issue it closes
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py
(please use only lowercase and underscore for dataset naming)._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
.