Create dataset loader for ALT Burmese Treebank #16

SamuelCahyawijaya · 2023-11-01T15:39:35Z

Dataloader name: alt_burmese_treebank/alt_burmese_treebank.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?alt_burmese_treebank

Dataset	alt_burmese_treebank
Description	A 20,000-sentence Burmese (Myanmar) treebank on news articles containing complete phrase structure annotation. As the final result of the Burmese component in the Asian Language Treebank Project, this is the first large-scale, open-access treebank for the Burmese language.
Subsets	-
Languages	mya
Tasks	Constituency Parsing
License	Creative Commons Attribution Non Commercial Share Alike 4.0 (cc-by-nc-sa-4.0)
Homepage	https://zenodo.org/records/3463010
HF URL	-
Paper URL	https://dl.acm.org/doi/10.1145/3373268

gagan3012 · 2023-11-03T22:04:00Z

#self-assign

github-actions · 2023-11-24T02:02:22Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar · 2023-12-01T14:34:03Z

Hi @gagan3012, may I know the current status of this dataloader? Feel free to discuss it here if you run into any difficulties, thanks!

holylovenia · 2023-12-10T10:06:27Z

Hi @gagan3012, we received no response from you regarding this dataloader, so I will remove your assignment.

Anyone interested in taking this dataloader, please feel free to #self-assign.

minghao-wu · 2023-12-10T17:15:39Z

#self-assign

MJonibek · 2023-12-18T15:29:20Z

#self-assign

MJonibek · 2023-12-25T17:10:08Z

Could you please advise which schema I should use for this dataset?

Here is an example from the dataset:

SNT.42638.98 (ROOT (NOUN (NOUN (VERB (verb ပူးပေါင်း) (part ဖို့) ) (NOUN (NOUN (noun (noun ပါတီ) (part တွေ) ) (adp ရဲ့) ) (noun (verb ငြင်းဆို) (part မှု) ) ) ) (adp ကို) ) (VERB (VERB (VERB (punct ") (noun (verb စိတ်ပျက်) (part စရာ) ) (punct ") ) (part လို့) ) (VERB (NOUN (noun (noun ပီတာ) (noun ဟိန်း) ) (adp က) ) (verb (verb ခေါ်) (verb ဆို) (part ပါ) (part တယ်) ) ) ) (punct ။) )

holylovenia · 2023-12-26T03:50:29Z

Hi @MJonibek!! Nice to e-meet you again~ Could you please take a look at the kb schema and let me know what you think?

MJonibek · 2023-12-26T11:23:18Z

Hi @holylovenia, nice to meet you too :)

Regarding the kb schema, I am not sure it is possible to transform this data into that format. Maybe we could use "entities" for the lowest level (like noun, part, punct), but I am not sure how to represent the other levels of the tree (like VERB, NOUN, ROOT).

Maybe we need to use a schema like this:
{
    "id": datasets.Value("string"),
    "passage": {
        "id": datasets.Value("string"),
        "type": datasets.Value("string"),
        "text": datasets.Sequence(datasets.Value("string")),
        "offsets": datasets.Sequence([datasets.Value("int32")]),
    },
    "nodes": [{
        "id": datasets.Value("string"),
        # node label: noun, verb, punct or VERB, NOUN, ROOT
        "type": datasets.Value("string"),
        "text": datasets.Value("string"),
        "offsets": [datasets.Value("int32"), datasets.Value("int32")],
        "subnodes": datasets.Sequence({
            # ids of the nodes that are subnodes of the current node
            "id": datasets.Sequence(datasets.Value("string")),
        }),
    }]
}
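
To make the idea concrete, here is a rough sketch of how one bracketed line like the example above might be flattened into records of this shape. It is only an illustration, not the final dataloader: the function name `parse_tree`, the node id format (`<sentence id>.<index>`), the passage `type` value, and the choice to compute offsets against the leaf words joined by single spaces are all my assumptions. It uses plain Python dicts rather than `datasets` features, and `subnodes` is simplified to a flat list of ids.

```python
import re
from typing import Dict, List

# A token is an opening bracket, a closing bracket, or any run of
# characters that is neither whitespace nor a bracket.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(line: str) -> Dict:
    """Flatten one 'SENT_ID (ROOT ...)' line into a dict of nodes (hypothetical helper)."""
    sent_id, _, bracketed = line.strip().partition(" ")
    tokens = TOKEN_RE.findall(bracketed)
    nodes: List[Dict] = []
    i = 0       # position in the token list
    cursor = 0  # character offset where the next leaf word starts

    def parse_node() -> Dict:
        nonlocal i, cursor
        assert tokens[i] == "(", "every node must start with '('"
        node = {
            "id": f"{sent_id}.{len(nodes)}",  # assumed id format
            "type": tokens[i + 1],            # e.g. ROOT, NOUN, noun, part
            "text": "",
            "offsets": [0, 0],
            "subnodes": [],
        }
        nodes.append(node)
        i += 2
        if tokens[i] == "(":
            # Internal node: parse children and span their combined text.
            children = []
            while tokens[i] == "(":
                children.append(parse_node())
            node["subnodes"] = [child["id"] for child in children]
            node["text"] = " ".join(child["text"] for child in children)
            node["offsets"] = [children[0]["offsets"][0], children[-1]["offsets"][1]]
        else:
            # Leaf node: everything up to the closing bracket is the word.
            parts = []
            while tokens[i] != ")":
                parts.append(tokens[i])
                i += 1
            word = " ".join(parts)
            node["text"] = word
            node["offsets"] = [cursor, cursor + len(word)]
            cursor += len(word) + 1  # +1 for the joining space
        assert tokens[i] == ")", "every node must end with ')'"
        i += 1
        return node

    root = parse_node()
    return {
        "id": sent_id,
        "passage": {
            "id": f"{sent_id}.passage",  # assumed id format
            "type": "sentence",          # assumed passage type
            "text": [root["text"]],
            "offsets": [[0, len(root["text"])]],
        },
        "nodes": nodes,
    }


example = parse_tree("SNT.1.1 (ROOT (NOUN (noun ပါတီ) (part တွေ) ) (punct ။) )")
print([node["type"] for node in example["nodes"]])
```

With this layout each node's `offsets` slice the passage text back to the node's own `text`, and the tree can be rebuilt by following `subnodes` from the first node (the root).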

MJonibek · 2024-01-04T08:44:43Z

Hi @holylovenia, can you please comment on the proposed schema? If this is ok, I will create this dataloader using this schema.

holylovenia · 2024-01-05T05:46:35Z

Hi @MJonibek! Sorry for the late reply.

I've discussed this with @SamuelCahyawijaya and this schema looks great to us! Could you please make a PR for this tree schema and the CONSTITUENCY_PARSING task? 🙏

MJonibek · 2024-01-05T13:17:43Z

Great, I will try to do it by the end of this week.

SamuelCahyawijaya added this to SEACrowd Data Hub Nov 1, 2023

SamuelCahyawijaya converted this from a draft issue Nov 1, 2023

SamuelCahyawijaya assigned gagan3012 Nov 4, 2023

github-actions bot added the no-issue-activity label Nov 24, 2023

holylovenia removed the no-issue-activity label Dec 10, 2023

holylovenia unassigned gagan3012 Dec 10, 2023

holylovenia added the "help wanted" label (Extra attention is needed) Dec 10, 2023

github-actions bot assigned minghao-wu Dec 10, 2023

minghao-wu removed their assignment Dec 10, 2023

github-actions bot assigned MJonibek Dec 18, 2023

This was referenced Jan 6, 2024

Related #16 | Add Tree schema and CONSTITUENCY_PARSING task #295

Merged

Closes #16 | Create dataset loader for ALT Burmese Treebank #296

Closed

Closes #16 | Create dataset loader for ALT Burmese Treebank #297

Merged

sabilmakbar added the "pr-ready" label (A PR that closes this issue is ready to be reviewed) Jan 7, 2024

SamuelCahyawijaya added a commit that referenced this issue Jan 9, 2024

Merge pull request #295 from MJonibek/tree_schema

91623fe

Related #16 | Add Tree schema and CONSTITUENCY_PARSING task

SamuelCahyawijaya closed this as completed in #297 Feb 5, 2024

SamuelCahyawijaya added a commit that referenced this issue Feb 5, 2024

Merge pull request #297 from MJonibek/alt_burmese_treebank

ca28de5

Closes #16 | Create dataset loader for ALT Burmese Treebank

github-project-automation bot moved this to Done in SEACrowd Data Hub Feb 5, 2024
