Create dataset loader for ALT Burmese Treebank #16

Closed
SamuelCahyawijaya opened this issue Nov 1, 2023 · 12 comments · Fixed by #297
Labels: help wanted, pr-ready

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Nov 1, 2023

Dataloader name: alt_burmese_treebank/alt_burmese_treebank.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?alt_burmese_treebank

Dataset: alt_burmese_treebank
Description: A 20,000-sentence Burmese (Myanmar) treebank on news articles containing complete phrase structure annotation. As the final result of the Burmese component in the Asian Language Treebank Project, this is the first large-scale, open-access treebank for the Burmese language.
Subsets: -
Languages: mya
Tasks: Constituency Parsing
License: Creative Commons Attribution Non Commercial Share Alike 4.0 (cc-by-nc-sa-4.0)
Homepage: https://zenodo.org/records/3463010
HF URL: -
Paper URL: https://dl.acm.org/doi/10.1145/3373268
SamuelCahyawijaya converted this from a draft issue on Nov 1, 2023
@gagan3012

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Collaborator

Hi @gagan3012, may I know the current status of this dataloader? Feel free to discuss here if you run into any difficulties, thanks!

@holylovenia
Contributor

Hi @gagan3012, we received no response from you regarding this dataloader, so I will remove your assignment.

Anyone interested in taking this dataloader, please feel free to #self-assign.

holylovenia added the help wanted label on Dec 10, 2023
@minghao-wu

#self-assign

minghao-wu removed their assignment on Dec 10, 2023
@MJonibek
Collaborator

#self-assign

@MJonibek
Collaborator

Could you please advise which schema I should use for this dataset?

Here is an example from the dataset:

SNT.42638.98 (ROOT (NOUN (NOUN (VERB (verb ပူးပေါင်း) (part ဖို့) ) (NOUN (NOUN (noun (noun ပါတီ) (part တွေ) ) (adp ရဲ့) ) (noun (verb ငြင်းဆို) (part မှု) ) ) ) (adp ကို) ) (VERB (VERB (VERB (punct ") (noun (verb စိတ်ပျက်) (part စရာ) ) (punct ") ) (part လို့) ) (VERB (NOUN (noun (noun ပီတာ) (noun ဟိန်း) ) (adp က) ) (verb (verb ခေါ်) (verb ဆို) (part ပါ) (part တယ်) ) ) ) (punct ။) )
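
For readers following along, here is a minimal sketch of how a line like the one above could be split into its sentence ID and its (tag, token) leaves. It is not taken from the eventual dataloader. It assumes the ID and the tree are separated by whitespace and that no token contains spaces or parentheses; the helper names `split_line` and `leaves` are hypothetical, and the sample line is an abbreviated version of the example above.

```python
import re

# A leaf looks like "(tag token)": a tag, one space, and a token without
# whitespace or parentheses (an assumption about the data).
LEAF = re.compile(r"\((\S+) ([^()\s]+)\)")


def split_line(line):
    """Split a raw line into (sentence_id, bracketed_tree)."""
    sent_id, tree = line.strip().split(None, 1)
    return sent_id, tree


def leaves(tree):
    """Return the (tag, token) pairs of all leaf nodes, in sentence order."""
    return LEAF.findall(tree)


# Abbreviated version of the example line (hypothetical shortening).
line = "SNT.42638.98 (ROOT (NOUN (noun (noun ပါတီ) (part တွေ) ) (adp ရဲ့) ) (punct ။) )"
sent_id, tree = split_line(line)
pairs = leaves(tree)
tokens = [token for _, token in pairs]
```

This only recovers the lowest level of the tree (the part that would map onto "entities" in a kb-style schema); the phrase-level nodes such as NOUN and ROOT need a real bracket parser.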

@holylovenia
Contributor

Hi @MJonibek!! Nice to e-meet you again~ Could you please take a look at the kb schema and let me know what you think?

@MJonibek
Collaborator

MJonibek commented Dec 26, 2023

Hi @holylovenia, nice to meet you too :)

Regarding the kb schema, I am not sure it is possible to transform this data into that format. Maybe we could use "entities" for the lowest level (noun, part, punct), but I am not sure how to represent the other levels of the tree (VERB, NOUN, ROOT).

Maybe we need to use a schema like this:
{
    "id": datasets.Value("string"),
    "passage": {
        "id": datasets.Value("string"),
        "type": datasets.Value("string"),
        "text": datasets.Sequence(datasets.Value("string")),
        "offsets": datasets.Sequence([datasets.Value("int32")]),
    },
    "nodes": [{
        "id": datasets.Value("string"),
        "type": datasets.Value("string"),  # noun, verb, punct or VERB, NOUN, ROOT
        "text": datasets.Value("string"),
        "offsets": [datasets.Value("int32"), datasets.Value("int32")],
        "subnodes": datasets.Sequence({
            "id": datasets.Sequence(datasets.Value("string")),  # ids of the nodes that are subnodes of the current node
        }),
    }]
}
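
To make the proposal concrete, the sketch below shows one way a bracketed tree could be flattened into a list of nodes with the fields named above (id, type, text, offsets, subnodes). It is an illustration under assumptions, not the merged implementation: node ids are pre-order indices, offsets are character positions in the space-joined sentence, `subnodes` is simplified to a plain list of child ids, and the function name `parse_tree` is hypothetical. It also assumes well-formed, balanced input.

```python
import re

# Tokens are "(", ")", or any run of characters without whitespace/parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(tree):
    """Flatten a bracketed tree into (sentence, nodes), nodes in pre-order."""
    tokens = TOKEN.findall(tree)
    nodes, words = [], []
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        node = {"id": str(len(nodes)), "type": tokens[pos], "subnodes": []}
        pos += 1
        nodes.append(node)
        first_word = len(words)
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node["subnodes"].append(read_node()["id"])  # nested phrase
            else:
                words.append(tokens[pos])  # surface token of a leaf
                pos += 1
        pos += 1  # consume ")"
        node["span"] = (first_word, len(words))  # word-index span, exclusive end
        return node

    read_node()

    # Convert word-index spans into character offsets in the joined sentence.
    sentence = " ".join(words)
    starts, cursor = [], 0
    for word in words:
        starts.append(cursor)
        cursor += len(word) + 1
    for node in nodes:
        first, last = node.pop("span")
        if first == last:  # node without words (not expected in this data)
            node["offsets"], node["text"] = [0, 0], ""
            continue
        begin, end = starts[first], starts[last - 1] + len(words[last - 1])
        node["offsets"] = [begin, end]
        node["text"] = sentence[begin:end]
    return sentence, nodes


# Abbreviated version of the example tree from this thread.
tree = "(ROOT (NOUN (noun (noun ပါတီ) (part တွေ) ) (adp ရဲ့) ) (punct ။) )"
sentence, nodes = parse_tree(tree)
```

Keeping the nodes flat and linking them through ids avoids a recursive feature type, at the cost of one extra lookup when rebuilding the tree.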

@MJonibek
Collaborator

MJonibek commented Jan 4, 2024

Hi @holylovenia, can you please comment on the proposed schema? If this is ok, I will create this dataloader using this schema.

@holylovenia
Contributor

Hi @MJonibek! Sorry for the late reply.

I've discussed this with @SamuelCahyawijaya and this schema looks great to us! Could you please make a PR for this tree schema and the CONSTITUENCY_PARSING task? 🙏

@MJonibek
Collaborator

MJonibek commented Jan 5, 2024

Great, I will try to do it by the end of this week.

sabilmakbar added the pr-ready label on Jan 7, 2024
SamuelCahyawijaya added a commit that referenced this issue Jan 9, 2024
Related #16 | Add Tree schema and CONSTITUENCY_PARSING task
SamuelCahyawijaya added a commit that referenced this issue Feb 5, 2024
Closes #16 | Create dataset loader for ALT Burmese Treebank
Projects
Status: Done
6 participants