Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create dataset loader for SREDFM #48

Closed
SamuelCahyawijaya opened this issue Nov 14, 2023 · 20 comments · Fixed by #495
Closed

Create dataset loader for SREDFM #48

SamuelCahyawijaya opened this issue Nov 14, 2023 · 20 comments · Fixed by #495
Assignees
Labels
bonus +1 pr-ready A PR that closes this issue is Ready to be reviewed

Comments

@SamuelCahyawijaya
Copy link
Collaborator

SamuelCahyawijaya commented Nov 14, 2023

Dataloader name: sredfm/sredfm.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sredfm

Dataset sredfm
Description SREDFM is an automatically annotated dataset for relation extraction task covering 18 languages, 400 relation types, 13 entity types, totaling more than 40 million triplet instances. SREDFM includes Vietnamnese.
Subsets SREDFM_vi
Languages vie
Tasks Relation Extraction
License Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage https://github.com/babelscape/rebel
HF URL https://huggingface.co/datasets/Babelscape/SREDFM
Paper URL https://aclanthology.org/2023.acl-long.237/
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Nov 14, 2023
@sabilmakbar
Copy link
Collaborator

This is interesting, I'll take this first and see whether this can be done under a week, else I might release the task to others.

Btw, @SamuelCahyawijaya do we have a increasing bonus system after some time if the issue hasn't been picked up by anyone?

@sabilmakbar
Copy link
Collaborator

#self-assign

@sabilmakbar
Copy link
Collaborator

sabilmakbar commented Dec 2, 2023

I'll start on this once #62 has been reviewed. Will start it by next week.

Copy link

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Copy link
Collaborator

sabilmakbar commented Dec 18, 2023

Will try to get this done by EoW (since my other dataloader has been merged recently)

Copy link

github-actions bot commented Jan 2, 2024

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Copy link
Collaborator

Ugh, didn't have the chance to do this. will release this and see if anyone else can take this instead.

@sabilmakbar sabilmakbar removed their assignment Jan 3, 2024
@sabilmakbar sabilmakbar added help wanted Extra attention is needed bonus +1 and removed staled-issue labels Jan 3, 2024
@khelli07
Copy link
Collaborator

khelli07 commented Jan 4, 2024

#self-assign

Copy link

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@khelli07
Copy link
Collaborator

Hi, I did worked on this. The dataloader works, but apparently the test are hard to get passed, especially the seacrowd schema one. IIRC, I was having issues with IDs (duplicate). Currently have no time yet to fix it.

Copy link

github-actions bot commented Feb 4, 2024

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@khelli07
Copy link
Collaborator

khelli07 commented Feb 5, 2024

Currently discussing it with @SamuelCahyawijaya

@khelli07
Copy link
Collaborator

Basically I got two problems

  1. Spacing problem (the text and text offset asserted not equal. The content is the same, but the spacing is different)
    image

  2. Id uniqueness problem
    image

For second problem, I am still not sure why because:

  • For source schema, I assume it only checks the yielded id (the yield ..., { ... } at the example loop)
  • So for ID uniqueness, it is most likely coming from the seacrowd shema, BUT assuming that this is the ids the test check
    "id": example["docid"], --> skip the same doc id
    "passages": passages, --> use custom id (counter)
    "entities": entities, --> counter
    "relations": relations, --> counter

Can look at the current full code here

@holylovenia
Copy link
Contributor

Basically I got two problems

  1. Spacing problem (the text and text offset asserted not equal. The content is the same, but the spacing is different)
    image
  2. Id uniqueness problem
    image

For second problem, I am still not sure why because:

  • For source schema, I assume it only checks the yielded id (the yield ..., { ... } at the example loop)
  • So for ID uniqueness, it is most likely coming from the seacrowd shema, BUT assuming that this is the ids the test check
    "id": example["docid"], --> skip the same doc id
    "passages": passages, --> use custom id (counter)
    "entities": entities, --> counter
    "relations": relations, --> counter

Can look at the current full code here

Hi @khelli07, are you still discussing it with @SamuelCahyawijaya or do you guys need another pair of eyes?

@khelli07
Copy link
Collaborator

I think @SamuelCahyawijaya is currently busy. Might need another help.

@sabilmakbar sabilmakbar added the in-progress Assignee has given confirmation on progress and ETA label Feb 25, 2024
@sabilmakbar
Copy link
Collaborator

For 1, I believe the text_offset was generated from the text field, but I saw on the offset text that a new line isn't present in the original text. Is it expected?

For 2, when I rechecked the schema, it checked the IDs defined on the example level to ensure they were unique. In your implementation, you're defining the entity & relation ID only by specifying an iteration counter, which leads to duplication on the check. Perhaps the workaround will be similar to what you did on the passage ID, appending it to some unique identifier for distinguishing entity ID and relation ID.

@khelli07
Copy link
Collaborator

khelli07 commented Mar 2, 2024

OK, will take a look into this again in near time.

@khelli07
Copy link
Collaborator

khelli07 commented Mar 9, 2024

For number 2) it is resolved now. I think I misunderstood the concept id at first.

Now, for problem number 1) -> yes, the problem is in the newline. Content-wise is the same

@khelli07
Copy link
Collaborator

khelli07 commented Mar 9, 2024

image
omg, I passed the tests 😂

The problem actually lies in the real dataset. The entities from the source datasets have "tidy" whitespaces (they don't have newlines). If you think about it, if the real passages are messy, the entity taken from it should also be messy.

That being said, I suspect the source dataset does not actually take the entity from the passage cause the passage is a bit chaotic (in terms of whitespaces).

@holylovenia
Copy link
Contributor

image omg, I passed the tests 😂

The problem actually lies in the real dataset. The entities from the source datasets have "tidy" whitespaces (they don't have newlines). If you think about it, if the real passages are messy, the entity taken from it should also be messy.

That being said, I suspect the source dataset does not actually take the entity from the passage cause the passage is a bit chaotic (in terms of whitespaces).

Awesome, hahaha. Glad to hear that you've resolved the issue! 👍

@holylovenia holylovenia added pr-ready A PR that closes this issue is Ready to be reviewed and removed help wanted Extra attention is needed in-progress Assignee has given confirmation on progress and ETA labels Mar 11, 2024
holylovenia pushed a commit that referenced this issue Apr 27, 2024
* [New Feature] Add SREDFM dataloader (temp)

* [Fix] Inequal string and unique id bug for SREDFM Dataloader

* [Fix] Refactor based on reviews

* [Fix] Remove redundant RE task in constants.py

* [Fix] Implement reviews

* [Fix] Implement review feedbacks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bonus +1 pr-ready A PR that closes this issue is Ready to be reviewed
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants