Create dataset loader for SREDFM #48

SamuelCahyawijaya · 2023-11-14T11:05:39Z

Dataloader name: sredfm/sredfm.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sredfm

Dataset	sredfm
Description	SREDFM is an automatically annotated dataset for the relation extraction task, covering 18 languages, 400 relation types, and 13 entity types, and totaling more than 40 million triplet instances. SREDFM includes Vietnamese.
Subsets	SREDFM_vi
Languages	vie
Tasks	Relation Extraction
License	Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage	https://github.com/babelscape/rebel
HF URL	https://huggingface.co/datasets/Babelscape/SREDFM
Paper URL	https://aclanthology.org/2023.acl-long.237/

sabilmakbar · 2023-11-20T09:50:23Z

This is interesting. I'll take this first and see whether it can be done in under a week; otherwise I might release the task to others.

Btw, @SamuelCahyawijaya, do we have an increasing bonus system if an issue hasn't been picked up by anyone after some time?

sabilmakbar · 2023-11-20T09:50:33Z

#self-assign

sabilmakbar · 2023-12-02T18:03:19Z

I'll start on this once #62 has been reviewed. Will start it by next week.

github-actions · 2023-12-17T02:07:40Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar · 2023-12-18T17:05:41Z

Will try to get this done by EoW (since my other dataloader has been merged recently)

github-actions · 2024-01-02T02:05:42Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar · 2024-01-03T02:40:09Z

Ugh, I didn't have the chance to do this. I'll release it and see if anyone else can take it instead.

khelli07 · 2024-01-04T09:15:01Z

#self-assign

github-actions · 2024-01-19T02:07:27Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

khelli07 · 2024-01-19T15:56:27Z

Hi, I did work on this. The dataloader works, but the tests are hard to pass, especially the seacrowd schema one. IIRC, I was having issues with duplicate IDs. I currently have no time to fix it.

github-actions · 2024-02-04T02:02:26Z

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

khelli07 · 2024-02-05T05:46:34Z

Currently discussing it with @SamuelCahyawijaya

khelli07 · 2024-02-17T05:38:43Z

Basically, I have two problems:

1. Spacing problem (the text and the text at its offsets are asserted not equal; the content is the same, but the spacing is different)
2. ID uniqueness problem

For the second problem, I am still not sure of the cause, because:

For the source schema, I assume the test only checks the yielded ID (the yield ..., { ... } in the example loop).
So the ID uniqueness problem most likely comes from the seacrowd schema, assuming that these are the IDs the test checks:
"id": example["docid"], --> skip the same doc id
"passages": passages, --> use custom id (counter)
"entities": entities, --> counter
"relations": relations, --> counter

You can look at the current full code here

holylovenia · 2024-02-19T06:55:20Z

Hi @khelli07, are you still discussing it with @SamuelCahyawijaya or do you guys need another pair of eyes?

khelli07 · 2024-02-20T02:46:16Z

I think @SamuelCahyawijaya is currently busy. We might need someone else's help.

sabilmakbar · 2024-02-25T10:25:34Z

For 1, I believe the text_offset was generated from the text field, but I saw that the offset text has a newline that isn't present in the original text. Is that expected?

For 2, when I rechecked the schema test, it checks that the IDs defined at the example level are unique. In your implementation, the entity and relation IDs are defined only by an iteration counter, which leads to duplicates in the check. A possible workaround is similar to what you did for the passage ID: append the counter to some unique identifier so that entity IDs and relation IDs can be told apart.
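
A minimal sketch of that workaround, assuming the uniqueness check spans entity and relation IDs together. The helper names (`assign_ids`, `duplicate_ids`), the `-entity-`/`-relation-` ID format, and the toy records are hypothetical and not taken from the actual dataloader:

```python
from collections import Counter


def assign_ids(doc_id, entities, relations):
    """Return copies of entities/relations whose IDs are namespaced by doc_id."""
    ents = [{**e, "id": f"{doc_id}-entity-{i}"} for i, e in enumerate(entities)]
    rels = [{**r, "id": f"{doc_id}-relation-{i}"} for i, r in enumerate(relations)]
    return ents, rels


def duplicate_ids(examples):
    """Collect IDs that occur more than once across all examples."""
    counts = Counter(
        item["id"]
        for ex in examples
        for key in ("entities", "relations")
        for item in ex[key]
    )
    return sorted(i for i, n in counts.items() if n > 1)


# Toy records (made up for illustration).
raw = [
    {"docid": "doc-a",
     "entities": [{"text": "Hanoi"}, {"text": "Vietnam"}],
     "relations": [{"type": "capital of"}]},
    {"docid": "doc-b",
     "entities": [{"text": "Hue"}],
     "relations": [{"type": "located in"}]},
]

# Counter-only IDs: "0" is reused by entities and relations in every document.
counter_only = [
    {"entities": [{**e, "id": str(i)} for i, e in enumerate(d["entities"])],
     "relations": [{**r, "id": str(i)} for i, r in enumerate(d["relations"])]}
    for d in raw
]

# Namespaced IDs: the document ID and the element kind make each ID distinct.
namespaced = []
for d in raw:
    ents, rels = assign_ids(d["docid"], d["entities"], d["relations"])
    namespaced.append({"entities": ents, "relations": rels})

print(duplicate_ids(counter_only))
print(duplicate_ids(namespaced))
```

With the counter-only scheme the ID "0" shows up in every document and in both element kinds, while the namespaced scheme leaves no duplicates in this toy data.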

khelli07 · 2024-03-02T04:37:48Z

OK, I will take another look at this soon.

khelli07 · 2024-03-09T13:29:21Z

For number 2), it is resolved now. I think I misunderstood the concept of the ID at first.

Now, for problem number 1): yes, the problem is the newline. Content-wise, it is the same.

khelli07 · 2024-03-09T14:21:07Z

omg, I passed the tests 😂

The problem actually lies in the real dataset. The entities from the source dataset have "tidy" whitespace (they don't have newlines). If you think about it, if the real passages are messy, an entity taken from them should also be messy.

That being said, I suspect the source dataset does not actually take the entities from the passage, because the passage is a bit chaotic (in terms of whitespace).
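
To illustrate the mismatch, here is a small sketch with made-up data. The passage, offsets, and helper names (`normalize_ws`, `align_entity_text`) are assumptions for illustration, and taking the entity text from the passage slice is only one possible way to satisfy an offset check, not necessarily what the final dataloader does:

```python
def normalize_ws(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def align_entity_text(passage, start, end, surface):
    """Return the passage slice for an entity if it matches the tidy surface form.

    The slice keeps the passage's original ("messy") whitespace, so it always
    equals passage[start:end], which is what an offset check compares against.
    """
    span = passage[start:end]
    if normalize_ws(span) != normalize_ws(surface):
        raise ValueError(f"offsets {start}:{end} do not match {surface!r}")
    return span


# Made-up example: the passage has a newline inside the entity mention,
# while the annotated surface form has tidy whitespace.
passage = "Ho Chi\nMinh City is in Vietnam."
start, end = 0, 16
surface = "Ho Chi Minh City"

print(passage[start:end] == surface)  # raw comparison
aligned = align_entity_text(passage, start, end, surface)
print(aligned == passage[start:end])  # comparison after alignment
```

The raw comparison fails only because of the newline; once whitespace is normalized, the two strings agree.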

holylovenia · 2024-03-11T07:53:20Z

Awesome, hahaha. Glad to hear that you've resolved the issue! 👍

* [New Feature] Add SREDFM dataloader (temp)
* [Fix] Inequal string and unique id bug for SREDFM Dataloader
* [Fix] Refactor based on reviews
* [Fix] Remove redundant RE task in constants.py
* [Fix] Implement reviews
* [Fix] Implement review feedbacks

SamuelCahyawijaya added this to SEACrowd Data Hub Nov 14, 2023

SamuelCahyawijaya converted this from a draft issue Nov 14, 2023

github-actions bot assigned sabilmakbar Nov 20, 2023

github-actions bot added the staled-issue label Dec 17, 2023

github-actions bot removed the staled-issue label Dec 19, 2023

github-actions bot added the staled-issue label Jan 2, 2024

sabilmakbar removed their assignment Jan 3, 2024

sabilmakbar added help wanted Extra attention is needed bonus +1 and removed staled-issue labels Jan 3, 2024

github-actions bot assigned khelli07 Jan 4, 2024

github-actions bot added the staled-issue label Jan 19, 2024

github-actions bot removed the staled-issue label Jan 20, 2024

github-actions bot added the staled-issue label Feb 4, 2024

github-actions bot removed the staled-issue label Feb 6, 2024

sabilmakbar added the in-progress Assignee has given confirmation on progress and ETA label Feb 25, 2024

khelli07 mentioned this issue Mar 10, 2024

Closes #48 | Create dataset loader for SREDFM #495

Merged

8 tasks

holylovenia added pr-ready A PR that closes this issue is Ready to be reviewed and removed help wanted Extra attention is needed in-progress Assignee has given confirmation on progress and ETA labels Mar 11, 2024

holylovenia closed this as completed in #495 Apr 27, 2024

github-project-automation bot moved this to Done in SEACrowd Data Hub Apr 27, 2024
