Create dataset loader for XL-Sum #32

Closed
SamuelCahyawijaya opened this issue Nov 9, 2023 · 9 comments · Fixed by #498
Labels: bonus +1; pr-ready (A PR that closes this issue is ready to be reviewed); top-priority (Needs to get done ASAP for the experiments)

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Nov 9, 2023

Dataloader name: xl_sum/xl_sum.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xl_sum

Dataset: xl_sum
Description: XL-Sum is a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low- to high-resource, including 4 indigenous languages spoken in the Southeast Asian region.
Subsets: XL-Sum Burmese, XL-Sum Indonesian, XL-Sum Thai, XL-Sum Vietnamese
Languages: mya, ind, tha, vie
Tasks: Abstractive Summarization
License: Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage: https://github.com/csebuetnlp/xl-sum
HF URL: https://huggingface.co/datasets/csebuetnlp/xlsum
Paper URL: https://aclanthology.org/2021.findings-acl.413/
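
For orientation, the four requested subsets could be enumerated and loaded from the HF URL above roughly as follows. This is only a minimal sketch, not the SEACrowd implementation: the lowercase language names are assumed to match the Hugging Face config names, the `xl_sum_<code>` subset ids and both helper functions are hypothetical, and loading requires the `datasets` package plus network access (behavior may differ across `datasets` versions).

```python
# ISO 639-3 codes from the issue, mapped to XL-Sum language names.
# Assumption: the lowercase names match the Hugging Face config names.
LANG_TO_HF_CONFIG = {
    "mya": "burmese",
    "ind": "indonesian",
    "tha": "thai",
    "vie": "vietnamese",
}


def subset_names(dataset_name="xl_sum"):
    """Return hypothetical per-language subset ids, e.g. 'xl_sum_mya'."""
    return [f"{dataset_name}_{code}" for code in LANG_TO_HF_CONFIG]


def load_subset(lang_code, split="train"):
    """Load one XL-Sum language split from the Hugging Face Hub.

    Needs the `datasets` package and network access, so the import is
    done lazily and this function is not called at module import time.
    """
    from datasets import load_dataset

    return load_dataset("csebuetnlp/xlsum", LANG_TO_HF_CONFIG[lang_code], split=split)


if __name__ == "__main__":
    # Only prints the subset ids; call load_subset("ind") to fetch data.
    print(subset_names())
```

Keeping the language mapping in one dictionary means adding or removing a subset is a one-line change.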
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Nov 9, 2023
@rmahendra

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@rmahendra

Hi, I'm still willing to work on this issue. However, I am quite busy at the moment. I'll try to open a PR by the end of this month.

@holylovenia holylovenia moved this to In Progress in SEACrowd Data Hub Dec 10, 2023
@sabilmakbar sabilmakbar added in-progress Assignee has given confirmation on progress and ETA and removed no-issue-activity labels Dec 10, 2023
@holylovenia
Contributor

Okay then, @rmahendra. Feel free to let us know if you need any help!

@sabilmakbar
Collaborator

Hi @rmahendra, have you had time to implement this dataloader?

@holylovenia holylovenia removed the in-progress Assignee has given confirmation on progress and ETA label Feb 26, 2024
@sabilmakbar sabilmakbar self-assigned this Feb 29, 2024
@sabilmakbar
Collaborator

Btw, we already have xl_sum in SEACrowd, but only for the ID-EN pair. I will extend that script to cover the other languages.


github-actions bot commented Mar 3, 2024

Hi @sabilmakbar, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@holylovenia holylovenia added bonus +1 top-priority Needs to get done ASAP for the experiments labels Mar 12, 2024
@holylovenia
Contributor

Adding top-priority and bonus+1 because we need this dataloader for the experiments.

@sabilmakbar
Collaborator

Hi @holylovenia, this dataset's license is supposed to be CC-BY-NC-SA 4.0 (the datacard is missing the NC info).

@sabilmakbar sabilmakbar added the pr-ready A PR that closes this issue is Ready to be reviewed label Mar 12, 2024
SamuelCahyawijaya added a commit that referenced this issue Mar 17, 2024
Closes #32 (High Prio) | Extend XL Sum Dataloaders to SEA Langs
Projects
Status: Done
4 participants