Create dataset loader for M3LS #228

SamuelCahyawijaya · 2023-12-26T03:34:51Z

Dataloader name: m3ls/m3ls.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?m3ls

Dataset	m3ls
Description	The multilingual multimodal summarization dataset (M3LS) consists of over a million instances of document-image pairs along with a professionally annotated multimodal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 total languages.
Subsets	-
Languages	ind
Tasks	Summarization
License	MIT (mit)
Homepage	https://github.com/anubhav-jangra/M3LS
HF URL	-
Paper URL	https://aclanthology.org/2023.eacl-main.263/

sedrickkeh · 2023-12-26T16:12:51Z

#self-assign

github-actions · 2024-01-10T02:06:34Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sedrickkeh · 2024-01-10T08:46:52Z

Working on it. Will try to finish this week

holylovenia · 2024-01-11T07:30:15Z

> Working on it. Will try to finish this week

No problem! Feel free to let us know anytime you would like to discuss.

github-actions · 2024-01-26T02:00:13Z

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar · 2024-03-22T05:10:41Z

#self-assign

sabilmakbar · 2024-04-20T14:57:12Z

As for this one, I'll try to work on it immediately after #449 is PR-ed.

sabilmakbar · 2024-04-22T15:16:43Z

Hi, it seems the dataset itself is exceedingly large (the zipped version is around 14 GB; I'm unsure about the actual size after unzipping, which I'm doing now).

Also, there is a foreseeable blocker in the Google Drive download step: passing the GDrive URL to either `datasets.DownloadManager` or `gdown.download` may not work out of the box. I'm trying to fix the issue and looking at other workarounds as well (last time I saw a possible workaround in the #206 discussion).

If none of them work, the last resort may be to change this into a local dataset.
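For reference, a minimal sketch of one common workaround: pulling the file ID out of a Google Drive share link and building a direct-download URL from it. This is not the approach confirmed in this thread or in #206. The helper names `gdrive_file_id` and `gdrive_direct_url` are hypothetical, and the `uc?export=download` URL form is an assumption about how Drive serves files; very large files may still hit a confirmation page that a plain HTTP download cannot get past.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Matches share links of the form .../file/d/<FILE_ID>/view
_PATH_ID = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def gdrive_file_id(url: str) -> Optional[str]:
    """Extract the file ID from a Google Drive URL (hypothetical helper)."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("drive.google.com"):
        return None
    match = _PATH_ID.search(parsed.path)
    if match:
        return match.group(1)
    # Fall back to links of the form .../uc?id=<FILE_ID> or .../open?id=<FILE_ID>
    ids = parse_qs(parsed.query).get("id")
    return ids[0] if ids else None


def gdrive_direct_url(url: str) -> Optional[str]:
    """Build a direct-download URL (assumed form) from a share link."""
    file_id = gdrive_file_id(url)
    if file_id is None:
        return None
    return f"https://drive.google.com/uc?export=download&id={file_id}"


# Placeholder share link; the ID below is made up for illustration.
share_link = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
print(gdrive_direct_url(share_link))
```

The resulting URL could then be handed to the download manager, or the extracted ID could be passed to `gdown.download` through its `id` argument. If neither route works for a 14 GB archive, falling back to a local dataset remains the safer option.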

sabilmakbar · 2024-04-22T15:35:39Z

Updates:

There are multiple files that can be used, but the description has no clear documentation of the contents or folder structure. We probably need to skim the paper or the scraper code to get some hints.

From what I inspected, this data probably contains much more info than we initially thought (which was only text summarization from articles). Idk if the images there can be stitched together to get a multimodal dataset, tho (it would be valuable if we can somehow pull it).

Any thoughts or ideas? @holylovenia @SamuelCahyawijaya

holylovenia · 2024-05-02T05:41:17Z

Thanks for inspecting this dataset, @sabilmakbar! I think this dataset is a multimodal dataset, precisely a multilingual multimodal summarization dataset. Or did you mean stitching up another multimodal dataset?

* add m3ls
* Update seacrowd/sea_datasets/m3ls/m3ls.py
* Apply suggestions from code review

  update to comply w/ `black` formatter

  Co-authored-by: Frederikus Hudi <[email protected]>
* Update m3ls.py
* Update m3ls.py
* Update m3ls.py

  following `black` formatter

---------

Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>

SamuelCahyawijaya added this to SEACrowd Data Hub Dec 26, 2023

SamuelCahyawijaya converted this from a draft issue Dec 26, 2023

github-actions bot assigned sedrickkeh Dec 26, 2023

github-actions bot added the staled-issue label Jan 10, 2024

github-actions bot removed the staled-issue label Jan 11, 2024

github-actions bot added the staled-issue label Jan 26, 2024

holylovenia unassigned sedrickkeh Mar 18, 2024

holylovenia removed the staled-issue label Mar 18, 2024

github-actions bot assigned sabilmakbar Mar 22, 2024

github-actions bot added the staled-issue label Apr 6, 2024

github-actions bot removed the staled-issue label Apr 21, 2024

holylovenia added the bonus +1 label May 2, 2024

sabilmakbar mentioned this issue May 16, 2024

Closes #228 | Add M3LS dataloader #675

Merged

8 tasks

github-actions bot added the staled-issue label May 17, 2024

fhudi closed this as completed in #675 May 31, 2024

github-project-automation bot moved this to Done in SEACrowd Data Hub May 31, 2024
