Create dataset loader for CLIRMatrix #426

SamuelCahyawijaya · 2024-02-13T02:26:50Z

Dataloader name: clir_matrix/clir_matrix.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?clir_matrix

Dataset	clir_matrix
Description	CLIRMatrix is a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages.
Subsets	-
Languages	tgl, ilo, min, jav, sun, ceb, vie, tha
Tasks	Text Retrieval
License	Unknown (unknown)
Homepage	https://github.com/ssun32/CLIRMatrix
HF URL	-
Paper URL	https://aclanthology.org/2020.emnlp-main.340/


fhudi · 2024-02-16T12:07:41Z

#self-assign

github-actions · 2024-03-02T01:54:47Z

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

fhudi · 2024-03-02T11:52:18Z

To whom it may concern, regarding the github-actions reminder above:
I apologise for the delay and any inconvenience caused.
I am still working on this issue; please give me some more time.
Regards.

akhdanfadh · 2024-04-01T00:19:28Z

@fhudi may we know if you are still working on this issue? It has already been one month since your last update.

fhudi · 2024-04-01T08:00:04Z

Hi @akhdanfadh, thanks for the reminder.

Some files in the raw dataset turned out to be empty.
I was in the process of downloading every combination of the 9 supported languages
to check them and then ask the author for clarification, but I forgot about it halfway.

Hi @ssun32,
I tried to create a dataloader for your dataset, but some of the files seem to be empty.
Could you please help check BI-139 for all queries in Indonesian (id), i.e. id → *? 🙏

Also, regarding the license: it is currently recorded as unknown,
but the footnote on your CLIRMatrix site suggests it is cc-by-4.0.
Which one is correct? 🙏

ssun32 · 2024-04-01T08:24:19Z

@fhudi It turns out there is zero overlap of the id-xx examples with the examples in the other language directions, probably due to incomplete Wikidata entries for ID when I created the dataset a few years ago. I recommend throwing away the language directions with empty files. Thanks for spotting the issue!
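The recommendation above (dropping language directions whose files are empty) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `has_records` and `usable_directions`, the direction keys, and the idea that each direction lives in one local file (plain or gzip-compressed) are hypothetical and not taken from the actual dataloader.

```python
import gzip
from pathlib import Path


def has_records(path: Path) -> bool:
    """Return True if the file contains at least one non-blank line."""
    # Assumption: files are either plain text or gzip-compressed text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return any(line.strip() for line in handle)


def usable_directions(files: dict[str, Path]) -> dict[str, Path]:
    """Keep only the language directions whose file has records."""
    return {name: path for name, path in files.items() if has_records(path)}
```

Checking file content rather than file size is a deliberate choice here, since a compressed file with no records may still have a non-zero size on disk.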

fhudi · 2024-04-01T09:59:02Z

Thanks, @ssun32.
What about the license?

@SamuelCahyawijaya, I need help 🙏
Shall we just remove support for the ind language?

holylovenia · 2024-04-02T06:50:18Z

@fhudi I have removed ind from both the issue ticket and the datasheet.

fhudi · 2024-04-09T05:04:41Z

@holylovenia @SamuelCahyawijaya

The dataset's task is TEXT_RETRIEVAL, so the seacrowd schema for this dataset is PAIRS, as noted in constants.py.

However, the schema seems incorrect: the triplet contains a discrete numerical relevance score, defined as follows:

Although PAIRS_SCORE seems more fitting, the dataset is formatted for the IR (Information Retrieval) task, where a single triplet holds multiple document IDs and their relevance scores.
Note that PAIRS_MULTI is categorical and hence unfit.

I think we are not going to support the IR task format, right?
Because if we do, it will be problematic to load the whole document texts instead of IDs in the dataloader.

One immediate solution I can think of, without many changes to the shared classes,
is to represent the relevance score as categorical.

But WDYT?
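To make the mismatch concrete, here is a rough sketch of what flattening one IR-style record into per-document rows with a score might look like. The record layout (`src_id`, `src_query`, and `tgt_results` as a list of `[doc_id, score]` pairs), the `PairRow` type, and the `flatten` helper are all assumptions made for illustration; they are not the seacrowd schema classes or the dataloader's actual code.

```python
from typing import Iterator, TypedDict


class PairRow(TypedDict):
    query_id: str
    query: str
    doc_id: str
    relevance: int


def flatten(record: dict) -> Iterator[PairRow]:
    """Yield one (query, document ID, relevance) row per retrieved document."""
    # Assumed layout: record["tgt_results"] is a list of [doc_id, score] pairs.
    for doc_id, score in record["tgt_results"]:
        yield PairRow(
            query_id=str(record["src_id"]),
            query=record["src_query"],
            doc_id=str(doc_id),
            relevance=int(score),
        )
```

Note that each row still carries only a document ID, not the document text, which is the difficulty raised above for a pair-style schema.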

holylovenia · 2024-04-15T08:06:24Z

Hi @fhudi, I agree with you. Let's do source-only for this dataloader. 👍

* Create dataset loader for CLIRMatrix (#426)
* Removing supported task for source-only in clir_matrix.py
* Adding split comment for CLIRMatrix (#426)
* Add comment for explaining test2
* do make formatter

Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>

SamuelCahyawijaya added this to SEACrowd Data Hub Feb 13, 2024

SamuelCahyawijaya converted this from a draft issue Feb 13, 2024

github-actions bot assigned fhudi Feb 16, 2024

github-actions bot added the staled-issue label Mar 2, 2024

github-actions bot removed the staled-issue label Mar 3, 2024

github-actions bot added the staled-issue label Mar 18, 2024

github-actions bot removed the staled-issue label Apr 1, 2024

holylovenia added the source-only label Apr 15, 2024

fhudi added a commit to fhudi/seacrowd-datahub that referenced this issue Apr 17, 2024

Create dataset loader for CLIRMatrix (SEACrowd#426)

6d640b0

fhudi mentioned this issue Apr 17, 2024

Closes #426 | Create dataset loader for CLIRMatrix #650

Merged

8 tasks

holylovenia added the pr-ready label Apr 22, 2024

fhudi added a commit to fhudi/seacrowd-datahub that referenced this issue May 20, 2024

Adding split comment for CLIRMatrix (SEACrowd#426)

4e684d9

muhammadravi251001 closed this as completed in #650 May 30, 2024

github-project-automation bot moved this to Done in SEACrowd Data Hub May 30, 2024
