Create dataset loader for CLIRMatrix #426

Closed
SamuelCahyawijaya opened this issue Feb 13, 2024 · 10 comments · Fixed by #650
Labels: pr-ready (a PR that closes this issue is ready to be reviewed), source-only

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Feb 13, 2024

Dataloader name: clir_matrix/clir_matrix.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?clir_matrix

| Dataset | clir_matrix |
| --- | --- |
| Description | CLIRMatrix is a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. |
| Subsets | - |
| Languages | tgl, ilo, min, jav, sun, ceb, vie, tha |
| Tasks | Text Retrieval |
| License | Unknown (unknown) |
| Homepage | https://github.com/ssun32/CLIRMatrix |
| HF URL | - |
| Paper URL | https://aclanthology.org/2020.emnlp-main.340/ |
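
The 139x138 figure in the description counts directed pairs: each ordered (query language, document language) combination with two distinct languages. A minimal sketch of that count follows. The helper name `directed_pairs` is hypothetical, and whether the dataloader restricts itself to pairs among the eight listed languages is an assumption made only for illustration.

```python
from itertools import permutations


def directed_pairs(languages):
    """Return every ordered (query_lang, doc_lang) pair of distinct languages."""
    return list(permutations(languages, 2))


# n languages give n * (n - 1) directed pairs, so 139 languages give 139 * 138.
bi_139_pairs = 139 * 138

# Language codes copied from the issue; restricting to them is an assumption.
sea_languages = ["tgl", "ilo", "min", "jav", "sun", "ceb", "vie", "tha"]
sea_pairs = directed_pairs(sea_languages)
```

Because the pairs are ordered, `("tgl", "vie")` and `("vie", "tgl")` are separate directions.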
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Feb 13, 2024
@fhudi
Collaborator

fhudi commented Feb 16, 2024

#self-assign


github-actions bot commented Mar 2, 2024

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@fhudi
Collaborator

fhudi commented Mar 2, 2024

To whom it may concern, regarding the github-actions reminder above:
I apologise for the delay and any inconvenience it has caused.
I am still working on this issue; please give me some more time.
Regards.

@akhdanfadh
Collaborator

@fhudi may we know if you are still working on this issue? It has already been one month since your last update.

@fhudi
Collaborator

fhudi commented Apr 1, 2024

Hi @akhdanfadh, thanks for the reminder.

Some files in the raw dataset turned out to be empty.
I was in the process of downloading every combination for the 9 supported languages
to check them and then ask the author for clarification, but I forgot about it halfway through.


Hi @ssun32,
I tried to create a dataloader for your dataset, but it seems that some files are empty.
Could you please help check the BI-139 files for all queries in Indonesian (id), i.e. id → *? 🙏

image

Also, regarding the license: it is currently listed as unknown,
but the footnote on your CLIRMatrix site says cc-by-4.0.
Which one is correct? 🙏

@ssun32
Contributor

ssun32 commented Apr 1, 2024

@fhudi It turns out there is zero overlap of the id-xx examples with the examples in the other language directions, probably due to incomplete Wikidata entries for ID when I created the dataset a few years ago. I recommend throwing away the language directions with empty files. Thanks for spotting the issue!
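
One way to act on that recommendation is to keep only the files that contain at least one record. The sketch below is a rough illustration, not the merged dataloader: the helper names `has_records` and `usable_directions` are hypothetical, and it assumes one file per language direction, either plain text or gzip-compressed with a `.gz` suffix.

```python
import gzip
from pathlib import Path
from typing import Iterable, List


def has_records(path: Path) -> bool:
    """Return True if the file contains at least one non-blank line."""
    # Assumption: compressed files are recognisable by their .gz suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return any(line.strip() for line in handle)


def usable_directions(files: Iterable[Path]) -> List[Path]:
    """Drop files (language directions) that hold no records."""
    return [path for path in files if has_records(path)]
```

Checking for a non-blank line rather than the file size matters for compressed files, because a gzip archive of empty content still has a non-zero size on disk.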

@fhudi
Collaborator

fhudi commented Apr 1, 2024

thanks @ssun32.
What about the license?


@SamuelCahyawijaya, Need help 🙏
Shall we just remove support for the language ind?

@holylovenia
Contributor

@fhudi Removed ind from both the issue ticket and the datasheet.

@fhudi
Collaborator

fhudi commented Apr 9, 2024

@holylovenia @SamuelCahyawijaya


The dataset's task is TEXT_RETRIEVAL, so according to constants.py the seacrowd schema for this dataset is PAIRS.

However, that schema seems incorrect here, because each triplet contains a discrete numerical relevance score, defined as follows:
image

Although PAIRS_SCORE seems more fitting, the dataset is formatted for the IR (Information Retrieval) task, where a single triplet holds multiple document IDs and their relevance scores.
Note that PAIRS_MULTI is categorical and hence unfit.
image


I think we are not going to support the IR task format, right?
Because if we do, it will be problematic to load the whole document texts instead of IDs in the dataloader.

One immediate solution I can think of, without many changes to the shared classes,
is to represent the relevance score as categorical.

But WDYT?
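
To make the categorical idea concrete, here is a minimal sketch that flattens one IR-style record into per-document rows and stores each discrete score as a string label. Everything named here is an assumption for illustration: the input keys `src_id`, `src_query`, and `tgt_results`, the output keys `id`, `text_1`, `text_2`, and `label`, and the helper `flatten_to_pairs` are hypothetical and are not taken from constants.py or from the dataset files.

```python
from typing import Dict, Iterator


def flatten_to_pairs(record: dict) -> Iterator[Dict[str, str]]:
    """Yield one row per (query, document ID) pair from a single IR record."""
    for doc_id, score in record["tgt_results"]:
        yield {
            "id": f'{record["src_id"]}_{doc_id}',
            "text_1": record["src_query"],
            "text_2": doc_id,  # the document ID only, not the document text
            "label": str(score),  # discrete score kept as a categorical string
        }


# Hypothetical record: one query with two scored document IDs.
example = {
    "src_id": "q1",
    "src_query": "contoh kueri",
    "tgt_results": [["d10", 5], ["d42", 1]],
}
rows = list(flatten_to_pairs(example))
```

The trade-off is visible in `text_2`: the rows carry document IDs rather than document texts, which is the limitation described above.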

@holylovenia
Contributor

Hi @fhudi, I agree with you. Let's do source-only for this dataloader. 👍

fhudi added a commit to fhudi/seacrowd-datahub that referenced this issue Apr 17, 2024
@holylovenia holylovenia added the pr-ready A PR that closes this issue is Ready to be reviewed label Apr 22, 2024
fhudi added a commit to fhudi/seacrowd-datahub that referenced this issue May 20, 2024
muhammadravi251001 added a commit that referenced this issue May 30, 2024
* Create dataset loader for CLIRMatrix (#426)

* Removing supported task for source-only in clir_matrix.py

* Adding split comment for CLIRMatrix (#426)

* Add comment for explaining test2

* do make formatter

---------

Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>
Projects
Status: Done
5 participants