Create dataset loader for CLIRMatrix #426
Comments
#self-assign
Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
To whoever it may concern, in regards to the above github-actions' reminder,
@fhudi may we know if you are still working on this issue? It has already been one month since your last update.
Hi @akhdanfadh, thanks for the reminder. Some files in the raw dataset turned out to be empty. Hi @ssun32, And also, regarding the license, it seems to be of
@fhudi It turns out there is zero overlap of the id-xx examples with the examples in the other language directions, probably due to incomplete Wikidata entries for ID when I created the dataset a few years ago. I recommend throwing away the language directions with empty files. Thanks for spotting the issue!
Thanks @ssun32. @SamuelCahyawijaya, need help 🙏
@fhudi Removed
@holylovenia @SamuelCahyawijaya The dataset's task is TEXT_RETRIEVAL, so the seacrowd schema for this dataset is PAIRS as noted in the However, that schema seems incorrect, since each triplet contains a discrete numerical relevance score, defined as follows: Although PAIRS_SCORE seems more fitting, the dataset is formatted for the IR (Information Retrieval) task, where a single triplet holds multiple document ids and their relevance scores. I think we are not going to support the format of the IR task, right? One immediate solution that I can think of, without many changes to the shared classes, But WDYT?
Hi @fhudi, I agree with you. Let's do
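For readers following the schema question above, here is a minimal sketch of one way an IR-style record (one query with several document ids and relevance scores) could be flattened into pair-with-score rows. It is not necessarily the approach agreed on in this thread. The input keys (`src_id`, `src_query`, `tgt_results`) and the output keys (`id`, `text_1`, `text_2`, `label`) are assumptions chosen for illustration, and `flatten_ir_record` is a hypothetical helper, not part of the seacrowd codebase.

```python
from typing import Any, Dict, Iterator


def flatten_ir_record(record: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one pair-with-score row per (query, document) result.

    Assumed (hypothetical) input layout:
        {"src_id": str, "src_query": str, "tgt_results": [[doc_id, relevance], ...]}
    """
    for rank, (doc_id, relevance) in enumerate(record["tgt_results"]):
        yield {
            # Unique row id built from the query id and the result position.
            "id": f'{record["src_id"]}_{rank}',
            "text_1": record["src_query"],  # the query text
            "text_2": str(doc_id),          # the target document id
            "label": int(relevance),        # discrete relevance score
        }


# Made-up example record, for illustration only.
example = {
    "src_id": "q1",
    "src_query": "ibu kota Indonesia",
    "tgt_results": [["d10", 5], ["d42", 0]],
}
rows = list(flatten_ir_record(example))
```

Flattening this way keeps every relevance judgment but loses the grouping of results per query, so a consumer that needs ranked lists would have to regroup rows by the query id prefix.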
* Create dataset loader for CLIRMatrix (#426)
* Removing supported task for source-only in clir_matrix.py
* Adding split comment for CLIRMatrix (#426)
* Add comment for explaining test2
* do make formatter
---------
Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>
Dataloader name: clir_matrix/clir_matrix.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?clir_matrix