May I ask if MTEB supports evaluating downloaded data? #1701

Open
BtlWolf opened this issue Jan 4, 2025 · 4 comments

Comments

@BtlWolf

BtlWolf commented Jan 4, 2025

During evaluation, MTEB usually downloads the relevant task datasets, but I already have these datasets downloaded. Is there any way to specify the path to them?

@Samoed
Collaborator

Samoed commented Jan 4, 2025

MTEB downloads datasets using the `datasets` library, but there’s currently no way to specify a custom path for them.

@isaac-chung
Collaborator

Currently I don't believe there is a clear way. But we certainly welcome contributions!

The MTEB.run() method accepts kwargs, but I have not tried it this way.

kwargs: Additional arguments to be passed to `_run_eval` method and task.load_data.

For non-retrieval tasks, `AbsTask.load_data()` uses all kwargs in the TaskMetadata's dataset dict. Right now, most tasks specify "path" and "revision".

self.dataset = datasets.load_dataset(**self.metadata_dict["dataset"]) # type: ignore
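
To make the quoted line concrete, here is a minimal sketch of how unpacking the dataset dict forwards each key as a keyword argument. `fake_load_dataset` is a hypothetical stand-in for `datasets.load_dataset` so the snippet runs offline, and the dict values are placeholders rather than real task metadata.

```python
def fake_load_dataset(path, revision=None, **kwargs):
    """Hypothetical stand-in for datasets.load_dataset; it only echoes its arguments."""
    return {"path": path, "revision": revision, "extra": kwargs}


# Placeholder values; real tasks define their own "path" and "revision".
metadata_dict = {"dataset": {"path": "some-org/some-dataset", "revision": "abc123"}}

# Same call shape as the line quoted above: each key becomes a keyword argument.
loaded = fake_load_dataset(**metadata_dict["dataset"])
print(loaded)
```

Under this reading, whatever "path" holds is what the loader receives, which is why changing the dict is the workaround being considered.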

So perhaps, you could try either:

  1. Install MTEB in editable mode (`pip install -e .`) and change the dataset's path to point to the desired path, or
  2. Install MTEB normally, create a new class that inherits from the existing dataset class, and override its dataset dict, e.g.:
class NewDataset(AmazonPolarityClassification):
    metadata.dataset = {"path": YOUR_PATH}

Let us know if any of these work :)
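
A note on option 2: in a Python class body, `metadata` has to be bound before its attributes can be assigned, so the subclass needs its own copy of the parent's metadata first. The sketch below shows that pattern with hypothetical stand-ins (`FakeMetadata`, `FakeTask`) instead of MTEB's real classes. Whether the real TaskMetadata accepts this kind of assignment is an untested assumption, and the paths are placeholders.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for TaskMetadata; only the fields used here."""
    name: str
    dataset: dict = field(default_factory=dict)


class FakeTask:
    """Hypothetical stand-in for an MTEB task class."""
    metadata = FakeMetadata(
        name="FakeTask",
        dataset={"path": "some-org/some-dataset", "revision": "abc123"},
    )


class LocalFakeTask(FakeTask):
    # Copy first so the parent's metadata object is left untouched,
    # then point the dataset dict at a placeholder local directory.
    metadata = copy.deepcopy(FakeTask.metadata)
    metadata.dataset = {"path": "/path/to/local/dataset"}


print(LocalFakeTask.metadata.dataset)
print(FakeTask.metadata.dataset)
```

Copying before assigning keeps the original task usable alongside the local variant.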

@KennethEnevoldsen
Contributor

We could implement the following:

# one task
task = mteb.get_task("AmazonPolarityClassification", dataset_kwargs={...})

# more tasks at once
tasks = mteb.get_tasks(tasks=["AmazonPolarityClassification"], dataset_kwargs={"AmazonPolarityClassification": {...}})

However, I am unsure if we want users to be able to overwrite kwargs.
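
One way such a `dataset_kwargs` argument could behave is a plain dict merge in which user-supplied keys take precedence over the task's defaults. The helper below, `resolve_dataset_kwargs`, is purely hypothetical and not part of MTEB; it only illustrates the merge semantics, with placeholder values.

```python
def resolve_dataset_kwargs(default_dataset, dataset_kwargs=None):
    """Hypothetical helper: user-supplied keys override the task's defaults."""
    merged = dict(default_dataset)  # copy so the defaults are not mutated
    merged.update(dataset_kwargs or {})
    return merged


defaults = {"path": "some-org/some-dataset", "revision": "abc123"}
overrides = {"path": "/path/to/local/dataset"}

resolved = resolve_dataset_kwargs(defaults, overrides)
print(resolved)
```

With this merge, keys the user does not override (here "revision") are kept from the defaults, which is one design question an actual implementation would have to settle.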

@KennethEnevoldsen
Contributor

Related to #1714.
