
There are multiple 'mteb/arguana' configurations in the cache: default, corpus, queries with HF_HUB_OFFLINE=1 #1714

Open
Bhavya6187 opened this issue Jan 6, 2025 · 3 comments

Bhavya6187 commented Jan 6, 2025

Hey folks,

I am trying to run this code:

import mteb
tasks = mteb.get_tasks(tasks=["ArguAna"])
task = tasks[0]
task.load_data()

with HF_HUB_OFFLINE=1
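
(For reference, a sketch of how the flag is set on my side; HF_HUB_OFFLINE is read when huggingface_hub/datasets are imported, so it has to be in the environment before those imports run. The script name is just an example.)

# Either export it before starting the interpreter:
#   HF_HUB_OFFLINE=1 python repro.py
# or set it from Python before importing mteb/datasets:
import os
os.environ["HF_HUB_OFFLINE"] = "1"

import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
task = tasks[0]
task.load_data()  # raises the ValueError shown below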

But I get the following error:

----> 4 task.load_data()

File ~/env/lib/python3.10/site-packages/mteb/abstasks/AbsTaskRetrieval.py:283, in AbsTaskRetrieval.load_data(self, **kwargs)
    281 print("dataset_path", dataset_path)
    282 print("hf_repo_qrels", hf_repo_qrels)
--> 283 corpus, queries, qrels = HFDataLoader(
    284     hf_repo=dataset_path,
    285     hf_repo_qrels=hf_repo_qrels,
    286     streaming=False,
    287     keep_in_memory=False,
    288     trust_remote_code=self.metadata_dict["dataset"].get(
    289         "trust_remote_code", False
    290     ),
    291 ).load(split=split)
    292 # Conversion from DataSet
    293 queries = {query["id"]: query["text"] for query in queries}

File ~/env/lib/python3.10/site-packages/mteb/abstasks/AbsTaskRetrieval.py:96, in HFDataLoader.load(self, split)
     93     logger.info("Loading Queries...")
     94     self._load_queries()
---> 96 self._load_qrels(split)
     97 # filter queries with no qrels
     98 qrels_dict = defaultdict(dict)

File ~/env/lib/python3.10/site-packages/mteb/abstasks/AbsTaskRetrieval.py:177, in HFDataLoader._load_qrels(self, split)
    175 def _load_qrels(self, split):
    176     if self.hf_repo:
--> 177         qrels_ds = load_dataset(
    178             self.hf_repo_qrels,
    179             keep_in_memory=self.keep_in_memory,
    180             streaming=self.streaming,
    181             trust_remote_code=self.trust_remote_code,
    182         )[split]
    183     else:
    184         qrels_ds = load_dataset(
    185             "csv",
    186             data_files=self.qrels_file,
    187             delimiter="\t",
    188             keep_in_memory=self.keep_in_memory,
    189         )

File ~/env/lib/python3.10/site-packages/datasets/load.py:2606, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2601 verification_mode = VerificationMode(
   2602     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2603 )
   2605 # Create a dataset builder
-> 2606 builder_instance = load_dataset_builder(
   2607     path=path,
   2608     name=name,
   2609     data_dir=data_dir,
   2610     data_files=data_files,
   2611     cache_dir=cache_dir,
   2612     features=features,
   2613     download_config=download_config,
   2614     download_mode=download_mode,
   2615     revision=revision,
   2616     token=token,
   2617     storage_options=storage_options,
   2618     trust_remote_code=trust_remote_code,
   2619     _require_default_config_name=name is None,
   2620     **config_kwargs,
   2621 )
   2623 # Return iterable dataset in case of streaming
   2624 if streaming:

File ~/env/lib/python3.10/site-packages/datasets/load.py:2314, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   2312 builder_cls = get_dataset_builder_class(dataset_module, dataset_name=dataset_name)
   2313 # Instantiate the dataset builder
-> 2314 builder_instance: DatasetBuilder = builder_cls(
   2315     cache_dir=cache_dir,
   2316     dataset_name=dataset_name,
   2317     config_name=config_name,
   2318     data_dir=data_dir,
   2319     data_files=data_files,
   2320     hash=dataset_module.hash,
   2321     info=info,
   2322     features=features,
   2323     token=token,
   2324     storage_options=storage_options,
   2325     **builder_kwargs,
   2326     **config_kwargs,
   2327 )
   2328 builder_instance._use_legacy_cache_dir_if_possible(dataset_module)
   2330 return builder_instance

File ~/env/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py:140, in Cache.__init__(self, cache_dir, dataset_name, config_name, version, hash, base_path, info, features, token, use_auth_token, repo_id, data_files, data_dir, storage_options, writer_batch_size, name, **config_kwargs)
    138     config_kwargs["data_dir"] = data_dir
    139 if hash == "auto" and version == "auto":
--> 140     config_name, version, hash = _find_hash_in_cache(
    141         dataset_name=repo_id or dataset_name,
    142         config_name=config_name,
    143         cache_dir=cache_dir,
    144         config_kwargs=config_kwargs,
    145         custom_features=features,
    146     )
    147 elif hash == "auto" or version == "auto":
    148     raise NotImplementedError("Pass both hash='auto' and version='auto' instead")

File ~/env/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py:85, in _find_hash_in_cache(dataset_name, config_name, cache_dir, config_kwargs, custom_features)
     73 other_configs = [
     74     Path(_cached_directory_path).parts[-3]
     75     for _cached_directory_path in glob.glob(os.path.join(cached_datasets_directory_path_root, "*", version, hash))
   (...)
     82     )
     83 ]
     84 if not config_id and len(other_configs) > 1:
---> 85     raise ValueError(
     86         f"There are multiple '{dataset_name}' configurations in the cache: {', '.join(other_configs)}"
     87         f"\nPlease specify which configuration to reload from the cache, e.g."
     88         f"\n\tload_dataset('{dataset_name}', '{other_configs[0]}')"
     89     )
     90 config_name = cached_directory_path.parts[-3]
     91 warning_msg = (
     92     f"Found the latest cached dataset configuration '{config_name}' at {cached_directory_path} "
     93     f"(last modified on {time.ctime(_get_modification_time(cached_directory_path))})."
     94 )

ValueError: There are multiple 'mteb/arguana' configurations in the cache: queries, corpus, default
Please specify which configuration to reload from the cache, e.g.
        load_dataset('mteb/arguana', 'queries')

The same code works with HF_HUB_OFFLINE=0, but once the data has been downloaded and I switch to offline mode with HF_HUB_OFFLINE=1, this error appears.

Is there somewhere else that MTEB fetches data from, which I would also need to cache?
I am using MTEB 1.26.4 and datasets 2.21.0.
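
(For context, a sketch of how I checked the local datasets cache; the cache root below is the default location and the directory name follows the usual namespace___dataset convention, so both are assumptions about my setup:)

import glob
import os

cache_root = os.path.expanduser("~/.cache/huggingface/datasets")
# The three configurations from the error show up as sibling directories,
# e.g. mteb___arguana/default, mteb___arguana/corpus, mteb___arguana/queries
print(glob.glob(os.path.join(cache_root, "mteb___arguana", "*")))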

KennethEnevoldsen (Contributor) commented

"Is there somewhere else that MTEB fetches data from, which I would also need to cache?"

No, mteb calls load_dataset directly when loading a dataset.

You might want to check that this works when calling datasets directly:

ds = load_dataset("mteb/arguana")
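
If that reproduces the same ValueError offline, the error message itself suggests passing the configuration name explicitly. A rough sketch (the config names are taken from the error message; which one actually holds the qrels is a guess, not something I have verified):

from datasets import load_dataset

# With HF_HUB_OFFLINE=1 the cache resolver cannot choose between the three
# cached configurations, so name each one explicitly:
corpus = load_dataset("mteb/arguana", "corpus")
queries = load_dataset("mteb/arguana", "queries")
qrels = load_dataset("mteb/arguana", "default")  # assumption: qrels live in "default"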

Bhavya6187 (Author) commented

@KennethEnevoldsen Thanks for such a quick response, and you are correct! I get the same error when I call datasets directly. Do you have any recommendations on how to get around it?

In this case, would the 'default' configuration be the correct one? I can make the package editable and work around it by catching the exception.
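
(To figure out which configuration is the right one before patching anything, I could inspect each cached config; a quick sketch:)

from datasets import load_dataset

# Load each configuration named in the error and print its columns; the one
# with qrels-style columns (e.g. query-id / corpus-id / score) should be the qrels config.
for name in ("default", "corpus", "queries"):
    ds = load_dataset("mteb/arguana", name)
    print(name, {split: ds[split].column_names for split in ds})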

KennethEnevoldsen (Contributor) commented

Not sure, no, but others might have an answer. We would love to support offline evaluation, so I will keep this issue open (also related to #1701).
