
There are multiple 'mteb/arguana' configurations in the cache: default, corpus, queries with HF_HUB_OFFLINE=1 #1714

Open
Bhavya6187 opened this issue Jan 6, 2025 · 3 comments

Bhavya6187 commented Jan 6, 2025

Hey folks,

I am trying to run this code:

import mteb
tasks = mteb.get_tasks(tasks=["ArguAna"])
task = tasks[0]
task.load_data()

with HF_HUB_OFFLINE=1
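
(For reference, a sketch of how the flag is set on my side; HF_HUB_OFFLINE is read when huggingface_hub/datasets are imported, so it has to be in the environment before those imports run. The script name is just an example.)

# Either export it before starting the interpreter:
#   HF_HUB_OFFLINE=1 python repro.py
# or set it from Python before importing mteb/datasets:
import os
os.environ["HF_HUB_OFFLINE"] = "1"

import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
task = tasks[0]
task.load_data()  # raises the ValueError shown below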

But I get the following error:

----> 4 task.load_data()

File ~/env/lib/python3.10/site-packages/mteb/abstasks/AbsTaskRetrieval.py:283, in AbsTaskRetrieval.load_data(self, **kwargs)
    281 print("dataset_path", dataset_path)
    282 print("hf_repo_qrels", hf_repo_qrels)
--> 283 corpus, queries, qrels = HFDataLoader(
    284     hf_repo=dataset_path,
    285     hf_repo_qrels=hf_repo_qrels,
    286     streaming=False,
    287     keep_in_memory=False,
    288     trust_remote_code=self.metadata_dict["dataset"].get(
    289         "trust_remote_code", False
    290     ),
    291 ).load(split=split)
    292 # Conversion from DataSet
    293 queries = {query["id"]: query["text"] for query in queries}

File ~/env/lib/python3.10/site-packages/mteb/abstasks/AbsTaskRetrieval.py:96, in HFDataLoader.load(self, split)
     93     logger.info("Loading Queries...")
     94     self._load_queries()
---> 96 self._load_qrels(split)
     97 # filter queries with no qrels
     98 qrels_dict = defaultdict(dict)

File ~/env/lib/python3.10/site-packages/mteb/abstasks/AbsTaskRetrieval.py:177, in HFDataLoader._load_qrels(self, split)
    175 def _load_qrels(self, split):
    176     if self.hf_repo:
--> 177         qrels_ds = load_dataset(
    178             self.hf_repo_qrels,
    179             keep_in_memory=self.keep_in_memory,
    180             streaming=self.streaming,
    181             trust_remote_code=self.trust_remote_code,
    182         )[split]
    183     else:
    184         qrels_ds = load_dataset(
    185             "csv",
    186             data_files=self.qrels_file,
    187             delimiter="\t",
    188             keep_in_memory=self.keep_in_memory,
    189         )

File ~/env/lib/python3.10/site-packages/datasets/load.py:2606, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2601 verification_mode = VerificationMode(
   2602     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2603 )
   2605 # Create a dataset builder
-> 2606 builder_instance = load_dataset_builder(
   2607     path=path,
   2608     name=name,
   2609     data_dir=data_dir,
   2610     data_files=data_files,
   2611     cache_dir=cache_dir,
   2612     features=features,
   2613     download_config=download_config,
   2614     download_mode=download_mode,
   2615     revision=revision,
   2616     token=token,
   2617     storage_options=storage_options,
   2618     trust_remote_code=trust_remote_code,
   2619     _require_default_config_name=name is None,
   2620     **config_kwargs,
   2621 )
   2623 # Return iterable dataset in case of streaming
   2624 if streaming:

File ~/env/lib/python3.10/site-packages/datasets/load.py:2314, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   2312 builder_cls = get_dataset_builder_class(dataset_module, dataset_name=dataset_name)
   2313 # Instantiate the dataset builder
-> 2314 builder_instance: DatasetBuilder = builder_cls(
   2315     cache_dir=cache_dir,
   2316     dataset_name=dataset_name,
   2317     config_name=config_name,
   2318     data_dir=data_dir,
   2319     data_files=data_files,
   2320     hash=dataset_module.hash,
   2321     info=info,
   2322     features=features,
   2323     token=token,
   2324     storage_options=storage_options,
   2325     **builder_kwargs,
   2326     **config_kwargs,
   2327 )
   2328 builder_instance._use_legacy_cache_dir_if_possible(dataset_module)
   2330 return builder_instance

File ~/env/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py:140, in Cache.__init__(self, cache_dir, dataset_name, config_name, version, hash, base_path, info, features, token, use_auth_token, repo_id, data_files, data_dir, storage_options, writer_batch_size, name, **config_kwargs)
    138     config_kwargs["data_dir"] = data_dir
    139 if hash == "auto" and version == "auto":
--> 140     config_name, version, hash = _find_hash_in_cache(
    141         dataset_name=repo_id or dataset_name,
    142         config_name=config_name,
    143         cache_dir=cache_dir,
    144         config_kwargs=config_kwargs,
    145         custom_features=features,
    146     )
    147 elif hash == "auto" or version == "auto":
    148     raise NotImplementedError("Pass both hash='auto' and version='auto' instead")

File ~/env/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py:85, in _find_hash_in_cache(dataset_name, config_name, cache_dir, config_kwargs, custom_features)
     73 other_configs = [
     74     Path(_cached_directory_path).parts[-3]
     75     for _cached_directory_path in glob.glob(os.path.join(cached_datasets_directory_path_root, "*", version, hash))
   (...)
     82     )
     83 ]
     84 if not config_id and len(other_configs) > 1:
---> 85     raise ValueError(
     86         f"There are multiple '{dataset_name}' configurations in the cache: {', '.join(other_configs)}"
     87         f"\nPlease specify which configuration to reload from the cache, e.g."
     88         f"\n\tload_dataset('{dataset_name}', '{other_configs[0]}')"
     89     )
     90 config_name = cached_directory_path.parts[-3]
     91 warning_msg = (
     92     f"Found the latest cached dataset configuration '{config_name}' at {cached_directory_path} "
     93     f"(last modified on {time.ctime(_get_modification_time(cached_directory_path))})."
     94 )

ValueError: There are multiple 'mteb/arguana' configurations in the cache: queries, corpus, default
Please specify which configuration to reload from the cache, e.g.
        load_dataset('mteb/arguana', 'queries')

The same code works with HF_HUB_OFFLINE=0, but once the data has been downloaded and I switch to offline mode with HF_HUB_OFFLINE=1, this error appears.

Is there somewhere else that MTEB fetches data from, which I would also need to cache?
I am using MTEB 1.26.4 and datasets 2.21.0.
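
(For context, a sketch of how I checked the local datasets cache; the cache root below is the default location and the directory name follows the usual namespace___dataset convention, so both are assumptions about my setup:)

import glob
import os

cache_root = os.path.expanduser("~/.cache/huggingface/datasets")
# The three configurations from the error show up as sibling directories,
# e.g. mteb___arguana/default, mteb___arguana/corpus, mteb___arguana/queries
print(glob.glob(os.path.join(cache_root, "mteb___arguana", "*")))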

KennethEnevoldsen (Contributor) commented

"Is there somewhere else that MTEB fetches data from, which I would also need to cache?"

No, mteb calls load_dataset directly when loading a dataset.

You might want to check that this works when calling datasets directly:

ds = load_dataset("mteb/arguana")
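
If that reproduces the same ValueError offline, the error message itself suggests passing the configuration name explicitly. A rough sketch (the config names are taken from the error message; which one actually holds the qrels is a guess, not something I have verified):

from datasets import load_dataset

# With HF_HUB_OFFLINE=1 the cache resolver cannot choose between the three
# cached configurations, so name each one explicitly:
corpus = load_dataset("mteb/arguana", "corpus")
queries = load_dataset("mteb/arguana", "queries")
qrels = load_dataset("mteb/arguana", "default")  # assumption: qrels live in "default"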

Bhavya6187 (Author) commented

@KennethEnevoldsen Thanks for such a quick response, and you are correct! I get the same error when I call datasets directly. Do you have any recommendations on how to get around it?

In this case, would the 'default' configuration be the correct one? I can make the package editable and work around it by catching the exception.
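
(To figure out which configuration is the right one before patching anything, I could inspect each cached config; a quick sketch:)

from datasets import load_dataset

# Load each configuration named in the error and print its columns; the one
# with qrels-style columns (e.g. query-id / corpus-id / score) should be the qrels config.
for name in ("default", "corpus", "queries"):
    ds = load_dataset("mteb/arguana", name)
    print(name, {split: ds[split].column_names for split in ds})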

KennethEnevoldsen (Contributor) commented

Not sure, no, but others might have an answer. We would love to support offline evaluation, so I will keep this issue open (also related to #1701).
