
Hyperparameter search error with Ray tune #27598

Closed

Shamik-07 opened this issue Nov 20, 2023 · 6 comments · Fixed by #26499

@Shamik-07

System Info

Hello,

The version of Ray is 2.8.0 and the version of Transformers is 4.35.2.
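
For reference, the versions can be confirmed with a quick check (a minimal sketch; assumes both packages are importable in the active environment):

import ray
import transformers

print(ray.__version__)           # 2.8.0 here
print(transformers.__version__)  # 4.35.2 here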

I am trying to run the hyperparameter search with Ray Tune for the notebook notebooks/examples/text_classification.ipynb from huggingface/notebooks, and I am getting the following error:

---------------------------------------------------------------------------
DeprecationWarning                        Traceback (most recent call last)
<ipython-input-33-12c3f54763db> in <cell line: 1>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

3 frames
/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py in with_parameters(trainable, **kwargs)
    313             )
    314 
--> 315             raise DeprecationWarning(_CHECKPOINT_DIR_ARG_DEPRECATION_MSG)
    316 
    317         def inner(config):

DeprecationWarning: Accepting a `checkpoint_dir` argument in your training function is deprecated.
Please use `ray.train.get_checkpoint()` to access your checkpoint as a
`ray.train.Checkpoint` object instead. See below for an example:

Before
------

from ray import tune

def train_fn(config, checkpoint_dir=None):
    if checkpoint_dir:
        torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
    ...

tuner = tune.Tuner(train_fn)
tuner.fit()

After
-----

from ray import train, tune

def train_fn(config):
    checkpoint: train.Checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
    ...

tuner = tune.Tuner(train_fn)
tuner.fit()

Who can help?

@muellerzr / @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running the hyperparameter search with Ray Tune, as shown below.
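
Concretely, the failing call is the one from the notebook (with the Trainer configured as in the text-classification example):

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")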

Expected behavior

Hyperparameter search trials run with Ray Tune.

@pacman100
Contributor

Hello, this PR #26499 might fix this issue. Could you please try it out and let us know?
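
If it helps, one way to try the PR from a Colab/Jupyter cell is to install Transformers directly from the PR ref (a sketch; refs/pull/<N>/head is GitHub's read-only alias for a pull request's branch):

!pip install "git+https://github.com/huggingface/transformers.git@refs/pull/26499/head"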

@muellerzr linked a pull request Nov 20, 2023 that will close this issue
@Shamik-07
Author

I tried running the notebook with the PR; however, I now get a different error:

2023-11-20 16:02:53,411	INFO worker.py:1673 -- Started a local Ray instance.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in put_object(self, value, object_ref, owner_address)
    702         try:
--> 703             serialized_value = self.get_serialization_context().serialize(value)
    704         except TypeError as e:

18 frames
/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in serialize(self, value)
    493         else:
--> 494             return self._serialize_to_msgpack(value)

/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in _serialize_to_msgpack(self, value)
    471             metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
--> 472             pickle5_serialized_object = self._serialize_to_pickle5(
    473                 metadata, python_objects

/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in _serialize_to_pickle5(self, metadata, value)
    424             self.get_and_clear_contained_object_refs()
--> 425             raise e
    426         finally:

/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in _serialize_to_pickle5(self, metadata, value)
    419             self.set_in_band_serialization()
--> 420             inband = pickle.dumps(
    421                 value, protocol=5, buffer_callback=writer.buffer_callback

/usr/local/lib/python3.10/dist-packages/ray/cloudpickle/cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     87             cp = CloudPickler(file, protocol=protocol, buffer_callback=buffer_callback)
---> 88             cp.dump(obj)
     89             return file.getvalue()

/usr/local/lib/python3.10/dist-packages/ray/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    732         try:
--> 733             return Pickler.dump(self, obj)
    734         except RuntimeError as e:

TypeError: cannot pickle '_thread.lock' object

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-38-12c3f54763db> in <cell line: 1>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

/content/transformers/src/transformers/trainer.py in hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   2548         self.compute_objective = default_compute_objective if compute_objective is None else compute_objective
   2549 
-> 2550         best_run = backend_obj.run(self, n_trials, direction, **kwargs)
   2551 
   2552         self.hp_search_backend = None

/content/transformers/src/transformers/hyperparameter_search.py in run(self, trainer, n_trials, direction, **kwargs)
     85 
     86     def run(self, trainer, n_trials: int, direction: str, **kwargs):
---> 87         return run_hp_search_ray(trainer, n_trials, direction, **kwargs)
     88 
     89     def default_hp_space(self, trial):

/content/transformers/src/transformers/integrations/integration_utils.py in run_hp_search_ray(trainer, n_trials, direction, **kwargs)
    352         dynamic_modules_import_trainable.__mixins__ = trainable.__mixins__
    353 
--> 354     analysis = ray.tune.run(
    355         dynamic_modules_import_trainable,
    356         config=trainer.hp_space(None),

/usr/local/lib/python3.10/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, storage_path, storage_filesystem, search_alg, scheduler, checkpoint_config, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, reuse_actors, raise_on_failed_trial, callbacks, max_concurrent_trials, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, chdir_to_trial_dir, local_dir, _experiment_checkpoint_dir, _remote, _remote_string_queue, _entrypoint)
    509         }
    510 
--> 511     _ray_auto_init(entrypoint=error_message_map["entrypoint"])
    512 
    513     if _remote is None:

/usr/local/lib/python3.10/dist-packages/ray/tune/tune.py in _ray_auto_init(entrypoint)
    217         logger.info("'TUNE_DISABLE_AUTO_INIT=1' detected.")
    218     elif not ray.is_initialized():
--> 219         ray.init()
    220         logger.info(
    221             "Initializing Ray automatically. "

/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    101             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 103         return func(*args, **kwargs)
    104 
    105     return wrapper

/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in init(address, num_cpus, num_gpus, resources, labels, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1700 
   1701     for hook in _post_init_hooks:
-> 1702         hook()
   1703 
   1704     node_id = global_worker.core_worker.get_current_node_id()

/usr/local/lib/python3.10/dist-packages/ray/tune/registry.py in flush(self)
    306                 self.references[k] = v
    307             else:
--> 308                 self.references[k] = ray.put(v)
    309         self.to_flush.clear()

/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py in auto_init_wrapper(*args, **kwargs)
     22     def auto_init_wrapper(*args, **kwargs):
     23         auto_init_ray()
---> 24         return fn(*args, **kwargs)
     25 
     26     return auto_init_wrapper

/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    101             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 103         return func(*args, **kwargs)
    104 
    105     return wrapper

/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in put(value, _owner)
   2634     with profiling.profile("ray.put"):
   2635         try:
-> 2636             object_ref = worker.put_object(value, owner_address=serialize_owner_address)
   2637         except ObjectStoreFullError:
   2638             logger.info(

/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in put_object(self, value, object_ref, owner_address)
    710                 f"{sio.getvalue()}"
    711             )
--> 712             raise TypeError(msg) from e
    713         # This *must* be the first place that we construct this python
    714         # ObjectRef because an entry with 0 local references is created when

TypeError: Could not serialize the put value <transformers.trainer.Trainer object at 0x7e90dd830340>:
================================================================================
Checking Serializability of <transformers.trainer.Trainer object at 0x7e90dd830340>
================================================================================
!!! FAIL serialization: cannot pickle '_thread.lock' object
    Serializing 'compute_metrics' <function compute_metrics at 0x7e90dd9123b0>...
    !!! FAIL serialization: cannot pickle '_thread.lock' object
    Detected 3 global variables. Checking serializability...
        Serializing 'task' cola...
        Serializing 'np' <module 'numpy' from '/usr/local/lib/python3.10/dist-packages/numpy/__init__.py'>...
        Serializing 'metric' Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
""", stored examples: 0)...
        !!! FAIL serialization: cannot pickle '_thread.lock' object
            Serializing '_build_data_dir' <bound method Metric._build_data_dir of Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: <same GLUE docstring as above>, stored examples: 0)>...
            !!! FAIL serialization: cannot pickle '_thread.lock' object
    Serializing '_add_sm_patterns_to_gitignore' <bound method Trainer._add_sm_patterns_to_gitignore of <transformers.trainer.Trainer object at 0x7e90dd830340>>...
    !!! FAIL serialization: cannot pickle '_thread.lock' object
        Serializing '__func__' <function Trainer._add_sm_patterns_to_gitignore at 0x7e90dd95d7e0>...
    WARNING: Did not find non-serializable object in <bound method Trainer._add_sm_patterns_to_gitignore of <transformers.trainer.Trainer object at 0x7e90dd830340>>. This may be an oversight.
================================================================================
Variable: 

	FailTuple(_build_data_dir [obj=<bound method Metric._build_data_dir of Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: <same GLUE docstring as above>, stored examples: 0)>, parent=Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: <same GLUE docstring as above>, stored examples: 0)])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable. 
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class. 
================================================================================
Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/
================================================================================

The deprecation error has been fixed.

@Shamik-07
Author

What's the above error related to?

@justinvyu
Contributor

Hey @Shamik-07, the Ray Tune integration serializes the HuggingFace Trainer along with your remote function. In this case, a non-serializable metric gets pickled along with the trainer via the compute_metrics parameter.

To fix it:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
+   metric = load_metric('glue', actual_task)  # load the metric inside the method, instead of implicitly pickling it
    return metric.compute(predictions=predictions, references=labels)
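
For completeness, a self-contained version of the fixed function could look like this (a sketch based on the notebook's setup; task and actual_task are assumed to be defined in the notebook's earlier cells):

import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    # Loading the metric here means the non-picklable Metric object is
    # created inside each Ray worker rather than being serialized with
    # the Trainer.
    metric = load_metric("glue", actual_task)
    return metric.compute(predictions=predictions, references=labels)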

@Shamik-07
Author

Thank you very much for the explanation @justinvyu :)

@Shamik-07
Author

Closing this, as it has been fixed by #26499. Thanks to @justinvyu!
