
Hyperparameter search error with Ray tune #27598

Closed

Shamik-07 opened this issue Nov 20, 2023 · 6 comments · Fixed by #26499

@Shamik-07

System Info

Hello,

The version of Ray is 2.8.0 and the version of Transformers is 4.35.2.
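
For reference, the versions can be confirmed with a quick check (a minimal sketch; assumes both packages are importable in the active environment):

import ray
import transformers

print(ray.__version__)           # 2.8.0 here
print(transformers.__version__)  # 4.35.2 here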

I am trying to run the hyperparameter search with Ray Tune for the notebook notebooks/examples/text_classification.ipynb from huggingface/notebooks, and I am getting the following error:

---------------------------------------------------------------------------
DeprecationWarning                        Traceback (most recent call last)
<ipython-input-33-12c3f54763db> in <cell line: 1>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

3 frames
/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py in with_parameters(trainable, **kwargs)
    313             )
    314 
--> 315             raise DeprecationWarning(_CHECKPOINT_DIR_ARG_DEPRECATION_MSG)
    316 
    317         def inner(config):

DeprecationWarning: Accepting a `checkpoint_dir` argument in your training function is deprecated.
Please use `ray.train.get_checkpoint()` to access your checkpoint as a
`ray.train.Checkpoint` object instead. See below for an example:

Before
------

from ray import tune

def train_fn(config, checkpoint_dir=None):
    if checkpoint_dir:
        torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
    ...

tuner = tune.Tuner(train_fn)
tuner.fit()

After
-----

from ray import train, tune

def train_fn(config):
    checkpoint: train.Checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
    ...

tuner = tune.Tuner(train_fn)
tuner.fit()

Who can help?

@muellerzr / @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running the hyperparameter search with Ray Tune, as shown below.
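
Concretely, the failing call is the one from the notebook (with the Trainer configured as in the text-classification example):

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")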

Expected behavior

Hyperparameter search trials run with Ray Tune.

@pacman100
Contributor

Hello, this PR #26499 might fix this issue. Could you please try it out and let us know?
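
If it helps, one way to try the PR from a Colab/Jupyter cell is to install Transformers directly from the PR ref (a sketch; refs/pull/<N>/head is GitHub's read-only alias for a pull request's branch):

!pip install "git+https://github.com/huggingface/transformers.git@refs/pull/26499/head"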

@muellerzr linked a pull request Nov 20, 2023 that will close this issue
@Shamik-07
Author

I tried running the notebook with the PR; however, I now get a different error:

2023-11-20 16:02:53,411	INFO worker.py:1673 -- Started a local Ray instance.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in put_object(self, value, object_ref, owner_address)
    702         try:
--> 703             serialized_value = self.get_serialization_context().serialize(value)
    704         except TypeError as e:

18 frames
/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in serialize(self, value)
    493         else:
--> 494             return self._serialize_to_msgpack(value)

/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in _serialize_to_msgpack(self, value)
    471             metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
--> 472             pickle5_serialized_object = self._serialize_to_pickle5(
    473                 metadata, python_objects

/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in _serialize_to_pickle5(self, metadata, value)
    424             self.get_and_clear_contained_object_refs()
--> 425             raise e
    426         finally:

/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py in _serialize_to_pickle5(self, metadata, value)
    419             self.set_in_band_serialization()
--> 420             inband = pickle.dumps(
    421                 value, protocol=5, buffer_callback=writer.buffer_callback

/usr/local/lib/python3.10/dist-packages/ray/cloudpickle/cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     87             cp = CloudPickler(file, protocol=protocol, buffer_callback=buffer_callback)
---> 88             cp.dump(obj)
     89             return file.getvalue()

/usr/local/lib/python3.10/dist-packages/ray/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    732         try:
--> 733             return Pickler.dump(self, obj)
    734         except RuntimeError as e:

TypeError: cannot pickle '_thread.lock' object

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-38-12c3f54763db> in <cell line: 1>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

/content/transformers/src/transformers/trainer.py in hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   2548         self.compute_objective = default_compute_objective if compute_objective is None else compute_objective
   2549 
-> 2550         best_run = backend_obj.run(self, n_trials, direction, **kwargs)
   2551 
   2552         self.hp_search_backend = None

/content/transformers/src/transformers/hyperparameter_search.py in run(self, trainer, n_trials, direction, **kwargs)
     85 
     86     def run(self, trainer, n_trials: int, direction: str, **kwargs):
---> 87         return run_hp_search_ray(trainer, n_trials, direction, **kwargs)
     88 
     89     def default_hp_space(self, trial):

/content/transformers/src/transformers/integrations/integration_utils.py in run_hp_search_ray(trainer, n_trials, direction, **kwargs)
    352         dynamic_modules_import_trainable.__mixins__ = trainable.__mixins__
    353 
--> 354     analysis = ray.tune.run(
    355         dynamic_modules_import_trainable,
    356         config=trainer.hp_space(None),

/usr/local/lib/python3.10/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, storage_path, storage_filesystem, search_alg, scheduler, checkpoint_config, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, reuse_actors, raise_on_failed_trial, callbacks, max_concurrent_trials, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, chdir_to_trial_dir, local_dir, _experiment_checkpoint_dir, _remote, _remote_string_queue, _entrypoint)
    509         }
    510 
--> 511     _ray_auto_init(entrypoint=error_message_map["entrypoint"])
    512 
    513     if _remote is None:

/usr/local/lib/python3.10/dist-packages/ray/tune/tune.py in _ray_auto_init(entrypoint)
    217         logger.info("'TUNE_DISABLE_AUTO_INIT=1' detected.")
    218     elif not ray.is_initialized():
--> 219         ray.init()
    220         logger.info(
    221             "Initializing Ray automatically. "

/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    101             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 103         return func(*args, **kwargs)
    104 
    105     return wrapper

/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in init(address, num_cpus, num_gpus, resources, labels, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1700 
   1701     for hook in _post_init_hooks:
-> 1702         hook()
   1703 
   1704     node_id = global_worker.core_worker.get_current_node_id()

/usr/local/lib/python3.10/dist-packages/ray/tune/registry.py in flush(self)
    306                 self.references[k] = v
    307             else:
--> 308                 self.references[k] = ray.put(v)
    309         self.to_flush.clear()

/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py in auto_init_wrapper(*args, **kwargs)
     22     def auto_init_wrapper(*args, **kwargs):
     23         auto_init_ray()
---> 24         return fn(*args, **kwargs)
     25 
     26     return auto_init_wrapper

/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    101             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 103         return func(*args, **kwargs)
    104 
    105     return wrapper

/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in put(value, _owner)
   2634     with profiling.profile("ray.put"):
   2635         try:
-> 2636             object_ref = worker.put_object(value, owner_address=serialize_owner_address)
   2637         except ObjectStoreFullError:
   2638             logger.info(

/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py in put_object(self, value, object_ref, owner_address)
    710                 f"{sio.getvalue()}"
    711             )
--> 712             raise TypeError(msg) from e
    713         # This *must* be the first place that we construct this python
    714         # ObjectRef because an entry with 0 local references is created when

TypeError: Could not serialize the put value <transformers.trainer.Trainer object at 0x7e90dd830340>:
================================================================================
Checking Serializability of <transformers.trainer.Trainer object at 0x7e90dd830340>
================================================================================
!!! FAIL serialization: cannot pickle '_thread.lock' object
    Serializing 'compute_metrics' <function compute_metrics at 0x7e90dd9123b0>...
    !!! FAIL serialization: cannot pickle '_thread.lock' object
    Detected 3 global variables. Checking serializability...
        Serializing 'task' cola...
        Serializing 'np' <module 'numpy' from '/usr/local/lib/python3.10/dist-packages/numpy/__init__.py'>...
        Serializing 'metric' Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
""", stored examples: 0)...
        !!! FAIL serialization: cannot pickle '_thread.lock' object
            Serializing '_build_data_dir' <bound method Metric._build_data_dir of Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: <same GLUE docstring as above>, stored examples: 0)>...
            !!! FAIL serialization: cannot pickle '_thread.lock' object
    Serializing '_add_sm_patterns_to_gitignore' <bound method Trainer._add_sm_patterns_to_gitignore of <transformers.trainer.Trainer object at 0x7e90dd830340>>...
    !!! FAIL serialization: cannot pickle '_thread.lock' object
        Serializing '__func__' <function Trainer._add_sm_patterns_to_gitignore at 0x7e90dd95d7e0>...
    WARNING: Did not find non-serializable object in <bound method Trainer._add_sm_patterns_to_gitignore of <transformers.trainer.Trainer object at 0x7e90dd830340>>. This may be an oversight.
================================================================================
Variable: 

	FailTuple(_build_data_dir [obj=<bound method Metric._build_data_dir of Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: <same GLUE docstring as above>, stored examples: 0)>, parent=Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: <same GLUE docstring as above>, stored examples: 0)])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable. 
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class. 
================================================================================
Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/
================================================================================

The deprecation error has been fixed.

@Shamik-07
Author

What's the above error related to?

@justinvyu
Contributor

Hey @Shamik-07, the Ray Tune integration serializes the HuggingFace Trainer along with your remote function. In this case, a non-serializable metric gets pickled along with the trainer via the compute_metrics parameter.

To fix it:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
+   metric = load_metric('glue', actual_task)  # load the metric inside the method, instead of implicitly pickling it
    return metric.compute(predictions=predictions, references=labels)
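
For completeness, a self-contained version of the fixed function could look like this (a sketch based on the notebook's setup; task and actual_task are assumed to be defined in the notebook's earlier cells):

import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    # Loading the metric here means the non-picklable Metric object is
    # created inside each Ray worker rather than being serialized with
    # the Trainer.
    metric = load_metric("glue", actual_task)
    return metric.compute(predictions=predictions, references=labels)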

@Shamik-07
Author

Thank you very much for the explanation @justinvyu :)

@Shamik-07
Author

Closing this, as it has been fixed by #26499. Thanks to @justinvyu!
