What happened?
Training crashes on start when trying to fork a run with many nodes (32 nodes with 8 GPUs each). The crash happens in self.auth.authenticate() (line 392 in diagnostics/mlflow/logger.py).
The same crash does not happen when I run the same config on a single node.
What are the steps to reproduce the bug?
Fork a run with 32x8 GPUs and MLflow logging enabled.
Version
commit 2179a59
Platform (OS and architecture)
?
Relevant log output
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 421, in main
AnemoiTrainer(config).train()
^^^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 74, in __init__
self._get_server2server_lineage()
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 343, in _get_server2server_lineage
self.parent_run_server2server = self.mlflow_logger._parent_run_server2server
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/conda_container_env/lib/python3.11/functools.py", line 1001, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 201, in mlflow_logger
return get_mlflow_logger(self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/logger.py", line 72, in get_mlflow_logger
logger = AnemoiMLflowLogger(
^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/logger.py", line 319, in __init__
run_id, run_name, tags = self._get_mlflow_run_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/logger.py", line 392, in _get_mlflow_run_params
self.auth.authenticate()
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 87, in _wrapper
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 166, in authenticate
response = self._token_request()
^^^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 200, in _token_request
response = self._request(path, payload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 222, in _request
raise RuntimeError(msg)
RuntimeError: ❌
Accompanying data
No response
Organisation
MetNorway
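Possibly related: with 32x8 ranks, every process builds the MLflow logger and calls self.auth.authenticate() at roughly the same time, so the token endpoint may be rejecting the burst of concurrent requests (the single-node run would not trigger this). Below is a minimal, untested sketch of a retry-with-backoff guard that could be tried around the failing call (response = self._request(path, payload) in auth.py). The helper name with_retries and its parameters are illustrative placeholders, not part of the anemoi-training API.

import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying with exponential backoff and jitter on RuntimeError.

    Illustrative only: fn stands in for the token request that currently
    raises RuntimeError when the auth server rejects the request.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts:
                raise
            # Sleep 1s, 2s, 4s, ... (capped), plus jitter so the 256 ranks
            # do not retry in lockstep.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, 1))


# Hypothetical usage inside the auth module's _token_request():
# response = with_retries(lambda: self._request(path, payload))

An alternative, if the server is rate limiting, would be to authenticate only on one rank (e.g. gated on the SLURM process id) and have the other ranks wait, but I have not tested either approach.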