Mlflow-related crash when forking with many nodes #6

havardhhaugen · 2024-12-11T07:48:24Z

What happened?

Training crashes on start when forking a run across many nodes (32 nodes with 8 GPUs each). The crash happens in self.auth.authenticate() (line 392 in diagnostics/mlflow/logger.py).

The same crash does not happen when I run the same config with 1 node.

What are the steps to reproduce the bug?

Fork a run on 32 nodes with 8 GPUs each (32x8), with MLflow logging enabled.

Version

commit 2179a59

Platform (OS and architecture)

?

Relevant log output

File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 421, in main
    AnemoiTrainer(config).train()
    ^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 74, in __init__
    self._get_server2server_lineage()
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 343, in _get_server2server_lineage
    self.parent_run_server2server = self.mlflow_logger._parent_run_server2server
                                    ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/conda_container_env/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 201, in mlflow_logger
    return get_mlflow_logger(self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/logger.py", line 72, in get_mlflow_logger
    logger = AnemoiMLflowLogger(
             ^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/logger.py", line 319, in __init__
    run_id, run_name, tags = self._get_mlflow_run_params(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/logger.py", line 392, in _get_mlflow_run_params
    self.auth.authenticate()
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 87, in _wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 166, in authenticate
    response = self._token_request()
               ^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 200, in _token_request
    response = self._request(path, payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 222, in _request
    raise RuntimeError(msg)
RuntimeError: ❌

Accompanying data

No response

Organisation

MetNorway

gmertes · 2024-12-11T16:54:26Z

Hi Håvard,

Would you mind reproducing the bug with some prints added to the code, so I can get a better idea of why it is crashing?

In anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py, on line 221, right before the RuntimeError is raised, add the following:

self.log.info(response_json)
self.log.info(response.content)

Could you send me the output in a message on Slack? It might contain internal information.
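
For readers following along, here is a minimal, self-contained sketch of how the two log calls above might sit in an error path. It is not the actual code in auth.py: the function name `check_response`, its parameters, and the failure condition are all hypothetical, chosen only to show the response body being logged before the exception is raised.

```python
import json
import logging

LOG = logging.getLogger("auth_debug")  # hypothetical logger name


def check_response(status_code: int, content: bytes) -> dict:
    """Hypothetical stand-in for the error path of a token request.

    Parses the body as JSON when possible and logs both the parsed and the
    raw body before raising, so the failure reason ends up in the job log.
    """
    try:
        response_json = json.loads(content)
    except ValueError:
        response_json = None  # body was empty or not valid JSON

    # Assumed failure condition: an HTTP error status or an unparseable body.
    if status_code >= 400 or response_json is None:
        LOG.info(response_json)
        LOG.info(content)
        msg = f"Token request failed with HTTP status {status_code}"
        raise RuntimeError(msg)

    return response_json
```

The traceback in the report ends in a RuntimeError whose message is only "❌", so putting the status code in the message, as this sketch does, would make a failure like this easier to diagnose from the log alone. Since the body may contain internal information, output like this should be shared privately rather than pasted into the issue.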

havardhhaugen added the bug label Dec 11, 2024

havardhhaugen mentioned this issue Dec 11, 2024

Fix mlflow authenticate bug when forking run with multiple nodes ecmwf/anemoi-training#196

Open

JesperDramsch added the training label Dec 19, 2024

JesperDramsch transferred this issue from ecmwf/anemoi-training Dec 19, 2024
