Mlflow-related crash when forking with many nodes #6

Open
havardhhaugen opened this issue Dec 11, 2024 · 1 comment
Labels
bug (Something isn't working), training

Comments

@havardhhaugen
Contributor

What happened?

Training crashes at startup when forking a run on many nodes (32 nodes with 8 GPUs each). The crash happens in self.auth.authenticate() (line 392 in diagnostics/mlflow/logger.py), which fails with a RuntimeError raised from _request in diagnostics/mlflow/auth.py.

The crash does not happen when I run the same config on 1 node.

What are the steps to reproduce the bug?

Fork a run on 32 nodes with 8 GPUs each (32x8), with MLflow logging enabled.

Version

commit 2179a59

Platform (OS and architecture)

?

Relevant log output

File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 421, in main
    AnemoiTrainer(config).train()
    ^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 74, in __init__
    self._get_server2server_lineage()
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 343, in _get_server2server_lineage
    self.parent_run_server2server = self.mlflow_logger._parent_run_server2server
                                    ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/conda_container_env/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/train/train.py", line 201, in mlflow_logger
    return get_mlflow_logger(self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/logger.py", line 72, in get_mlflow_logger
    logger = AnemoiMLflowLogger(
             ^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/logger.py", line 319, in __init__
    run_id, run_name, tags = self._get_mlflow_run_params(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/logger.py", line 392, in _get_mlflow_run_params
    self.auth.authenticate()
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 87, in _wrapper
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 166, in authenticate
    response = self._token_request()
               ^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 200, in _token_request
    response = self._request(path, payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep4/scratch/project_465001383/haugenha/anemoi-training-ref-updated/run-anemoi/lumi/anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py", line 222, in _request
    raise RuntimeError(msg)
RuntimeError: ❌

Accompanying data

No response

Organisation

MetNorway

@gmertes
Member

gmertes commented Dec 11, 2024

Hi Håvard,

Would you mind reproducing the bug with some extra logging in the code, so I can get a better idea of why it is crashing?

In anemoi-training/src/anemoi/training/diagnostics/mlflow/auth.py, on line 221, right before the RuntimeError is raised, add the following:

self.log.info(response_json)
self.log.info(response.content)

Could you send me the output in a message on Slack rather than posting it here? It might contain internal information.
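
For context, here is a minimal sketch of where those two log lines would sit relative to the raise. The real body of `_request` in auth.py is not shown in this thread, so the surrounding structure is an assumption: `check_response`, `FakeResponse`, the status-code check, and the error message are hypothetical names made up for illustration. Only `response_json`, `response.content`, and the `RuntimeError` come from the comment and traceback above.

```python
import logging

LOG = logging.getLogger("mlflow_auth_debug")  # hypothetical logger name


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object, for illustration only."""

    def __init__(self, status_code, content, payload=None):
        self.status_code = status_code
        self.content = content
        self._payload = payload

    def json(self):
        if self._payload is None:
            raise ValueError("response body is not JSON")
        return self._payload


def check_response(response, log=LOG):
    """Log the decoded and raw body right before raising, so the reason is visible."""
    try:
        response_json = response.json()
    except ValueError:
        # The body may not be JSON at all (for example an HTML error page).
        response_json = None

    if response.status_code != 200:
        # The two debug lines requested in the comment above.
        log.info(response_json)
        log.info(response.content)
        msg = f"Token request failed with status {response.status_code}"  # assumed message
        raise RuntimeError(msg)

    return response_json
```

Logging both the decoded JSON and the raw content covers the case where the server returns a non-JSON error body. As noted above, the output may contain internal information, so it is better shared privately than pasted into the issue.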

@JesperDramsch JesperDramsch transferred this issue from ecmwf/anemoi-training Dec 19, 2024