Problems Running on a 4xA100 (80 GB) node... #15

Open · amelie-iska opened this issue Nov 18, 2024 · 9 comments

Comments

@amelie-iska commented Nov 18, 2024

The input MSAs were truncated to a single entry each (duplicates of the input sequences), because leaving the msa: field blank causes errors for some reason.

>101
MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTP
TLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYG
FSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNG
CSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLT
ILDQRPSSRASSHASSRPRPDDLEI
>UniRef100_SecondSeq
FSLESERP

Input YAML:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A,B,C,D,E,F]
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI
      msa: ./examples/msa/shallow-connexin-msa.a3m
  - protein:
      id: [G,H,I,J,K,L]
      sequence: FSLESERP
      msa: ./examples/msa/peptide_102.a3m
  # - ligand:
  #     id: [G,H,I,J,K,L]
  #     smiles: CC(C)C[C@H](NC(=O)[C@H](CO)NC(=O)[C@@H](N)Cc1ccccc1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)O

Command:

boltz predict examples/connexin-peptide.yaml --devices 4 --recycling_steps 10 --diffusion_samples 10 

Output:

Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.10it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.17it/s]
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.37it/s]
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.30it/s]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/lily/mambaforge/envs/boltz/bin/boltz", line 8, in <module>
[rank1]:     sys.exit(cli())
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
[rank1]:     return self.main(*args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1078, in main
[rank1]:     rv = self.invoke(ctx)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
[rank1]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
[rank1]:     return ctx.invoke(self.callback, **ctx.params)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 783, in invoke
[rank1]:     return __callback(*args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/boltz/main.py", line 395, in predict
[rank1]:     trainer.predict(
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
[rank1]:     return call._call_and_handle_interrupt(
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
[rank1]:     results = self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
[rank1]:     results = self._run_stage()
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in _run_stage
[rank1]:     return self.predict_loop.run()
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
[rank1]:     return loop_run(self, *args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/loops/prediction_loop.py", line 104, in run
[rank1]:     self.setup_data()
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/loops/prediction_loop.py", line 157, in setup_data
[rank1]:     dl = _process_dataloader(trainer, trainer_fn, stage, dl)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 484, in _process_dataloader
[rank1]:     dataloader = trainer._data_connector._prepare_dataloader(dataloader, shuffle=is_shuffled, mode=stage)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 189, in _prepare_dataloader
[rank1]:     sampler = self._resolve_sampler(dataloader, shuffle=shuffle, mode=mode)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 200, in _resolve_sampler
[rank1]:     sampler = _get_distributed_sampler(
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 239, in _get_distributed_sampler
[rank1]:     return UnrepeatedDistributedSamplerWrapper(dataloader.sampler, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/overrides/distributed.py", line 228, in __init__
[rank1]:     super().__init__(_DatasetSamplerWrapper(sampler), *args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/overrides/distributed.py", line 201, in __init__
[rank1]:     assert self.num_samples >= 1 or self.total_size == 0
[rank1]: AssertionError
[rank2] and [rank3] fail with tracebacks identical to rank 1, ending in the same AssertionError.
[rank: 1] Child process with PID 3680494 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
Killed
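
For context, the failing assertion reduces to simple arithmetic: with a single prediction record split across four ranks by an unrepeated distributed sampler, ranks 1-3 receive zero samples. A minimal sketch of that split (illustrative only; not Lightning's actual code):

total_size = 1    # one prediction record in the input
num_replicas = 4  # --devices 4
for rank in range(num_replicas):
    num_samples = len(range(rank, total_size, num_replicas))
    print(rank, num_samples, num_samples >= 1 or total_size == 0)
# rank 0 -> 1 sample (passes); ranks 1-3 -> 0 samples -> AssertionError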
@jwohlwend (Owner)

Hi! Would you mind checking if it passes on a single GPU? I have an idea of what might be going on but just wish to confirm. Thanks!
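
For example, the same command restricted to one device (only --devices changed):

boltz predict examples/connexin-peptide.yaml --devices 1 --recycling_steps 10 --diffusion_samples 10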

@amelie-iska (Author)

It'll run less demanding inputs on a single GPU (the provided example runs).

@jwohlwend (Owner)

Thanks, I'll investigate. But just to be clear: multi-GPU mode will not allow you to run larger inputs. It's meant to run multiple input files (provided as a directory) in parallel, not to parallelize the model itself (i.e., we do not do any sharding).
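
For example (the directory and file names here are hypothetical), a folder of input YAMLs can be spread across the four GPUs, one file per process:

boltz predict examples/batch_inputs/ --devices 4

where examples/batch_inputs/ contains four or more YAML files, one per prediction.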

@amelie-iska (Author)

VRAM limit option then?

@jwohlwend (Owner)

Which, now that I think about it, actually explains your issue: since there is only one example, the other 3 GPUs have nothing to do. I'll add a warning around this and make num_devices = min(num_samples, num_devices).
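
A minimal sketch of that guard (hypothetical names; not the actual boltz/main.py code):

import warnings

def clamp_devices(requested_devices: int, num_records: int) -> int:
    # Never launch more devices than there are input records, so no
    # rank is left with an empty dataloader (the AssertionError above).
    devices = min(requested_devices, num_records)
    if devices < requested_devices:
        warnings.warn(
            f"Only {num_records} input record(s); using {devices} device(s)."
        )
    return devices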

@jwohlwend (Owner)

> VRAM limit option then?

Yeah, we need to do some measurements on this. We're going to add a chunking feature in the next day or so to allow larger inputs at the cost of some slowdown; hopefully that helps!
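
To illustrate the general idea (this is not Boltz's actual implementation; chunked_apply, fn, and chunk_size are placeholder names), chunking trades peak memory for extra passes by processing a large tensor in slices:

import torch

def chunked_apply(fn, x, chunk_size):
    # Peak activation memory now scales with chunk_size rather than
    # with the full length of x; smaller chunks use less memory but
    # take more passes.
    outs = [fn(x[i:i + chunk_size]) for i in range(0, x.shape[0], chunk_size)]
    return torch.cat(outs, dim=0)

# e.g. out = chunked_apply(pair_block, pair_feats, chunk_size=256)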

@amelie-iska (Author)

🤞

@xiaolinpan

> VRAM limit option then?
>
> Yeah, we need to do some measurements on this. We're going to add a chunking feature in the next day or so to allow larger inputs at the cost of some slowdown; hopefully that helps!

Is the chunking feature working now? I tried to generate a structure for a large protein, and it always runs out of memory on an A100-80G.

@jwohlwend (Owner)

The PR is open; we're just working out what the default behavior should be and will merge very soon.
