Problems Running on a 4xA100 (80 GB) node... #15

Open · amelie-iska opened this issue Nov 18, 2024 · 9 comments

Comments

@amelie-iska commented Nov 18, 2024

The input MSAs were truncated to a single entry each (duplicates of the input sequences), because leaving the msa: field blank causes errors for some reason.

>101
MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTP
TLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYG
FSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNG
CSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLT
ILDQRPSSRASSHASSRPRPDDLEI
>UniRef100_SecondSeq
FSLESERP

Input YAML:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A,B,C,D,E,F]
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI
      msa: ./examples/msa/shallow-connexin-msa.a3m
  - protein:
      id: [G,H,I,J,K,L]
      sequence: FSLESERP
      msa: ./examples/msa/peptide_102.a3m
  # - ligand:
  #     id: [G,H,I,J,K,L]
  #     smiles: CC(C)C[C@H](NC(=O)[C@H](CO)NC(=O)[C@@H](N)Cc1ccccc1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)O

Command:

boltz predict examples/connexin-peptide.yaml --devices 4 --recycling_steps 10 --diffusion_samples 10 

Output:

Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.10it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Downloading data and model to /home/lily/.boltz. You may change this by setting the --cache flag.
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.17it/s]
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.37it/s]
Checking input data.
Processing input data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.30it/s]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/lily/mambaforge/envs/boltz/bin/boltz", line 8, in <module>
[rank1]:     sys.exit(cli())
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
[rank1]:     return self.main(*args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1078, in main
[rank1]:     rv = self.invoke(ctx)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
[rank1]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
[rank1]:     return ctx.invoke(self.callback, **ctx.params)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/click/core.py", line 783, in invoke
[rank1]:     return __callback(*args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/boltz/main.py", line 395, in predict
[rank1]:     trainer.predict(
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
[rank1]:     return call._call_and_handle_interrupt(
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
[rank1]:     results = self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
[rank1]:     results = self._run_stage()
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in _run_stage
[rank1]:     return self.predict_loop.run()
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
[rank1]:     return loop_run(self, *args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/loops/prediction_loop.py", line 104, in run
[rank1]:     self.setup_data()
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/loops/prediction_loop.py", line 157, in setup_data
[rank1]:     dl = _process_dataloader(trainer, trainer_fn, stage, dl)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 484, in _process_dataloader
[rank1]:     dataloader = trainer._data_connector._prepare_dataloader(dataloader, shuffle=is_shuffled, mode=stage)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 189, in _prepare_dataloader
[rank1]:     sampler = self._resolve_sampler(dataloader, shuffle=shuffle, mode=mode)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 200, in _resolve_sampler
[rank1]:     sampler = _get_distributed_sampler(
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 239, in _get_distributed_sampler
[rank1]:     return UnrepeatedDistributedSamplerWrapper(dataloader.sampler, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/overrides/distributed.py", line 228, in __init__
[rank1]:     super().__init__(_DatasetSamplerWrapper(sampler), *args, **kwargs)
[rank1]:   File "/home/lily/mambaforge/envs/boltz/lib/python3.9/site-packages/pytorch_lightning/overrides/distributed.py", line 201, in __init__
[rank1]:     assert self.num_samples >= 1 or self.total_size == 0
[rank1]: AssertionError
[rank2] and [rank3] fail with tracebacks identical to rank 1, ending in the same AssertionError.
[rank: 1] Child process with PID 3680494 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
Killed
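
For context, the failing assertion reduces to simple arithmetic: with a single prediction record split across four ranks by an unrepeated distributed sampler, ranks 1-3 receive zero samples. A minimal sketch of that split (illustrative only; not Lightning's actual code):

total_size = 1    # one prediction record in the input
num_replicas = 4  # --devices 4
for rank in range(num_replicas):
    num_samples = len(range(rank, total_size, num_replicas))
    print(rank, num_samples, num_samples >= 1 or total_size == 0)
# rank 0 -> 1 sample (passes); ranks 1-3 -> 0 samples -> AssertionError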
@jwohlwend (Owner)

Hi! Would you mind checking if it passes on a single GPU? I have an idea of what might be going on but just wish to confirm. Thanks!
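
For example, the same command restricted to one device (only --devices changed):

boltz predict examples/connexin-peptide.yaml --devices 1 --recycling_steps 10 --diffusion_samples 10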

@amelie-iska (Author)

It'll run less demanding inputs on a single GPU (the provided example runs).

@jwohlwend (Owner)

Thanks, I'll investigate. But just to be clear: multi-GPU mode will not allow you to run larger inputs. It's meant to run multiple input files (provided as a directory) in parallel, not to parallelize the model itself (i.e., we do not do any sharding).
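
For example (the directory and file names here are hypothetical), a folder of input YAMLs can be spread across the four GPUs, one file per process:

boltz predict examples/batch_inputs/ --devices 4

where examples/batch_inputs/ contains four or more YAML files, one per prediction.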

@amelie-iska (Author)

VRAM limit option then?

@jwohlwend (Owner)

Which, now that I think about it, actually explains your issue: since there is only one example, the other 3 GPUs have nothing to do. I'll add a warning around this and make num_devices = min(num_samples, num_devices).
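
A minimal sketch of that guard (hypothetical names; not the actual boltz/main.py code):

import warnings

def clamp_devices(requested_devices: int, num_records: int) -> int:
    # Never launch more devices than there are input records, so no
    # rank is left with an empty dataloader (the AssertionError above).
    devices = min(requested_devices, num_records)
    if devices < requested_devices:
        warnings.warn(
            f"Only {num_records} input record(s); using {devices} device(s)."
        )
    return devices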

@jwohlwend (Owner)

> VRAM limit option then?

Yeah, we need to do some measurements on this. We're going to add a chunking feature in the next day or so to allow larger inputs at the cost of some slowdown; hopefully that helps!
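
To illustrate the general idea (this is not Boltz's actual implementation; chunked_apply, fn, and chunk_size are placeholder names), chunking trades peak memory for extra passes by processing a large tensor in slices:

import torch

def chunked_apply(fn, x, chunk_size):
    # Peak activation memory now scales with chunk_size rather than
    # with the full length of x; smaller chunks use less memory but
    # take more passes.
    outs = [fn(x[i:i + chunk_size]) for i in range(0, x.shape[0], chunk_size)]
    return torch.cat(outs, dim=0)

# e.g. out = chunked_apply(pair_block, pair_feats, chunk_size=256)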

@amelie-iska (Author)

🤞

@xiaolinpan

> VRAM limit option then?
>
> Yeah, we need to do some measurements on this. We're going to add a chunking feature in the next day or so to allow larger inputs at the cost of some slowdown; hopefully that helps!

Is the chunking feature working now? I tried to generate a structure for a large protein, and it always runs out of memory on an A100-80G.

@jwohlwend (Owner)

The PR is open; we're just working out what the default behavior should be and will merge very soon.
