Hello,

When attempting to fine-tune in a multi-GPU SLURM environment, I get the following error:
Mon Sep 18 09:20:06 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:1A:00.0 Off | 0 |
| N/A 31C P0 44W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2-32GB On | 00000000:1C:00.0 Off | 0 |
| N/A 27C P0 41W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2-32GB On | 00000000:1D:00.0 Off | 0 |
| N/A 28C P0 42W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2-32GB On | 00000000:1E:00.0 Off | 0 |
| N/A 30C P0 42W / 300W | 0MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
MASTER_ADDR=mg089
GpuFreq=control_disabled
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /etc/localtime required more than 50 (80) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
run()
File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
trainer.fit(model, train_data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
self.pre_dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
self.accelerator.pre_dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
self.training_type_plugin.pre_dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
self.init_deepspeed()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
self._configure_with_arguments(args, mpu)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
args.local_rank = int(os.environ['LOCAL_RANK'])
File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1
[the optimizer/scheduler notice, the "N GPUS / N NODES" lines, and the identical KeyError: 'LOCAL_RANK' traceback are repeated verbatim by each of the remaining three ranks]
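From the traceback, DeepSpeed reads `os.environ['LOCAL_RANK']` directly (engine.py line 499), but tasks launched through `srun` only receive the local rank as `SLURM_LOCALID`. A minimal, untested sketch of a workaround (assuming the missing variable is the only problem) would be to map one onto the other at the top of the training script, before `trainer.fit()`:

```python
import os

# DeepSpeed expects LOCAL_RANK, but SLURM-launched tasks only get
# SLURM_LOCALID. Copying it over before trainer.fit() is an untested
# guess at a workaround, not a confirmed fix.
if "LOCAL_RANK" not in os.environ and "SLURM_LOCALID" in os.environ:
    os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]
```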
My SLURM job request is as follows:
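Roughly the following (paths, container image name, and time limit are placeholders; the GPU and node counts match the log above):

```bash
#!/bin/bash
#SBATCH --nodes=1                 # N NODES: 1, per the log
#SBATCH --ntasks-per-node=4       # one task per GPU; N GPUS: 4
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00           # placeholder

# First node in the allocation acts as the rendezvous host
# (the log prints MASTER_ADDR=mg089).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
echo "MASTER_ADDR=$MASTER_ADDR"

# Run inside Singularity (the "INFO: underlay ..." lines come from it);
# image name and module path are placeholders reconstructed from the traceback.
srun singularity exec --nv chemformer.sif \
    python -m temp.train_boring
```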
I am using the original library versions specified in the documentation. Is this behavior expected?