Bug description
I'm having an issue while adapting the fine-tuning logic from this HF tutorial:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb
I can't run distributed training on multiple GPUs: when I run the training script with a config that includes GPUs 0 and 1, I get a Segmentation fault (core dumped) error. I am also using QLoRA.
Please advise.
What version are you seeing the problem on?
master
How to reproduce the bug
import lightning as L

# Create trainer
trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # Use devices from config
    strategy="ddp",
    ...
)
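The LightningModule wraps the quantized model roughly as in the tutorial. Below is a minimal sketch of that setup; the checkpoint name, LoRA target modules, and hyperparameters are illustrative placeholders, not my exact values:

import torch
import lightning as L
from transformers import PaliGemmaForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model


class PaliGemmaModule(L.LightningModule):
    def __init__(self, model_id: str = "google/paligemma-3b-pt-224"):  # placeholder checkpoint
        super().__init__()
        # 4-bit quantization config, following the HF tutorial
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        base_model = PaliGemmaForConditionalGeneration.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            torch_dtype=torch.bfloat16,
        )
        # Attach LoRA adapters so only a small set of parameters is trained
        lora_config = LoraConfig(
            r=8,  # illustrative rank
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            task_type="CAUSAL_LM",
        )
        self.model = get_peft_model(base_model, lora_config)

    def training_step(self, batch, batch_idx):
        # batch is expected to contain input_ids, pixel_values, attention_mask and labels
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        # Only the LoRA parameters require gradients
        trainable = [p for p in self.model.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=2e-5)

The module is then passed to the trainer above with trainer.fit(module, train_dataloader). Training works on a single GPU; the segfault only appears with devices=[0, 1] and strategy="ddp".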
Error messages and logs
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9709.04it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.17s/it]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
Segmentation fault (core dumped)
Environment
pyproject.toml:
transformers = "^4.44.2"
torch = "^2.4.1"
lightning = "^2.4.0"
peft = "^0.13.2"
accelerate = "^1.1.1"
bitsandbytes = "^0.45.0"
More info
No response