SFTTrainer not using both GPUs #1303
Ah, belatedly I found #921, which looks promising.
Fix:
import os
from transformers import AutoModelForCausalLM

local_rank = os.getenv("LOCAL_RANK")
device_string = "cuda:" + str(local_rank)
...
model = AutoModelForCausalLM.from_pretrained(
    ...
    device_map={'': device_string}  # pin this process's full model copy to its own GPU
)
I'm happy to close the issue, but maybe it's useful for others in the same boat.
Yes, this is exactly the way to load the model with multi-GPU and train it. You can also get the device index from accelerate:

from accelerate import PartialState
device_string = PartialState().process_index

I also feel this is a common scenario; would you be happy to update the documentation of SFTTrainer by adding a new section in this file: https://github.com/huggingface/trl/blob/main/docs/source/sft_trainer.mdx ?
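Putting the two snippets above together, a minimal sketch of the per-process device placement (the model id and the launch command are illustrative assumptions, not taken from this thread):

```python
from accelerate import PartialState
from transformers import AutoModelForCausalLM

# Each DDP process places its full copy of the model on its own GPU.
device_string = PartialState().process_index

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    device_map={"": device_string},
)
```

Run with something like `accelerate launch train.py` so that one process is spawned per GPU.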
@younesbelkada something like this OK? #1308
Amazing, thank you @johnowhitaker !
Thanks for the clear issue and resolution - very helpful in getting DDP to work. @younesbelkada, I noticed that using DDP (for this case) seems to take up more VRAM (more easily runs into CUDA OOM) than running with PP (just setting device_map='auto'). Although, DDP does seem to be faster than PP (less time for the same number of steps). Is that to be expected? It's not entirely intuitive to me from the docs. I have run a similar script to above using PP (with python script.py) and DDP (accelerate launch script.py):
The script above runs fine in PP even when I train/save other modules in the LoRA config. But, for DDP, that results in OOM. For comparison, when I ran the script above without other modules being saved, but varying the batch size up to 16, I got OOM with both the PP and DDP approaches. For another comparison, I was able to run DDP with trainable/saveable added modules on TinyLlama with no OOM issues (obviously that's a much smaller model, but it tests whether the added modules pose an issue). So, I'm a bit puzzled why DDP seems to take more VRAM than PP (especially when adding trainable modules). Why is this? EDIT: I'm unclear on whether setting device_map = 'auto' and running 'python script.py' defaults to pipeline parallel or DP (see issue). I'm referring to PP above, but maybe I really mean DP.
I'm unclear as well! I'm guessing setting device_map = 'auto' and running 'python script.py' defaults to naive pipeline parallel. |
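For anyone comparing the two setups, a rough sketch of the difference as commonly understood (illustrative only, with a placeholder model id): device_map='auto' shards the layers of a single model copy across the visible GPUs and runs as one process, while under DDP every process loads a full replica onto its own GPU, which is one plausible reason per-GPU memory use is higher.

```python
from accelerate import PartialState
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder

# Naive model parallelism: run with plain `python script.py`.
# One copy of the model, with its layers spread across all visible GPUs.
model_sharded = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# DDP: run with `accelerate launch script.py`.
# Each process loads a complete replica onto its own GPU.
model_replica = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map={"": PartialState().process_index}
)
```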
This works for me:
gradient_checkpointing = False,
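For context, that flag is the gradient_checkpointing argument of TrainingArguments. A minimal sketch of where it goes; the commented-out alternative keeps checkpointing on but switches to non-reentrant checkpointing, which is another commonly suggested fix for the "Expected to mark a variable ready only once" error under DDP (treat the exact kwargs as an assumption to verify against your transformers version):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=False,  # the workaround from the comment above
    # Alternative: keep checkpointing but make it DDP-friendly.
    # gradient_checkpointing=True,
    # gradient_checkpointing_kwargs={"use_reentrant": False},
)
```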
I see, so can we only run scripts through accelerate, or is there a way to run this from Python code? For example, I'm using a notebook where I have all the configurations. All I want is to run the trainer.train() function with accelerate; how can I do that?
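One way to do this from a notebook (not covered elsewhere in this thread, so treat it as a suggestion to verify) is accelerate's notebook_launcher, which spawns the training function across processes; a minimal sketch with the training body left as an assumption:

```python
from accelerate import notebook_launcher

def train_fn():
    # Build the model, dataset, and SFTTrainer inside this function so each
    # spawned process constructs its own copy, then call trainer.train().
    ...

# Spawn one process per GPU (two here) from inside the notebook.
notebook_launcher(train_fn, args=(), num_processes=2)
```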
I am trying to fine-tune Llama 2 7B with QLoRA on 2 GPUs. From what I've read SFTTrainer should support multiple GPUs just fine, but when I run this I see one GPU with high utilization and one with almost none:
Expected behaviour would be that both get used during training and it would be about 2x as fast as single-GPU training. I'm running this with python train.py, which I think means Trainer uses DP? I get an error launching with python -m torch.distributed.launch train.py (RuntimeError: Expected to mark a variable ready only once...), which makes me think DDP would need a bit more work... This is an older machine without any fast interconnect, but I saw similar usage on a cloud machine with 2x A5000s, so I don't think it's that. Anyway, maybe someone can help by explaining why DP might be so slow in this case and/or how to test DDP instead :)
Script: