Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323
Comments
Hi @csiefer2, thanks for opening this issue! @muellerzr and @stevhliu are best placed to comment on this in general. In the meantime, you can find some accelerate docs on distributed training here: https://huggingface.co/docs/transformers/en/accelerate#distributed-training-with--accelerate
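For reference, a minimal multi-GPU launch of the example script along the lines of those docs looks roughly like this. It is only a sketch: the model and dataset names are placeholders, not from this thread, and it is plain data parallelism, so each GPU still holds a full copy of the model.

```bash
# Minimal sketch: data-parallel launch of the stock example script on 4 GPUs.
# Model and dataset below are placeholders, not taken from this thread.
accelerate launch --multi_gpu --num_processes 4 run_clm.py \
  --model_name_or_path openai-community/gpt2 \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 1 \
  --do_train --do_eval \
  --output_dir /tmp/clm-multi-gpu
```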
My attempts to do that have not been successful... run_clm seems happy to fill up the memory of however many GPUs I tell it to use and then die when it finally exceeds the memory limits. That's why I was asking the question :)
For instance, with the v4.41-release branch of transformers off of GitHub, if I grab 4 A100s with 80GB of memory each and launch run_clm.py across all of them, it runs itself out of memory, with a model whose gradients accelerate tells me would have fit on 1-2 GPUs (depending on whether I'm float32 or float16), and it dies with CUDA out-of-memory errors.
Clearly I'm doing something wrong.
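For context on that failure mode: with plain data parallelism every process keeps a full copy of the weights, gradients, and optimizer state, so adding GPUs does not lower the per-GPU footprint. Splitting a model that is too big for one card requires a sharding backend such as FSDP or DeepSpeed, which run_clm.py accepts through the standard TrainingArguments flags. The sketch below is illustrative only, not the command from the original report; the model and dataset names are stand-ins, and the flag spellings are the v4.41-era ones (newer releases prefer passing the wrap policy via --fsdp_config).

```bash
# Hypothetical FSDP launch of the language-modeling example; model/dataset are
# stand-ins, not taken from the thread. FSDP shards parameters, gradients, and
# Adam state across the 4 processes instead of replicating them on each GPU.
torchrun --nproc_per_node 4 run_clm.py \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing \
  --bf16 \
  --fsdp "full_shard auto_wrap" \
  --fsdp_transformer_layer_cls_to_wrap MistralDecoderLayer \
  --do_train \
  --output_dir /tmp/mistral-7b-fsdp
```

An equivalent setup can be reached by answering the FSDP questions in `accelerate config` and then running `accelerate launch run_clm.py ...` with the same script arguments.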
@amyeroberts That example doesn't use the run_clm.py script, though.
@csiefer2 I can't comment on the memory calculation from accelerate (cc @muellerzr here), but I'm assuming this is just for the weights of the model + gradients on the forward/backward pass? You'll also need to account for the memory requirements of loading the data onto the GPU. What batch size are you using?
@amyeroberts In the example above, I wasn't specifying it, but I've tried running with a batch size of 1 before and saw the same results. The training/evaluation data I used above is a whole 1.2 MB / 11 kB when stored on disk in a text file, so I suspect this isn't a data size issue.
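A back-of-the-envelope count (these numbers are not from the thread) suggests why the batch size barely moves the needle: under plain data parallelism each GPU holds the full weights, gradients, and Adam moments, and for a roughly 7.2B-parameter model kept entirely in 16-bit precision that alone is already

```math
M_{\text{per GPU}} \approx \underbrace{2N}_{\text{weights}} + \underbrace{2N}_{\text{gradients}} + \underbrace{2N + 2N}_{\text{Adam } m,\,v} = 8N \approx 8 \times 7.2\times 10^{9}\ \text{bytes} \approx 58\ \text{GB},
```

which lines up with the ~55 GB that accelerate's estimator reports for float16 training with Adam. Activations scale with the batch size and come on top of that, and if the run keeps fp32 master weights and fp32 Adam states (the usual mixed-precision setup), the static cost is closer to 16 bytes per parameter, around 115 GB, which would explain out-of-memory failures even on 80 GB cards unless the states are sharded or offloaded.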
Thanks for the feedback! I think keeping these two topics (language modeling and distributed training) in separate docs is better. It sounds like the issue is more about setting up for distributed training, and not so much language modeling. But we can improve on the distributed training docs with an example use case featuring language modeling. |
@stevhliu That sounds perfectly reasonable. If you need someone to test out a revised training document to ensure that it works, I'd be happy to help!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
+1 |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Thanks @stevhliu ! |
Thanks for your patience! Working on redesigning the docs right now at #31757 and I'll update the distributed training docs when I reach it 🙂 |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Bump! |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
System Info
transformers version: 4.41.2

Who can help?
@muellerzr @stevhliu

Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
n/a
Expected behavior
The run_clm.py script and other related scripts in https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling notionally support training / fine-tuning of models whose gradients are too large to fit on a single GPU, if you believe their CLI. However, there is no example showing how to actually do that.

For instance, accelerate estimate-memory says training the Mistral-7B family with Adam takes roughly 55 GB with float16, which is more memory than a single 40 GB A100 has, so I'd need to use more than one GPU. Would it be possible to modify the language_modeling documentation to explain how to do that?
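For reference, the rough figure quoted above can be reproduced with accelerate's estimator; a command along these lines does it (exact output and rounding vary by accelerate version):

```bash
# Estimate memory for loading and Adam training of Mistral-7B at float16.
accelerate estimate-memory mistralai/Mistral-7B-v0.1 --library_name transformers --dtypes float16
```

The training-with-Adam estimate is roughly four times the model's size in the chosen dtype, so on 40 GB cards a 7B-class model needs its parameters and optimizer state sharded across several GPUs (FSDP or DeepSpeed) or offloaded.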