
Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323

Closed
2 of 4 tasks
csiefer2 opened this issue Jun 7, 2024 · 16 comments

Comments


csiefer2 commented Jun 7, 2024

System Info

  • transformers version: 4.41.2
  • Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
  • Python version: 3.9.18
  • Huggingface_hub version: 0.23.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@muellerzr @stevhliu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

n/a

Expected behavior

The run_clm.py and other related scripts in:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling

notionally support training / fine-tuning of models whose gradients are too large to fit on a single GPU, if you believe their CLI. However, there is no example showing how to actually do that.

For instance, accelerate estimate-memory says training the Mistral-7B family with Adam takes roughly 55 GB with float16, which is more memory than a single 40GB A100 has. So I'd need to use more than one GPU.
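For reference, that estimate comes from accelerate's estimator CLI; a minimal sketch of the invocation, assuming a recent accelerate release (the --library_name and --dtypes options may vary by version), would be roughly:

accelerate estimate-memory mistralai/Mistral-7B-Instruct-v0.2 --library_name transformers --dtypes float16

The command prints per-dtype totals for inference and for training with Adam, which is presumably where the ~55 GB figure comes from.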

Would it be possible to modify the language_modeling documentation to explain how to do that?

csiefer2 changed the title from "Langurage modeling examples do not show how to do multi-gpu training / fine-tuning" to "Language modeling examples do not show how to do multi-gpu training / fine-tuning" Jun 7, 2024
@amyeroberts
Collaborator

Hi @csiefer2, thanks for opening this issue!

@muellerzr and @stevhliu are best placed to comment on this in general.

In the meantime, you can find some accelerate docs on distributed training here: https://huggingface.co/docs/transformers/en/accelerate#distributed-training-with--accelerate
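For readers following along, the linked page amounts to configuring accelerate once and then launching the unmodified script with accelerate launch instead of python; a minimal sketch (file names taken from the commands later in this thread):

accelerate config
accelerate launch ./run_clm.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --train_file myfile1.txt --validation_file myfile2.txt \
  --do_train --do_eval --output_dir mydir

accelerate config writes a default_config.yaml that records how many GPUs to use and which distributed backend (DDP, FSDP, DeepSpeed) to apply.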


csiefer2 commented Jun 7, 2024

"Just launch the scripts with accelerate launch or torchrun, no need to do anything else"

My attempts to do that have not been successful... run_clm seems happy to fill up the memory of however many GPUs I tell it to use and then die when it finally exceeds the memory limits. That's why I was asking the question :)


csiefer2 commented Jun 7, 2024

For instance, with the v4.41-release branch of transformers from GitHub, if I grab 4 A100s with 80 GB of memory each and do this:

torchrun --nproc-per-node 4 ./run_clm.py --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 --train_file=myfile1.txt  --validation_file=myfile2.txt --do_train --do_eval --output_dir=mydir --report_to none

run_clm.py runs itself out of memory... with a model whose gradients accelerate tells me should fit on 1-2 GPUs (depending on whether I'm using float32 or float16).

I get errors like:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 3 has a total capacity of 79.15 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 79.04 GiB memory in use. Of the allocated memory 77.23 GiB is allocated by PyTorch, and 348.16 MiB is reserved by PyTorch but unallocated. 

Clearly I'm doing something wrong.
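A likely factor in the error above: launching run_clm.py through plain torchrun runs it with DistributedDataParallel, which keeps a full copy of the weights, gradients, and Adam state on every GPU, so the estimate is per device rather than split across the four cards; in float32 that is roughly 29 GB (weights) + 29 GB (gradients) + 58 GB (Adam moments) ≈ 116 GB for a 7-billion-parameter model before any activations. One way to actually shard those tensors is the Trainer's FSDP options; the following is a hedged sketch, not a verified recipe, and the exact flag values (in particular the auto-wrap class) depend on the transformers version:

torchrun --nproc-per-node 4 ./run_clm.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --train_file myfile1.txt --validation_file myfile2.txt \
  --do_train --do_eval --output_dir mydir --report_to none \
  --per_device_train_batch_size 1 --gradient_checkpointing --bf16 \
  --fsdp "full_shard auto_wrap" \
  --fsdp_transformer_layer_cls_to_wrap MistralDecoderLayer

With full sharding the parameter, gradient, and optimizer memory is divided across the four ranks instead of being replicated on each of them, which is what the DDP default does.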

@csiefer2
Author

@amyeroberts That example doesn't use the trainer.train() function, which is what I'd (ideally) like to use.

@amyeroberts
Collaborator

@csiefer2 I can't comment on the memory calculation from accelerate (cc @muellerzr here), but I'm assuming this is just for the weights of the model + gradients on the forward/backward pass? You'll also need to account for the memory requirements of loading the data onto the GPU. What batch size are you using?

@csiefer2
Author

@amyeroberts In the example above, I wasn't specifying it, but I've tried running with a batch size of 1 before and saw the same results. The training/evaluation data I used above is a whole 1.2 MB / 11 kB when stored on disk as text files, so I suspect this isn't a data size issue.
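Also worth noting for anyone reproducing this: unless overridden, run_clm.py loads the model in float32, packs text into blocks of up to 1024 tokens, and uses a per-device batch size of 8, so the memory-trimming options have to be passed explicitly. A hedged example using the script's standard flags (values illustrative):

torchrun --nproc-per-node 4 ./run_clm.py \
  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
  --train_file myfile1.txt --validation_file myfile2.txt \
  --do_train --do_eval --output_dir mydir --report_to none \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 8 \
  --block_size 512 --gradient_checkpointing --bf16

This cuts activation memory, but under plain DDP the float32 weights and optimizer state are still replicated on every rank, so for a 7B model on 80 GB cards some form of sharding (FSDP or DeepSpeed, as sketched above) is still needed.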

@stevhliu
Member

Thanks for the feedback!

I think keeping these two topics (language modeling and distributed training) in separate docs is better. It sounds like the issue is more about setting up distributed training, and not so much language modeling. But we can improve the distributed training docs with an example use case featuring language modeling.

@csiefer2
Author

@stevhliu That sounds perfectly reasonable. If you need someone to test out a revised training document to ensure that it works, I'd be happy to help!


github-actions bot commented Jul 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@csiefer2
Author

+1


github-actions bot commented Aug 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stevhliu reopened this Aug 12, 2024
@csiefer2
Author

Thanks @stevhliu !

@stevhliu
Member

Thanks for your patience! Working on redesigning the docs right now at #31757 and I'll update the distributed training docs when I reach it 🙂

huggingface deleted a comment from github-actions bot Sep 13, 2024

github-actions bot commented Oct 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


csiefer2 commented Oct 8, 2024

Bump!


github-actions bot commented Nov 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
