Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323

csiefer2 · 2024-06-07T18:49:35Z

System Info

transformers version: 4.41.2
Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
Python version: 3.9.18
Huggingface_hub version: 0.23.3
Safetensors version: 0.4.2
Accelerate version: 0.31.0
Accelerate config: not found
PyTorch version (GPU?): 2.2.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed

Who can help?

@muellerz @stevhliu

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

n/a

Expected behavior

The run_clm.py and other related scripts in:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling

notionally support training / fine-tuning of models whose gradients are too large to fit on a single GPU, if you believe their CLI. However there is no example showing how to actually do that.

For instance, accelerate estimate-memory says training the Mistral-7B family with Adam takes roughly 55 GB with float16, which is more memory than a single 40GB A100 has. So I'd need to use more than one GPU.

Would it be possible to modify the language_modeling documentation to explain how to do that?

The text was updated successfully, but these errors were encountered:

amyeroberts · 2024-06-07T19:03:20Z

Hi @csiefer2, thanks for opening this issue!

@muellerzr and @stevhliu are best placed to comment on this in general.

In the meantime, you can find some accelerate docs on distributed training here: https://huggingface.co/docs/transformers/en/accelerate#distributed-training-with--accelerate

csiefer2 · 2024-06-07T20:03:13Z

Just launch the scripts with accelerate launch or torchrun, no need to do anything else

My attempts to do that have not been successful... run_clm seems happy to fill up the memory of however many GPUs I tell it to use and then die when it finally exceeds the memory limits. That's why I was asking the question :)

csiefer2 · 2024-06-07T21:26:29Z

For instance, with the v4.41-release branch of transformers off of github, if I grab 4 A100s with 80GB of RAM each and do this:

torchrun --nproc-per-node 4 ./run_clm.py --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 --train_file=myfile1.txt  --validation_file=myfile2.txt --do_train --do_eval --output_dir=mydir --report_to none

run_clm.py runs itself out of memory... with a model that accelerate tells me I could have fit the gradients on 1-2 GPUs (depending on whether I'm float32 or float16).

I get errors like:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 3 has a total capacity of 79.15 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 79.04 GiB memory in use. Of the allocated memory 77.23 GiB is allocated by PyTorch, and 348.16 MiB is reserved by PyTorch but unallocated.

Clearly I'm doing something wrong.

csiefer2 · 2024-06-10T17:03:58Z

@amyeroberts That example doesn't use the trainer.train() function, which is what I'd (ideally) like to use.

amyeroberts · 2024-06-10T17:08:35Z

@csiefer2 I can't comment on the memory calculation from accelerate (cc @muellerzr here) but I'm assuming it this is just for the weights of the model + gradient on the forward/backward pass? You'll also need to account for the memory requirements of loading the data onto the GPU. What batch size are you using?

csiefer2 · 2024-06-10T17:12:59Z

@amyeroberts In the example above, I wasn't specifying it, but I've tried running with a batch size of 1 before and saw the same results. The training/evaluation data I used above is a whole 1.2M / 11k when stored on disk in a text file, so I suspect this isn't a data size issue.

stevhliu · 2024-06-10T20:12:59Z

Thanks for the feedback!

I think keeping these two topics (language modeling and distributed training) in separate docs is better. It sounds like the issue is more about setting up for distributed training, and not so much language modeling. But we can improve on the distributed training docs with an example use case featuring language modeling.

csiefer2 · 2024-06-10T20:17:00Z

@stevhliu That sounds perfectly reasonable. If you need someone to test out a revised training document to ensure that it works, I'd be a happy to help!

github-actions · 2024-07-08T08:03:09Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

csiefer2 · 2024-07-10T15:04:12Z

+1

github-actions · 2024-08-04T08:04:40Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

csiefer2 · 2024-08-13T15:31:28Z

Thanks @stevhliu !

stevhliu · 2024-08-13T16:23:11Z

Thanks for your patience! Working on redesigning the docs right now at #31757 and I'll update the distributed training docs when I reach it 🙂

github-actions · 2024-10-08T08:09:40Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

csiefer2 · 2024-10-08T21:02:48Z

Bump!

github-actions · 2024-11-02T08:10:40Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

csiefer2 changed the title ~~Langurage modeling examples do not show how to do multi-gpu training / fine-tuning~~ Language modeling examples do not show how to do multi-gpu training / fine-tuning Jun 7, 2024

github-actions bot closed this as completed Aug 12, 2024

stevhliu reopened this Aug 12, 2024

huggingface deleted a comment from github-actions bot Sep 13, 2024

Rocketknight1 added the Documentation label Nov 4, 2024

github-actions bot closed this as completed Nov 13, 2024

stevhliu reopened this Nov 13, 2024

github-actions bot closed this as completed Nov 22, 2024

stevhliu reopened this Nov 23, 2024

github-actions bot closed this as completed Dec 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323

Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323

csiefer2 commented Jun 7, 2024 •

edited

Loading

amyeroberts commented Jun 7, 2024

csiefer2 commented Jun 7, 2024

csiefer2 commented Jun 7, 2024 •

edited

Loading

csiefer2 commented Jun 10, 2024

amyeroberts commented Jun 10, 2024

csiefer2 commented Jun 10, 2024

stevhliu commented Jun 10, 2024

csiefer2 commented Jun 10, 2024

github-actions bot commented Jul 8, 2024

csiefer2 commented Jul 10, 2024

github-actions bot commented Aug 4, 2024

csiefer2 commented Aug 13, 2024

stevhliu commented Aug 13, 2024

github-actions bot commented Oct 8, 2024

csiefer2 commented Oct 8, 2024

github-actions bot commented Nov 2, 2024

Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323

Language modeling examples do not show how to do multi-gpu training / fine-tuning #31323

Comments

csiefer2 commented Jun 7, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

amyeroberts commented Jun 7, 2024

csiefer2 commented Jun 7, 2024

csiefer2 commented Jun 7, 2024 • edited Loading

csiefer2 commented Jun 10, 2024

amyeroberts commented Jun 10, 2024

csiefer2 commented Jun 10, 2024

stevhliu commented Jun 10, 2024

csiefer2 commented Jun 10, 2024

github-actions bot commented Jul 8, 2024

csiefer2 commented Jul 10, 2024

github-actions bot commented Aug 4, 2024

csiefer2 commented Aug 13, 2024

stevhliu commented Aug 13, 2024

github-actions bot commented Oct 8, 2024

csiefer2 commented Oct 8, 2024

github-actions bot commented Nov 2, 2024

csiefer2 commented Jun 7, 2024 •

edited

Loading

csiefer2 commented Jun 7, 2024 •

edited

Loading