CUDA OOM error with supposedly good enough specs according to memory stats output #2071
Comments
I'll take a look! Can you provide your full config definition, too?
Hey @sionhan, when dealing with large models + LoRA, two things will consume most of your memory: activations and model parameters. The optimizer state and gradients won't matter much because you are not finetuning many parameters.

To tackle the model parameters: QLoRA will be your best friend, because we quantize all the base weights. The only other flag that helps here is fsdp_cpu_offloading, like Joe shared, but it is slow.

To tackle the activations, the main things you can do are, like you said, activation checkpointing, activation offloading, and compile. If that doesn't work, your other knob is to reduce tokenizer.max_seq_len. Reducing the rank and the number of finetuned layers helps a bit, but I don't think it's worth going below rank=8. Make sure to set dataset.packed=True for higher tokens per second.
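For concreteness, here is a minimal sketch of what these knobs can look like as config overrides. The key names are assumptions based on recent torchtune distributed LoRA configs and may differ across versions and recipes:

```yaml
# Sketch of memory-saving overrides for a distributed LoRA config.
# Key names assumed from recent torchtune configs; check your recipe's
# default config for the exact spelling in your version.
model:
  lora_rank: 8               # going below rank 8 is rarely worth it
  quantize_base: True        # QLoRA: quantize the frozen base weights
                             # (or use the corresponding qlora_* model builder)

tokenizer:
  max_seq_len: 512           # shorter sequences -> smaller activations

dataset:
  packed: True               # pack samples for higher tokens/sec

enable_activation_checkpointing: True
enable_activation_offloading: True
compile: True
fsdp_cpu_offload: True       # last resort: saves memory but is slow
```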
Aha! I found one more possible way to do it. It turns out that in our default config we are not sharding the token embedding and the output projection, which are very large. If we add this, it should fit right under the 24 GiB threshold! Tokens/sec are still a little slower, but not too bad. I'll update our LoRA recipe accordingly (#2072), and then you should be able to install the nightlies or build from source to take advantage of this. Note: I ran these experiments with all the other tricks activated as well: activation offloading, compile, packed=True, max_seq_len=512. One caveat: it sits very close to the 24 GiB limit and I artificially constrained my memory to test this, so the only true test is whether it actually works on your 4090. LMK how it goes!
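For reference, the override described above could look like the snippet below. The key name is an assumption mirrored from torchtune's full-finetune configs; the LoRA recipe only picks it up with the linked PR:

```yaml
# Assumed key name, mirrored from torchtune's full-finetune configs:
# also shard the large token embedding and output projection across GPUs.
custom_sharded_layers: ['tok_embeddings', 'output']
```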
Hello!
I am currently trying to fine-tune a Llama 3.1 70B Nemotron Instruct LLM with LoRA by slightly tweaking the Llama 3.1 70B LoRA configs.
According to the memory stats reported by torchtune, peak usage should be around 18 GiB per GPU, which 4090s (24 GiB) should be able to handle with some room to spare for additional overhead, if I understood correctly. However, I still get a CUDA OOM error despite lowering every option in the finetuning recipe to its minimum VRAM requirement.
Finetuning config parameters
I copied the Llama 3.1 70B LoRA config and adapted it to the Nemotron HF model. The config mentions being runnable on 8 GPUs.
I used the original Llama 3.1 tokenizer.model, as I assumed the tokenizer was the same after reading the config.json of the Nemotron HF model.
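For illustration, a rough sketch of the kind of adaptation described above (paths and file lists are placeholders; component names follow torchtune 0.4 conventions and are assumptions, not the exact config used):

```yaml
# Rough sketch of the adapted config pieces (placeholder paths).
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/to/Llama-3.1-70B-Instruct/original/tokenizer.model

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /path/to/Llama-3.1-Nemotron-70B-Instruct-HF
  checkpoint_files: [...]    # the Nemotron HF safetensors shards
  model_type: LLAMA3
  output_dir: /path/to/output
```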
Environment
CUDA: 12.4
Torch: 2.5.1
Torchtune: 0.4.0
Specs: 8 x 4090 instance
RAM: 192 GiB
I set PYTORCH_CUDA_ALLOC_CONF to expandable_segments:True.
I also ran nvidia-smi to check GPU memory usage, and everything looked correct (load equally distributed, all GPUs with active processes).
Command run
tune run --nproc_per_node 8 lora_finetune_distributed --config ./my_config.yaml
Error message (basically a CUDA OOM error on every GPU)