DPO trainer with DeepSpeed CPU offload config causes error: AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config. #955
Comments
Update: changing the check to `if p.grad is None or not p.grad.all(): continue` seems to avoid the error above, but I don't know if this is correct. |
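For context, that check presumably sits inside `DeepSpeedCPUAdam.step` in `deepspeed/ops/adam/cpu_adam.py`, right before the assertion quoted in the error. A paraphrased fragment of that parameter loop (approximate, not the exact upstream source; `self` and `torch` belong to the surrounding module):

```python
# Paraphrased fragment of DeepSpeedCPUAdam.step's parameter loop (not the exact upstream code).
for group in self.param_groups:
    for p in group["params"]:
        # Upstream only checks `if p.grad is None`; the workaround above additionally
        # skips parameters whose gradient contains zeros, which catches the reference
        # model's all-zero gradients.
        if p.grad is None or not p.grad.all():
            continue
        # This is the assertion that fires in the reported error.
        assert p.device == torch.device("cpu"), (
            f"CPUAdam param is on {p.device} and must be 'cpu', make sure you "
            "enabled 'offload_optimizer': 'cpu' in your ZeRO config."
        )
        # ... optimizer update continues here ...
```

Skipping zero-gradient parameters in the optimizer papers over the symptom; arguably those parameters should never reach this loop at all.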
I met the exact same error, which is strange because a month ago my code worked smoothly. It seems some upgrade of transformers or torch caused an incompatibility between trl and the rest of the stack. I will keep trying to fix this, and if I find a solution I will post it here. |
I met the same error.... |
I met the same error too. I found that the problem started with trl version 0.7.2, so downgrading to 0.7.1 temporarily works around it. Hopefully trl will resolve this issue soon. |
tagging @kashif, maybe you have an idea? |
In case it helps, I was able to fix this issue with a change to _prepare_deepspeed, with the scheduler pop taken from Emperorizzis' PR above. |
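For reference, a rough sketch of what that kind of change looks like, assuming the `_prepare_deepspeed` layout from trl ~0.7.x; the structure is approximate and is not the commenter's exact patch:

```python
# Approximate sketch of a patched DPOTrainer._prepare_deepspeed (trl ~0.7.x layout assumed).
from copy import deepcopy
import deepspeed

def _prepare_deepspeed(self, model):
    deepspeed_plugin = self.accelerator.state.deepspeed_plugin
    config_kwargs = deepcopy(deepspeed_plugin.deepspeed_config)

    # The reference model is frozen and never optimized, so drop the optimizer- and
    # scheduler-related entries before deepspeed.initialize sees this config.
    config_kwargs.pop("optimizer", None)
    config_kwargs.pop("scheduler", None)  # the "scheduler pop" mentioned above

    # ZeRO-3 keeps parameter sharding for the reference model; otherwise disable ZeRO.
    if config_kwargs.get("zero_optimization", {}).get("stage") != 3:
        config_kwargs["zero_optimization"] = {"stage": 0}

    engine, *_ = deepspeed.initialize(model=model, config=config_kwargs)
    engine.eval()
    return engine
```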
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Does this issue remain unsolved in the latest version? I met the same problem. |
Same error with `deepspeed 0.12.6+c00388a` (from `pip list`). |
Do not set the lr scheduler in the deepspeed config; set it via transformers TrainingArguments instead. This works for me. |
I ran into this issue because I set lr_scheduler in the DPOConfig while also having a different lr_scheduler setting in the DeepSpeed config file used by that DPOConfig. Removing one of the two should do the trick.
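Concretely, keeping the scheduler definition in a single place looks something like this. The values are illustrative, and `DPOConfig` assumes a recent trl; older versions take the same fields through `transformers.TrainingArguments`:

```python
# Illustrative only: define the LR scheduler on the trainer side and keep the
# DeepSpeed config free of "optimizer"/"scheduler" blocks.
from trl import DPOConfig  # assumes a trl version that ships DPOConfig

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    # note: no "optimizer" and no "scheduler" entries here
}

training_args = DPOConfig(
    output_dir="dpo-out",            # placeholder path
    learning_rate=5e-7,
    lr_scheduler_type="cosine",      # the scheduler lives here, not in ds_config
    warmup_ratio=0.1,
    deepspeed=ds_config,             # TrainingArguments accepts a dict or a JSON path
)
```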
Running DPO alone, or using CPU offload with SFT separately, causes no issues. However, when using DPO in combination with DeepSpeed stage 2 + CPU offload, I encounter the following error:
```
File "miniconda3/envs/dpo/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 150, in step
    assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make "
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
```
The framework I am using is LLaMA-Factory with no modifications, and the problem appears to be unrelated to that repository's code. This leads me to believe it is an environment problem related to PyTorch, trl, transformers, or DeepSpeed. Debugging into the trl library, it seems that DPOTrainer does not disable gradients for ref_model when initializing it (even though all of its gradients are 0), causing the error to be thrown at _prepare_deepspeed. In DeepSpeed, at this line, ref_model's parameters should hit the continue branch.
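If the stray gradients on the ref_model are indeed the trigger, one thing worth trying is freezing the reference model explicitly before handing it to the trainer. A minimal sketch; the model name is a placeholder, and this targets the symptom described above rather than the underlying trl/DeepSpeed interaction:

```python
# Hypothetical workaround: explicitly freeze the reference model so no gradients
# are ever created for its parameters.
from transformers import AutoModelForCausalLM

ref_model = AutoModelForCausalLM.from_pretrained("my-base-model")  # placeholder name
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

# ref_model is then passed to DPOTrainer(model, ref_model=ref_model, ...) as usual.
```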
The versions of my environment are as follows:
torch==2.1.0
transformers==4.33.2
trl==0.7.2
deepspeed==0.11.1
My ds_report is displayed below:
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
Thanks for helping!