
DPO trainer with DeepSpeed CPU-offload config causes error: AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config. #955

Closed
hanyullai opened this issue Nov 4, 2023 · 11 comments
Labels
🏋 DPO Related to DPO

Comments

@hanyullai

DPO alone, or SFT with CPU offload, runs without issues. However, when I use DPO together with DeepSpeed ZeRO stage 2 + CPU offload, I get the following error:

miniconda3/envs/dpo/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py, line 150, in step
    assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make "
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.

The framework I am using is LLaMA-Factory, unmodified, and the problem does not appear to be related to that repository's code. This leads me to believe it is an environment issue involving PyTorch, TRL, Transformers, or DeepSpeed. Debugging into the TRL library, it looks like DPOTrainer does not disable gradients on ref_model at initialization (even though all the gradients are 0), so an error is thrown in _prepare_deepspeed. In DeepSpeed, at this line, ref_model should hit the continue.
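For context, the usual way to avoid this class of problem is to make the reference model inference-only before any engine wraps it. A minimal sketch, assuming a PyTorch-style model exposing `.parameters()` and `.eval()` (the helper name is hypothetical, not TRL's API):

```python
def freeze_reference_model(ref_model):
    # The DPO reference model is never optimized, so its parameters should
    # not carry gradients; otherwise DeepSpeed treats them as trainable.
    for p in ref_model.parameters():
        p.requires_grad = False
    ref_model.eval()  # also disables dropout and similar training-only behavior
    return ref_model
```

If DPOTrainer performed this step before `_prepare_deepspeed`, the CPUAdam assertion would never see the reference model's parameters.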

The versions of my environment are as follows:
torch==2.1.0
transformers==4.33.2
trl==0.7.2
deepspeed==0.11.1

My ds_report is displayed below:
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

Thanks for helping!

@hanyullai
Author

Update:
I tried adding another condition here:

if p.grad is None or not p.grad.all():
    continue

It seems to avoid the error above, but I am not sure whether it is correct.
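The proposed guard can be reasoned about in isolation. A stand-in sketch of the skip condition (pure Python, no torch; `grad` modeled as a list of floats, with `.all()` read as "every element nonzero", which is what torch's `Tensor.all()` means):

```python
def should_skip_param(grad):
    # Mirrors `if p.grad is None or not p.grad.all(): continue`:
    # skip params with no gradient, or whose gradient contains ANY zero entry.
    return grad is None or not all(g != 0 for g in grad)
```

Note that skipping whenever any entry is zero is stricter than "the gradient is all zeros"; `not p.grad.any()` would be the all-zeros check, which may be what was intended here.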

@tuzeao-tal

I hit the exact same error. Meanwhile, if I turn off CPU offload, it throws another error:

File "/anaconda3/envs/ltool/lib/python3.10/site-packages/deepspeed/runtime/lr_schedules.py", line 661, in __init__
    self.warmup_num_steps = max(2, warmup_num_steps)
TypeError: '>' not supported between instances of 'str' and 'int'

This is strange, because a month ago my code ran smoothly. It seems some upgrade of transformers or torch caused an incompatibility between trl and the rest of the stack.

I will keep trying to fix this; if I find a solution, I will post it here.
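For what it's worth, that second traceback is consistent with an unresolved "auto" placeholder reaching DeepSpeed's scheduler: in Python 3, `max(2, "auto")` raises exactly this TypeError. A sketch of the resolution step that normally happens before the config is handed over (hypothetical helper; the default value is arbitrary):

```python
def resolve_warmup_num_steps(value, computed_default=100):
    # HF's Trainer normally substitutes "auto" placeholders in the DeepSpeed
    # config with computed values; if that pass is skipped, the raw string
    # reaches deepspeed.runtime.lr_schedules and max(2, "auto") blows up.
    if value == "auto":
        return computed_default
    return int(value)
```

So any path where trl builds the reference model's engine without running that substitution would reproduce the error.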

@ttzHome

ttzHome commented Nov 27, 2023

> (quoting @tuzeao-tal's comment above)

I met the same error....

@lvwerra lvwerra added the 🏋 DPO Related to DPO label Nov 29, 2023
@Emperorizzis

> (quoting @tuzeao-tal's comment above)

I met the same error too.

I found that the problem started with trl version 0.7.2, so we can try to get the version down to 0.7.1 temporarily.

Hopefully trl will resolve this issue soon.

@lvwerra
Member

lvwerra commented Dec 21, 2023

tagging @kashif, maybe you have an idea?

@arkapal3

arkapal3 commented Jan 8, 2024

In case it helps, I was able to fix this issue by doing the following in _prepare_deepspeed:

        if config_kwargs["zero_optimization"]["stage"] != 3:
            config_kwargs["zero_optimization"]["stage"] = 2
        config_kwargs.pop('scheduler', None)

with the `scheduler` pop taken from Emperorizzis's PR above.
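Putting that together, the workaround amounts to stripping optimizer and scheduler concerns from the config the reference model receives. A self-contained sketch of the dict surgery (the `offload_optimizer` pop is my addition, not part of the snippet above):

```python
def adjust_ref_model_ds_config(config_kwargs):
    # The DPO reference model is inference-only: force non-ZeRO-3 configs to
    # plain stage 2 and drop the scheduler entry, as suggested in this thread.
    zero = config_kwargs.setdefault("zero_optimization", {})
    if zero.get("stage") != 3:
        zero["stage"] = 2
        zero.pop("offload_optimizer", None)  # assumption: no optimizer -> no offload
    config_kwargs.pop("scheduler", None)  # ref model never steps an LR scheduler
    return config_kwargs
```

ZeRO-3 is left intact because parameter sharding still matters for fitting the reference model in memory; everything optimizer-related is irrelevant to a model that never takes a step.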


github-actions bot commented Feb 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@WJMacro

WJMacro commented Apr 26, 2024

Does this issue remain unsolved in the latest version? I ran into the same problem.

@pengwork

Same error here. Package versions:

deepspeed                        0.12.6+c00388a
torch                            2.1.2
torch-dct                        0.1.6
torchaudio                       2.1.2
torchvision                      0.16.2
transformers                     4.38.2
transformers-stream-generator    0.0.5
trl                              0.9.4

@underwoodnoble

underwoodnoble commented Jul 4, 2024

Do not set the LR scheduler in the DeepSpeed config; set it via the transformers TrainingArguments instead. This works for me.
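Concretely, that means the DeepSpeed config carries no `scheduler` block and the schedule lives entirely on the Trainer side. A sketch (field names are from transformers' TrainingArguments; the values are illustrative, not recommendations):

```python
# DeepSpeed config with NO "scheduler" section: the LR schedule is owned by
# the Trainer instead, so two schedulers never fight over the same run.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
}

# Schedule configured via TrainingArguments kwargs instead (illustrative):
training_args_kwargs = {
    "lr_scheduler_type": "cosine",
    "warmup_steps": 100,
    "learning_rate": 5e-6,
}

assert "scheduler" not in ds_config
```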

@vergilus

vergilus commented Oct 24, 2024

> (quoting the earlier comments from @tuzeao-tal and @Emperorizzis)

I ran into this issue because I set the lr_scheduler in the DPOConfig while the DeepSpeed config file used for that run had a different lr_scheduler setting. Removing one of the two should do the trick.
