
DPO trainer with DeepSpeed CPU-offload config causes error: AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config. #955

Closed
hanyullai opened this issue Nov 4, 2023 · 11 comments
Labels
🏋 DPO Related to DPO

Comments

@hanyullai

DPO alone, or SFT with CPU offload, runs without issues. However, when I use DPO together with DeepSpeed ZeRO stage 2 + CPU offload, I get the following error:

miniconda3/envs/dpo/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py, line 150, in step
    assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make "
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.

The framework I am using is LLaMA-Factory, unmodified, and the problem does not appear to be related to that repository's code. This leads me to believe it is an environment issue involving PyTorch, TRL, Transformers, or DeepSpeed. Debugging into the TRL library, it looks like DPOTrainer does not disable gradients on ref_model at initialization (even though all the gradients are 0), so an error is thrown in _prepare_deepspeed. In DeepSpeed, at this line, ref_model should hit the continue.
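For context, the usual way to avoid this class of problem is to make the reference model inference-only before any engine wraps it. A minimal sketch, assuming a PyTorch-style model exposing `.parameters()` and `.eval()` (the helper name is hypothetical, not TRL's API):

```python
def freeze_reference_model(ref_model):
    # The DPO reference model is never optimized, so its parameters should
    # not carry gradients; otherwise DeepSpeed treats them as trainable.
    for p in ref_model.parameters():
        p.requires_grad = False
    ref_model.eval()  # also disables dropout and similar training-only behavior
    return ref_model
```

If DPOTrainer performed this step before `_prepare_deepspeed`, the CPUAdam assertion would never see the reference model's parameters.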

The versions of my environment are as follows:
torch==2.1.0
transformers==4.33.2
trl==0.7.2
deepspeed==0.11.1

My ds_report is displayed below:
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

Thanks for helping!

@hanyullai
Author

Update:
I tried adding another condition here:

if p.grad is None or not p.grad.all():
    continue

It seems to avoid the error above, but I am not sure whether it is correct.
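The proposed guard can be reasoned about in isolation. A stand-in sketch of the skip condition (pure Python, no torch; `grad` modeled as a list of floats, with `.all()` read as "every element nonzero", which is what torch's `Tensor.all()` means):

```python
def should_skip_param(grad):
    # Mirrors `if p.grad is None or not p.grad.all(): continue`:
    # skip params with no gradient, or whose gradient contains ANY zero entry.
    return grad is None or not all(g != 0 for g in grad)
```

Note that skipping whenever any entry is zero is stricter than "the gradient is all zeros"; `not p.grad.any()` would be the all-zeros check, which may be what was intended here.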

@tuzeao-tal

I hit the exact same error. Meanwhile, if I turn off CPU offload, it throws another error:

File "/anaconda3/envs/ltool/lib/python3.10/site-packages/deepspeed/runtime/lr_schedules.py", line 661, in __init__
    self.warmup_num_steps = max(2, warmup_num_steps)
TypeError: '>' not supported between instances of 'str' and 'int'

This is strange, because a month ago my code ran smoothly. It seems some upgrade of transformers or torch caused an incompatibility between trl and the rest of the stack.

I will keep trying to fix this; if I find a solution, I will post it here.
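For what it's worth, that second traceback is consistent with an unresolved "auto" placeholder reaching DeepSpeed's scheduler: in Python 3, `max(2, "auto")` raises exactly this TypeError. A sketch of the resolution step that normally happens before the config is handed over (hypothetical helper; the default value is arbitrary):

```python
def resolve_warmup_num_steps(value, computed_default=100):
    # HF's Trainer normally substitutes "auto" placeholders in the DeepSpeed
    # config with computed values; if that pass is skipped, the raw string
    # reaches deepspeed.runtime.lr_schedules and max(2, "auto") blows up.
    if value == "auto":
        return computed_default
    return int(value)
```

So any path where trl builds the reference model's engine without running that substitution would reproduce the error.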

@ttzHome

ttzHome commented Nov 27, 2023

> (quoting @tuzeao-tal's comment above)

I met the same error....

@lvwerra lvwerra added the 🏋 DPO Related to DPO label Nov 29, 2023
@Emperorizzis

> (quoting @tuzeao-tal's comment above)

I met the same error too.

I found that the problem started with trl version 0.7.2, so we can try to get the version down to 0.7.1 temporarily.

Hopefully trl will resolve this issue soon.

@lvwerra
Member

lvwerra commented Dec 21, 2023

tagging @kashif, maybe you have an idea?

@arkapal3

arkapal3 commented Jan 8, 2024

In case it helps, I was able to fix this issue by doing the following in _prepare_deepspeed:

        if config_kwargs["zero_optimization"]["stage"] != 3:
            config_kwargs["zero_optimization"]["stage"] = 2
        config_kwargs.pop('scheduler', None)

with the `scheduler` pop taken from Emperorizzis's PR above.
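Putting that together, the workaround amounts to stripping optimizer and scheduler concerns from the config the reference model receives. A self-contained sketch of the dict surgery (the `offload_optimizer` pop is my addition, not part of the snippet above):

```python
def adjust_ref_model_ds_config(config_kwargs):
    # The DPO reference model is inference-only: force non-ZeRO-3 configs to
    # plain stage 2 and drop the scheduler entry, as suggested in this thread.
    zero = config_kwargs.setdefault("zero_optimization", {})
    if zero.get("stage") != 3:
        zero["stage"] = 2
        zero.pop("offload_optimizer", None)  # assumption: no optimizer -> no offload
    config_kwargs.pop("scheduler", None)  # ref model never steps an LR scheduler
    return config_kwargs
```

ZeRO-3 is left intact because parameter sharding still matters for fitting the reference model in memory; everything optimizer-related is irrelevant to a model that never takes a step.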


github-actions bot commented Feb 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@WJMacro

WJMacro commented Apr 26, 2024

Does this issue remain unsolved in the latest version? I ran into the same problem.

@pengwork

Same error here. Package versions:

deepspeed                        0.12.6+c00388a
torch                            2.1.2
torch-dct                        0.1.6
torchaudio                       2.1.2
torchvision                      0.16.2
transformers                     4.38.2
transformers-stream-generator    0.0.5
trl                              0.9.4

@underwoodnoble

underwoodnoble commented Jul 4, 2024

Do not set the LR scheduler in the DeepSpeed config; set it via the transformers TrainingArguments instead. This works for me.
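Concretely, that means the DeepSpeed config carries no `scheduler` block and the schedule lives entirely on the Trainer side. A sketch (field names are from transformers' TrainingArguments; the values are illustrative, not recommendations):

```python
# DeepSpeed config with NO "scheduler" section: the LR schedule is owned by
# the Trainer instead, so two schedulers never fight over the same run.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
}

# Schedule configured via TrainingArguments kwargs instead (illustrative):
training_args_kwargs = {
    "lr_scheduler_type": "cosine",
    "warmup_steps": 100,
    "learning_rate": 5e-6,
}

assert "scheduler" not in ds_config
```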

@vergilus

vergilus commented Oct 24, 2024

> (quoting the earlier comments from @tuzeao-tal and @Emperorizzis)

I ran into this issue because I set the lr_scheduler in the DPOConfig while the DeepSpeed config file used for that run had a different lr_scheduler setting. Removing one of the two should do the trick.
