Handling of "auto" in deepspeed config causes crash under Zero3 #2154
Labels: 🐛 bug, 🚀 deepspeed, 🏋 DPO, 🙋 help from community wanted
Reproduction
This issue was reported in the hf transformers repo initially here:
huggingface/transformers#29348
I can probably put together a fix for trl when I have some more free time if y'all are interested, since I understand the behaviour now.
Current Behaviour
The base Hugging Face Trainer calls hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps) to change the values of total_num_steps and warmup_num_steps from "auto" to their calculated values inside the inner training loop (once the total number of steps is known). However, in DPOTrainer, if total_num_steps is set to "auto", the trainer crashes when deepspeed.initialize is called while wrapping the ref model at self.ref_model = self._prepare_deepspeed(self.ref_model).
DS config
Script
Crash log
[2024-10-02 01:18:15,497] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 121.5 GB, percent = 12.1%
[2024-10-02 01:18:15,497] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3489, in <module>
[rank0]: main()
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
[rank0]: globals = debugger.run(setup['file'], None, None, is_module)
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2510, in run
[rank0]: return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
[rank0]: globals = pydevd_runpy.run_path(file, globals, '__main__')
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
[rank0]: return _run_module_code(code, init_globals, run_name,
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
[rank0]: _run_code(code, mod_globals, init_globals,
[rank0]: File "/home/b3schnei/.vscode-server/extensions/ms-python.debugpy-2024.10.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/b3schnei/transformers_debug/debug/29348/reproduce.py", line 34, in <module>
[rank0]: dpo_trainer = DPOTrainer(
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
[rank0]: return f(*args, **kwargs)
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 883, in __init__
[rank0]: self.ref_model = self._prepare_deepspeed(self.ref_model)
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 924, in _prepare_deepspeed
[rank0]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
[rank0]: self._configure_lr_scheduler(lr_scheduler)
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 907, in _configure_lr_scheduler
[rank0]: lr_scheduler = self._scheduler_from_config(self.optimizer)
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 962, in _scheduler_from_config
[rank0]: instantiated_scheduler = scheduler(optimizer, **scheduler_params)
[rank0]: File "/home/b3schnei/anaconda3/envs/test_transformers/lib/python3.10/site-packages/deepspeed/runtime/lr_schedules.py", line 758, in __init__
[rank0]: if self.total_num_steps < self.warmup_num_steps:
[rank0]: TypeError: '<' not supported between instances of 'str' and 'int'
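For anyone skimming the traceback, the final frame boils down to a comparison between the unresolved string "auto" and an integer. A minimal sketch that reproduces the same TypeError without DeepSpeed is below. The scheduler_params values and the check_steps helper are hypothetical stand-ins for illustration, not the actual DeepSpeed code.

```python
# Hypothetical scheduler params, as they might look if "auto" is never resolved.
scheduler_params = {"total_num_steps": "auto", "warmup_num_steps": 0}


def check_steps(total_num_steps, warmup_num_steps):
    # Same kind of comparison as the one in the last traceback frame.
    return total_num_steps < warmup_num_steps


try:
    check_steps(**scheduler_params)
    error_message = None
except TypeError as exc:
    # Comparing a str with an int is not defined in Python 3.
    error_message = str(exc)

print(error_message)
```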
Expected behavior
I expect DPOTrainer to initialize under Zero3 when ds_config values are set to "auto", as the transformers Trainer does.
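One possible direction, sketched here under assumptions rather than as the actual trl fix: resolve the "auto" scheduler values on a copy of the config before it is handed to deepspeed.initialize for the ref model. The helper name resolve_auto_scheduler_steps, the example config, and the step counts are all hypothetical. The sketch also assumes the scheduler params live under config["scheduler"]["params"].

```python
import copy


def resolve_auto_scheduler_steps(ds_config, num_training_steps, num_warmup_steps):
    """Return a copy of ds_config with "auto" scheduler step values replaced.

    Hypothetical helper: only total_num_steps and warmup_num_steps are handled.
    """
    config = copy.deepcopy(ds_config)  # leave the caller's config untouched
    params = config.get("scheduler", {}).get("params", {})
    if params.get("total_num_steps") == "auto":
        params["total_num_steps"] = num_training_steps
    if params.get("warmup_num_steps") == "auto":
        params["warmup_num_steps"] = num_warmup_steps
    return config


# Illustrative config fragment; not the config from this report.
example_config = {
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {"total_num_steps": "auto", "warmup_num_steps": "auto"},
    }
}

resolved = resolve_auto_scheduler_steps(
    example_config, num_training_steps=1000, num_warmup_steps=100
)
print(resolved["scheduler"]["params"])
```

The original dict is left unmodified, so the main model's config could still go through the usual trainer_config_finalize path.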