Support resuming of deepspeed + Lora + offloading #29015

Closed
wants to merge 5 commits into from

Conversation

thepowerfuldeez

This PR is an upstream version of @kazemf78's PR to support resuming LoRA training when using DeepSpeed.
Without load_module_strict=False as the default, the checkpoint is not loaded, because the LoRA checkpoint does not contain all of the model's weights; DeepSpeed resume then fails with Error(s) in loading state_dict for PeftModelForCausalLM.

Related discussion: huggingface/peft#746
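
For context, a minimal sketch of the failure and the workaround, assuming an already-initialized DeepSpeed engine that wraps a PeftModelForCausalLM (the helper name and checkpoint path below are placeholders, not code from this PR):

# Minimal sketch, not code from this PR: `engine` is assumed to be an
# initialized deepspeed.DeepSpeedEngine wrapping a PeftModelForCausalLM.
import deepspeed

def resume_lora_checkpoint(engine: "deepspeed.DeepSpeedEngine", ckpt_dir: str):
    # With the default load_module_strict=True, load_checkpoint() raises
    #   Error(s) in loading state_dict for PeftModelForCausalLM
    # because the saved module state only contains the LoRA adapter weights.
    # Relaxing strictness skips the missing (frozen) base-model keys while
    # still restoring optimizer and lr-scheduler state.
    return engine.load_checkpoint(ckpt_dir, load_module_strict=False)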

@amyeroberts
Collaborator

cc @pacman100 @younesbelkada

Contributor

@younesbelkada left a comment


Thanks, in principle this looks good. It might not be ideal to change the default value of load_module_strict on the public method, though, as other users might be relying on it.
On the DeepSpeed side, I'll let @pacman100 comment on the PR 🙏

@@ -414,7 +414,7 @@ def deepspeed_init(trainer, num_training_steps, inference=False):
     return optimizer, lr_scheduler


-def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=True):
+def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=False):
Contributor


can we somehow revert this and just force-set it to True in our trainer?

Author


That's a fair point! I'll push a change now
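
For illustration, a rough sketch of that call-site approach: keep load_module_strict=True as the helper's public default and relax it inside the Trainer only when the wrapped model is a PEFT adapter (the isinstance check below is an assumption for illustration, not the exact change pushed here):

# Hypothetical Trainer call site; deepspeed_load_checkpoint keeps its
# load_module_strict=True default and strictness is decided per call.
from peft import PeftModel

if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    is_peft_model = isinstance(self.model, PeftModel)
    deepspeed_load_checkpoint(
        self.model_wrapped,
        resume_from_checkpoint,
        load_module_strict=not is_peft_model,  # relaxed only for adapters
    )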

@kazemf78

Could you please provide any updates on this PR?

@younesbelkada
Contributor

Sure @thepowerfuldeez!
@pacman100 is currently working on fixing issues with respect to DeepSpeed and providing working scripts that you can run out of the box: huggingface/peft#1489. We'll review this PR ASAP with Sourab!

@pacman100
Contributor

Hello, this has already been fixed in #28746. I ran experiments today and can confirm that resuming training with PEFT + DeepSpeed works.
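
For reference, resuming in that setup goes through the usual Trainer entry point; the checkpoint path below is just a placeholder:

# Resume a PEFT + DeepSpeed run from a previously saved checkpoint directory
# (path is a placeholder; `trainer` is an already-configured transformers.Trainer).
trainer.train(resume_from_checkpoint="output/checkpoint-500")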


# deepspeed ckpt loading
if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint, load_module_strict=False)
Contributor


This is already happening a couple of lines above, at line 1732.

if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint, load_module_strict=False)
    if self.args.deepspeed_force_lr_scheduler_checkpointing and self.model_wrapped.lr_scheduler is None:
        if os.path.isfile(os.path.join(resume_from_checkpoint, SCHEDULER_NAME)):
Contributor


Loading the scheduler is handled in _load_optimizer_and_scheduler, which is a couple of lines below.

@@ -1316,6 +1316,18 @@ class TrainingArguments:
            "help": "Activates neftune noise embeddings into the model. NEFTune has been proven to drastically improve model performances for instrcution fine-tuning. Check out the original paper here: https://arxiv.org/abs/2310.05914 and the original code here: https://github.com/neelsjain/NEFTune. Only supported for `PreTrainedModel` and `PeftModel` classes."
        },
    )

    deepspeed_force_lr_scheduler_checkpointing: bool = field(
Contributor


This isn't required: the Trainer already saves the scheduler when it isn't part of the DeepSpeed engine. Below is a screenshot of a checkpoint saved with the scheduler file.

[Screenshot: checkpoint directory listing showing the saved scheduler file]

if self.is_deepspeed_enabled:
    # under zero3 model file itself doesn't get saved since it's bogus! Unless deepspeed
    # config `stage3_gather_16bit_weights_on_model_save` is True
    self.model_wrapped.save_checkpoint(staging_output_dir)
Contributor


This already happens in self.save_model(staging_output_dir, _internal_call=True).


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ambroser53

I'd like to bump this: I'm running into this issue in my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough to merge into main, or is a solution still to be found?

@thepowerfuldeez
Author

Hi @ambroser53! I haven't tested this branch against upstream, but it should work.

@pacman100
Contributor

> I'd like to bump this: I'm running into this issue in my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough to merge into main, or is a solution still to be found?

Please see my comments above; the transformers PR #28746 should already fix this.
