Support resuming of deepspeed + Lora + offloading #29015

Closed
wants to merge 5 commits into from

Conversation

thepowerfuldeez

This PR is an upstream version of @kazemf78's PR to support resuming LoRA training when using DeepSpeed.
Without load_module_strict=False as the default, the checkpoint is not loaded, because the LoRA checkpoint does not contain all of the model's weights; DeepSpeed resume then fails with Error(s) in loading state_dict for PeftModelForCausalLM.

Related discussion: huggingface/peft#746
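
For context, a minimal sketch of the failure and the workaround, assuming an already-initialized DeepSpeed engine that wraps a PeftModelForCausalLM (the helper name and checkpoint path below are placeholders, not code from this PR):

# Minimal sketch, not code from this PR: `engine` is assumed to be an
# initialized deepspeed.DeepSpeedEngine wrapping a PeftModelForCausalLM.
import deepspeed

def resume_lora_checkpoint(engine: "deepspeed.DeepSpeedEngine", ckpt_dir: str):
    # With the default load_module_strict=True, load_checkpoint() raises
    #   Error(s) in loading state_dict for PeftModelForCausalLM
    # because the saved module state only contains the LoRA adapter weights.
    # Relaxing strictness skips the missing (frozen) base-model keys while
    # still restoring optimizer and lr-scheduler state.
    return engine.load_checkpoint(ckpt_dir, load_module_strict=False)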

@amyeroberts
Collaborator

cc @pacman100 @younesbelkada

Contributor

@younesbelkada left a comment


Thanks, in principle this looks good. It might not be ideal to change the default value of load_module_strict on the public method, though, as other users might be relying on it.
On the DeepSpeed side, I'll let @pacman100 comment on the PR 🙏

@@ -414,7 +414,7 @@ def deepspeed_init(trainer, num_training_steps, inference=False):
     return optimizer, lr_scheduler


-def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=True):
+def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=False):
Contributor


can we somehow revert this and just force-set it to True in our trainer?

Author


That's a fair point! I'll push a change now
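
For illustration, a rough sketch of that call-site approach: keep load_module_strict=True as the helper's public default and relax it inside the Trainer only when the wrapped model is a PEFT adapter (the isinstance check below is an assumption for illustration, not the exact change pushed here):

# Hypothetical Trainer call site; deepspeed_load_checkpoint keeps its
# load_module_strict=True default and strictness is decided per call.
from peft import PeftModel

if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    is_peft_model = isinstance(self.model, PeftModel)
    deepspeed_load_checkpoint(
        self.model_wrapped,
        resume_from_checkpoint,
        load_module_strict=not is_peft_model,  # relaxed only for adapters
    )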

@kazemf78

Could you please provide any updates on this PR?

@younesbelkada
Contributor

Sure @thepowerfuldeez!
@pacman100 is currently working on fixing issues with respect to DeepSpeed and providing working scripts that you can run out of the box: huggingface/peft#1489. We'll review this PR ASAP with Sourab!

@pacman100
Contributor

Hello, this has already been fixed in #28746. I ran experiments today and can confirm that resuming training with PEFT + DeepSpeed works.
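
For reference, resuming in that setup goes through the usual Trainer entry point; the checkpoint path below is just a placeholder:

# Resume a PEFT + DeepSpeed run from a previously saved checkpoint directory
# (path is a placeholder; `trainer` is an already-configured transformers.Trainer).
trainer.train(resume_from_checkpoint="output/checkpoint-500")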


# deepspeed ckpt loading
if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint, load_module_strict=False)
Contributor


This is already happening a couple of lines above, at line 1732.

if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint, load_module_strict=False)
    if self.args.deepspeed_force_lr_scheduler_checkpointing and self.model_wrapped.lr_scheduler is None:
        if os.path.isfile(os.path.join(resume_from_checkpoint, SCHEDULER_NAME)):
Contributor


Loading the scheduler is handled in _load_optimizer_and_scheduler, which is a couple of lines below.

@@ -1316,6 +1316,18 @@ class TrainingArguments:
            "help": "Activates neftune noise embeddings into the model. NEFTune has been proven to drastically improve model performances for instrcution fine-tuning. Check out the original paper here: https://arxiv.org/abs/2310.05914 and the original code here: https://github.com/neelsjain/NEFTune. Only supported for `PreTrainedModel` and `PeftModel` classes."
        },
    )

    deepspeed_force_lr_scheduler_checkpointing: bool = field(
Contributor


This isn't required: the Trainer already saves the scheduler when it isn't part of the DeepSpeed engine. Below is a screenshot of a checkpoint saved with the scheduler file.

[Screenshot: checkpoint directory listing showing the saved scheduler file]

if self.is_deepspeed_enabled:
    # under zero3 model file itself doesn't get saved since it's bogus! Unless deepspeed
    # config `stage3_gather_16bit_weights_on_model_save` is True
    self.model_wrapped.save_checkpoint(staging_output_dir)
Contributor


This already happens in self.save_model(staging_output_dir, _internal_call=True).


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ambroser53

I'd like to bump this: I'm running into this issue in my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough to merge into main, or is a solution still to be found?

@thepowerfuldeez
Author

Hi @ambroser53! I haven't tested this branch against upstream, but it should work.

@pacman100
Contributor

> I'd like to bump this: I'm running into this issue in my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough to merge into main, or is a solution still to be found?

Please see my comments above; the transformers PR #28746 should already fix this.
