Stage3: Use new torch grad accumulation hooks API #6773

Open
wants to merge 13 commits into
base: master

deepcharm
Contributor

@deepcharm deepcharm commented Nov 21, 2024

  • This commit addresses DeepSpeed issue #6718.
  • The existing code uses a hook on the grad_acc node to reduce parameter gradients.
    Constructs such as param.data = replicated_tensor.data used in allgather_params(..)
    are compiled into param.set(), so the hook assigned to the grad_acc node is never called.
  • Starting with PyTorch 2.1, there is a new and robust hook API on the parameter itself: param.register_post_accumulate_grad_hook(..)
  • This commit uses the appropriate API depending on the PyTorch version.
  • It also disables compile for PyTorch versions < 2.1.
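
For illustration, the version-dependent choice between the two hook styles could look roughly like the sketch below. This is not the code from this PR: the helper names `register_grad_ready_hook` and `register_legacy_grad_acc_hook` and the attribute `_grad_acc_node` are hypothetical, and the sketch uses feature detection (`hasattr`) as a stand-in for an explicit PyTorch version check.

```python
import torch


def register_legacy_grad_acc_hook(param, on_grad_ready):
    """Older style: hook the AccumulateGrad node behind the parameter."""
    # expand_as() creates a throwaway view whose grad_fn points at the
    # parameter's AccumulateGrad node.
    grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]
    handle = grad_acc.register_hook(lambda *unused: on_grad_ready(param))
    # Keep a reference to the node (hypothetical attribute name).
    param._grad_acc_node = grad_acc
    return handle


def register_grad_ready_hook(param, on_grad_ready):
    """Call on_grad_ready(param) after param.grad has been accumulated."""
    if hasattr(param, "register_post_accumulate_grad_hook"):
        # Available from PyTorch 2.1: the hook lives on the parameter itself.
        return param.register_post_accumulate_grad_hook(on_grad_ready)
    return register_legacy_grad_acc_hook(param, on_grad_ready)
```

Both paths invoke the callback with the parameter once its `.grad` is populated, so the reduction logic itself does not need to know which API was used.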

* This commit addresses an issue reported in:
  microsoft#6718
* The existing code uses a hook on the grad_acc node to reduce parameter gradients.
  Constructs such as param.data = replicated_tensor.data used in
  allgather_params(..) are compiled into param.set(), so the hook assigned
  to the grad_acc node is never called.
* This is a known torch issue: pytorch/pytorch#139742.
* The above caused accuracy issues and could be worked around temporarily by
  disabling torch compile when activation checkpointing is used.
* This commit provides a clean solution by replacing the hook on the grad_acc node
  with a hook that uses a new and robust API on the parameter itself:
  param.register_post_accumulate_grad_hook(..)
@deepcharm deepcharm requested a review from tjruwase as a code owner November 21, 2024 14:57
@deepcharm deepcharm requested a review from awan-10 as a code owner December 15, 2024 12:43
@tjruwase tjruwase removed the request for review from awan-10 December 18, 2024 11:45
@deepcharm
Contributor Author

I can see other pending PRs are consistently failing on the unit-tests.
@loadams is that a known issue?

@loadams
Contributor

loadams commented Dec 26, 2024

> I can see other pending PRs are consistently failing on the unit-tests. @loadams is that a known issue?

@deepcharm - this is a known issue we are working on fixing.
