
Fix step shifting when accumulate gradient #33673

Merged

Conversation

kibitzing
Contributor

@kibitzing kibitzing commented Sep 24, 2024

What does this PR do?

Fixes #33671

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr and @SunMarc

@SunMarc SunMarc self-requested a review September 26, 2024 01:38

Member

@SunMarc SunMarc left a comment

Thanks for the detailed issue @kibitzing! Really nice job explaining what's happening! This looks like the right fix for gradient accumulation. In the past, we had the same behavior as this PR but without updating at the end of the epoch. See this PR for more information. Since we do update after the end of each epoch, it makes sense to not use total_batched_samples anymore. I see that in accelerate, we only turn on gradient sync when we are at the end of the dataloader or when (step + 1) % args.gradient_accumulation_steps == 0. Potentially, we can even remove the condition that we placed in transformers.
From

                if (
                    (step + 1) % args.gradient_accumulation_steps == 0
                    or
                    # last step in epoch but step is always smaller than gradient_accumulation_steps
                    is_last_step_and_steps_less_than_grad_acc
                ):
                    # the `or` condition of `is_last_step_and_steps_less_than_grad_acc` is not covered
                    # in accelerate. So, explicitly enable sync gradients to True in that case.
                    if is_last_step_and_steps_less_than_grad_acc:
                        self.accelerator.gradient_state._set_sync_gradients(True)

to simply

if self.accelerator.sync_gradients:

Can you check if we have the same behavior? cc @muellerzr
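
For reference, a standalone sketch (toy code, not accelerate's actual source) of the sync rule described above, assuming gradient_accumulation_steps = 4 and 7 batches in the epoch:

def should_sync(step: int, gradient_accumulation_steps: int, end_of_dataloader: bool) -> bool:
    # Mirror of the described rule: sync at the end of the dataloader or every N-th batch.
    return end_of_dataloader or (step + 1) % gradient_accumulation_steps == 0

# With 7 batches and gradient_accumulation_steps=4, sync happens on batches 4 and 7.
assert [should_sync(s, 4, s == 6) for s in range(7)] == [False, False, False, True, False, False, True]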

@kibitzing
Contributor Author

kibitzing commented Sep 26, 2024

Hello @SunMarc ,

Thank you for taking the time to review my PR! I appreciate your suggestion; it's a great idea and would indeed simplify the code.

However, after investigating and experimenting with this, I found that self.gradient_state.end_of_dataloader is always False in my case, preventing the step of self.accelerator from being reset.
This happens because:

  • self.gradient_state.end_of_dataloader depends on self.gradient_state.active_dataloader
  • and active_dataloader is generally set to None, unless explicitly configured otherwise.

While it's been interesting to dig deeper, I believe this issue might be outside the scope of the current PR. Hence, I suggest we stick with the previous approach for now.

Additionally, during my testing, I found a bug in the condition is_last_step_and_steps_less_than_grad_acc. It should use or instead of and for the expected behavior. If and is used, the last partial step is always dropped whenever steps_in_epoch is larger than args.gradient_accumulation_steps, which is highly likely to happen. I just removed the part that compares steps_in_epoch and gradient_accumulation_steps, because updating at the last step covers that case as well. I've fixed this in the PR because I believe it is directly related to gradient accumulation.


Also, it is not modified now, but to use self.accelerator.sync_gradients we should also remove

grad_acc_kwargs["sync_with_dataloader"] = False

because the _do_sync function checks this as well.
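
To make the or-vs-and point above concrete, here is a minimal sketch (hypothetical numbers, not Trainer code): with steps_in_epoch = 7 and gradient_accumulation_steps = 4, the and form never forces a sync on the last step of the epoch, while the or form would.

steps_in_epoch = 7
gradient_accumulation_steps = 4

for step in range(steps_in_epoch):
    is_last_step = (step + 1) == steps_in_epoch
    with_and = steps_in_epoch <= gradient_accumulation_steps and is_last_step
    with_or = steps_in_epoch <= gradient_accumulation_steps or is_last_step
    if is_last_step:
        # Prints: forced sync with `and` = False, with `or` = True
        print(f"forced sync with `and` = {with_and}, with `or` = {with_or}")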

@kibitzing
Contributor Author

kibitzing commented Sep 27, 2024

Regarding the failing test (run_swag_no_trainer.py), I checked that it does not use the Trainer, which is what I fixed.

Do you have any suggestions or comments on it?

@SunMarc
Member

SunMarc commented Sep 27, 2024

After reading the code a bit more, I don't think we are performing a gradient update at the end of each epoch, no (see your example)? I say that because, as you saw, the condition uses and in is_last_step_and_steps_less_than_grad_acc, and we have set grad_acc_kwargs["sync_with_dataloader"] = False. So, I feel like we are doing the right number of updates across epochs. The only issue is that there are indeed not N sub-steps between the on_step_begin and on_step_end callbacks.

Epoch 1

Steps 1, 2, 3, 4 (update parameters because 4 % 4 = 0)
Steps 5, 6, 7 (update because it's the last step)

Epoch 2

Step 8 (update parameters because 8 % 4 = 0)
Steps 9, 10, 11, 12 (update parameters because 12 % 4 = 0)
Steps 13, 14 (update because it's the last step)

Epoch 3

Steps 15, 16 (update parameters because 16 % 4 = 0)
Steps 17, 18, 19, 20 (update parameters because 20 % 4 = 0)
Step 21 (update because it's the last step)
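
As a standalone sketch of the walkthrough above (a toy simulation, not Trainer code), with 3 epochs of 7 batches and gradient_accumulation_steps = 4, comparing the cross-epoch counter total_batched_samples with the per-epoch step proposed in this PR:

grad_acc, batches_per_epoch, epochs = 4, 7, 3

total_batched_samples = 0
for epoch in range(1, epochs + 1):
    for step in range(batches_per_epoch):
        total_batched_samples += 1
        # old: counter carried across epochs; new: per-epoch counter
        old_update = total_batched_samples % grad_acc == 0 or (step + 1) == batches_per_epoch
        new_update = (step + 1) % grad_acc == 0 or (step + 1) == batches_per_epoch
        if old_update or new_update:
            print(f"epoch {epoch}, batch {step + 1}: old={old_update}, new={new_update}")

# With total_batched_samples the update positions shift each epoch
# (batches 4, 7 | 1, 5, 7 | 2, 6, 7), while the per-epoch step keeps them at batches 4 and 7.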

However, after investigating and experimenting with this, I found that self.gradient_state.end_of_dataloader is always False in my case, preventing the step of self.accelerator from being reset.
This happens because:

self.gradient_state.end_of_dataloader depends on self.gradient_state.active_dataloader
end_of_dataloader depends on in_dataloader
in_dataloader depends on active_dataloader
and active_dataloader is generally set to None, unless explicitly configured otherwise.

Thanks for exploring! Could you share a minimal reproducer? This might indeed require a fix on the accelerate side.

Regarding the failing test (run_swag_no_trainer.py), I checked that it does not use the Trainer, which is what I fixed.

This is probably a flaky test!

Do you have any suggestions or comments on it?

@kibitzing
Contributor Author

kibitzing commented Sep 27, 2024

Thank you for the reply, @SunMarc.
I may have misinterpreted the code initially. I agree that there's no issue with the update logic. I will go ahead and revert my changes to keep the condition is_last_step_and_steps_less_than_grad_acc.

Thanks for exploring! Could you share a minimal reproducer? This might indeed require a fix on the accelerate side.

Sure, I will!

@SunMarc
Member

SunMarc commented Sep 27, 2024

Thank you for the reply, @SunMarc.
I may have misinterpreted the code initially. I agree that there's no issue with the update logic. I will go ahead and revert my changes to keep the condition is_last_step_and_steps_less_than_grad_acc.

I mean that the original issue you had (too many updates) might not exist

@kibitzing
Contributor Author

Yes, you are right, that issue does not exist.
I thought it always updated at the last step, but it doesn't, because it is generally blocked by the steps_less_than_grad_acc part of the condition.
Sorry for the confusion.

@SunMarc
Member

SunMarc commented Sep 27, 2024

Not an issue! I was confused as well. Still, I will discuss with @muellerzr whether it makes sense to switch back to your idea of updating at each last step, just like it is coded in accelerate. I'll keep you updated.

Now, the issue is why self.gradient_state.end_of_dataloader doesn't work as expected. Feel free to open an issue on the accelerate library with the reproducer! Thanks a lot!

@kibitzing kibitzing force-pushed the fix-step-shifting-when-accum-grad branch from e4f9d89 to e4cc360 Compare September 27, 2024 14:34
@kibitzing
Contributor Author

Okay, I will create a new issue regarding self.gradient_state.end_of_dataloader and wait for updates on this PR.
Thanks!

@kibitzing
Contributor Author

Hello @SunMarc,

I revisited the issue I previously reported regarding accelerate and, embarrassingly, it turned out to be a problem with my own codebase. I was using accelerate together with a custom Trainer but had overridden get_train_dataloader without calling accelerator.prepare(dataloader).

To explain the flow in more detail:

  1. active_dataloader is set to None at first.
  2. When prepare(dataloader) is called, it goes into prepare_data_loader, which returns either a DataLoaderShard or a DataLoaderDispatcher, both inheriting from DataLoaderStateMixin, as the new dataloader.
  3. These prepared dataloaders set the active_dataloader at every begin().
  • Without calling prepare, it is clear that the GradientState doesn't have an active dataloader.

Therefore, there is no issue with the accelerate code itself. Everything works correctly when prepare is called, and I confirmed that at the last batch, self.gradient_state.end_of_dataloader=True is set as expected. I also checked this with the pytorch run_glue.py example.

Thus, as you suggested, using if self.accelerator.sync_gradients: is perfectly fine and results in simpler code.

I’m going to change the current condition to this one and push the update.
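
For completeness, here is a minimal standalone sketch (toy model, data, and sizes; not the Trainer code) of the flow above: once the dataloader goes through accelerator.prepare(), accelerate's GradientState tracks it, so end_of_dataloader and sync_gradients behave as expected on the last batch.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(14, 8), torch.randn(14, 1))
dataloader = DataLoader(dataset, batch_size=2)  # 7 batches per epoch

# prepare() wraps the dataloader (DataLoaderShard/DataLoaderDispatcher), which
# registers itself as the active_dataloader of the shared GradientState.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, (inputs, targets) in enumerate(dataloader):
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()       # skipped by the prepared optimizer while still accumulating
        optimizer.zero_grad()
    if accelerator.gradient_state.end_of_dataloader:
        # True on the last batch only because the dataloader was prepared;
        # with a raw DataLoader this never becomes True.
        print(f"last batch {step + 1}: sync_gradients={accelerator.sync_gradients}")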

@kibitzing
Contributor Author

Hello, I have updated the if condition as per @SunMarc's suggestion.
Do we have any updates regarding the update at each last step?

@kibitzing
Contributor Author

@SunMarc
Is there any update on this PR?
If any additional work is needed, please let me know!

Contributor

@muellerzr muellerzr left a comment

Nice! I feel this is a great simplification all around :)

@muellerzr muellerzr requested a review from SunMarc October 24, 2024 13:39
@muellerzr
Contributor

(Last bit is resolving the conflicts)

Member

@SunMarc SunMarc left a comment

Can you fix the merge conflicts? We did a lot of modifications with respect to grad acc and it makes sense to have this now! Feel free to ask us any questions you have!

Comment on lines 2375 to 2462

is_last_step_and_steps_less_than_grad_acc = (
    steps_in_epoch <= args.gradient_accumulation_steps and (step + 1) == steps_in_epoch
)

if (
    total_batched_samples % args.gradient_accumulation_steps == 0
    or
    # last step in epoch but step is always smaller than gradient_accumulation_steps
    is_last_step_and_steps_less_than_grad_acc
):
    # the `or` condition of `is_last_step_and_steps_less_than_grad_acc` is not covered
    # in accelerate. So, explicitly enable sync gradients to True in that case.
    if is_last_step_and_steps_less_than_grad_acc:
        self.accelerator.gradient_state._set_sync_gradients(True)

Member

We decided to perform the grad acc update at the end of the dataloader, but in the Trainer we will set self.accelerator.gradient_state._set_sync_gradients ourselves and not rely on the values set by accelerate's accumulate.

Contributor Author

@kibitzing kibitzing Oct 25, 2024

Okay, then I'll stick with the do_sync_step in the main branch code, and just replace the total_batched_samples with step.

Member

That's right!

@@ -4786,8 +4769,6 @@ def create_accelerator_and_postprocess(self):
# take the gradient_accumulation_steps setting from TrainingArguments.
grad_acc_kwargs["num_steps"] = self.args.gradient_accumulation_steps

grad_acc_kwargs["sync_with_dataloader"] = False
Contributor Author

@kibitzing kibitzing Oct 25, 2024

If I understand correctly, since we're setting self.accelerator.gradient_state._set_sync_gradients ourselves in the Trainer, would it be safer to keep it set to False?

Member

Yeah, let's keep it set to False!
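
For context, a small sketch (names taken from the diff above, num_steps value is a placeholder) of what keeping this kwarg means on the accelerate side: the plugin is created with sync_with_dataloader=False, so accelerate does not force a sync at the end of the dataloader by itself and the Trainer stays in control.

from accelerate.utils import GradientAccumulationPlugin

grad_acc_kwargs = {"num_steps": 4, "sync_with_dataloader": False}
# With sync_with_dataloader=False, the end-of-dataloader case is handled by the
# Trainer's own call to gradient_state._set_sync_gradients, not by accelerate.
plugin = GradientAccumulationPlugin(**grad_acc_kwargs)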

@kibitzing
Contributor Author

kibitzing commented Oct 25, 2024

Hello, @SunMarc @muellerzr

Here is a summary of the new commits:

  1. I merged the main branch and resolved conflicts.
  2. I replaced total_batched_samples with (step + 1) and removed the steps_in_epoch <= args.gradient_accumulation_steps condition to simplify, as we discussed before:

I see that in accelerate, we only turn on gradient sync when we are at the end of the dataloader or when (step + 1) % args.gradient_accumulation_steps == 0. Potentially, we can even remove the condition that we placed in transformers.

  3. I reverted the previous changes and set grad_acc_kwargs["sync_with_dataloader"] = False, as @SunMarc mentioned:

we will set self.accelerator.gradient_state._set_sync_gradients ourselves and not rely on the values set by accelerate's accumulate

If there are any issues or additional modifications needed, please let me know!

Member

@SunMarc SunMarc left a comment

A lot better! Thanks for your patience, @kibitzing. If you are happy with the changes, feel free to merge the PR, @muellerzr.

@muellerzr
Contributor

Failing tests come from main; we are in limbo 🫠

@muellerzr
Contributor

cc @ydshieh

@kibitzing
Contributor Author

Hello @muellerzr, I merged the main branch and it looks like it passed the tests! 😄

Member

@LysandreJik LysandreJik left a comment

Nice!

@muellerzr muellerzr merged commit dca93ca into huggingface:main Oct 31, 2024
24 checks passed
2015aroras pushed a commit to 2015aroras/transformers that referenced this pull request Nov 15, 2024
* replace total_batched_samples with step while counting grad accum step

* remove unused variable

* simplify condition for update step

* fix format by ruff

* simplify update step condition using accelerator.sync_gradients

* simplify update condition using do_sync_step

* remove print for test

---------

Co-authored-by: Zach Mueller <[email protected]>
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
Successfully merging this pull request may close these issues.

Step shifting using total_batched_samples for gradient_accumulation_steps counting