
Introduce configured_state arg for accelerator_config #29781

Merged: 16 commits merged into main on May 20, 2024

Conversation

@muellerzr (Contributor)

What does this PR do?

This PR starts to allow users to bring in their own Accelerator. To start, users can define a PartialState or AcceleratorState outside of TrainingArguments and then have TrainingArguments reuse that state via a new use_configured_state arg.

For instance, a user can now do:

from accelerate import PartialState
from transformers import TrainingArguments

state = PartialState()
args = TrainingArguments(
    output_dir="tmp_trainer",  # required argument; placeholder value
    accelerator_config={"use_configured_state": True},
)

This will reuse the state that was already defined rather than creating a new one.

These states are singletons, so once a state has been defined, instantiating it again returns the same underlying state on subsequent calls.
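
A minimal sketch of that singleton behaviour, for illustration only (the variable names are made up; only PartialState comes from accelerate):

from accelerate import PartialState

state_a = PartialState()
state_b = PartialState()  # does not build a new state; it shares the one created above
assert state_a.device == state_b.device  # both instances read the same shared state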

This may lead to issues with hyperparameter tuning, which requires the state to be reset between runs, as noted in the documentation for the new argument.
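
For context, a hedged sketch of what resetting the state between tuning runs could look like; _reset_state is a private accelerate helper, so treat this as an assumption about the API rather than something this PR adds:

from accelerate.state import AcceleratorState, PartialState

# Clear the shared singleton state between trials (private API, use with care)
AcceleratorState._reset_state()
PartialState._reset_state()

# Re-create a fresh state for the next trial
state = PartialState()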

Fixes a related accelerate issue that occurs when an Accelerator is defined before the TrainingArguments: huggingface/accelerate#2564

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts @pacman100

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pacman100 (Contributor) left a comment


Thank you for adding use_configured_state to control whether or not to reset the AcceleratorState when creating the TrainingArguments object!

@muellerzr requested a review from ArthurZucker on March 25, 2024 15:52
@amyeroberts (Collaborator) left a comment


Thanks for adding this!

Could we run a check with a hyperparameter search (just the most common settings) to see whether this has any effect?

Review thread on src/transformers/training_args.py (outdated, resolved)
Review thread on src/transformers/training_args.py (outdated, resolved)
@muellerzr (Contributor, Author)

cc @amyeroberts

@amyeroberts (Collaborator) left a comment


Thanks for iterating on this!

I have a question about the state specification, but I suspect this is more to do with my knowledge of how PartialState is used in accelerate.

Have you experimented at all with hyperparameter searches to know whether this affects things?

Review thread on src/transformers/training_args.py (outdated, resolved)
Review thread on tests/trainer/test_trainer.py (resolved)
Review thread on src/transformers/training_args.py (outdated, resolved)
@amyeroberts (Collaborator) left a comment


Thanks for adding this! Looks good to me.

I'd like a second review and approval from @pacman100 before merging, as he's more familiar with the state handling and can catch anything I might have missed.

@pacman100 (Contributor) left a comment


Thank you @muellerzr for adding this!

@muellerzr force-pushed the muellerzr-reset-state branch from 51cd21c to 47fafa1 on April 25, 2024 13:06
@muellerzr requested a review from amyeroberts on April 25, 2024 15:13
@muellerzr (Contributor, Author)

@amyeroberts requesting a re-review after having to gut much of it on further testing 😅

@amyeroberts (Collaborator)

@pacman100 Can you do a re-review?

@pacman100 (Contributor) left a comment


Thank you, @muellerzr, for iterating on this. Left a couple of comments.

Review thread on src/transformers/trainer_pt_utils.py (resolved)
@@ -1615,6 +1619,39 @@ def __post_init__(self):
        if version.parse(version.parse(torch.__version__).base_version) == version.parse("2.0.0") and self.fp16:
            raise ValueError("--optim adamw_torch_fused with --fp16 requires PyTorch>2.0")

        # We need to setup the accelerator config here *before* the first call to `self.device`
Contributor:

ah, makes sense as the PartialState is usually created in the first call to self.device itself.
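
A hedged sketch of the ordering being discussed (output_dir is just a placeholder; this only illustrates that the state is created lazily on the first device access):

from transformers import TrainingArguments

args = TrainingArguments(output_dir="tmp_trainer")  # placeholder output dir
device = args.device  # the first access here is what triggers creation of the PartialState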

Review thread on src/transformers/training_args.py (outdated, resolved)
@ArthurZucker removed their request for review on April 30, 2024 07:47
@muellerzr force-pushed the muellerzr-reset-state branch from 063fdd3 to 257e47a on May 6, 2024 14:52
@muellerzr requested a review from pacman100 on May 6, 2024 15:07
"`AcceleratorState` or `PartialState` to be defined before calling `TrainingArguments`. "
)
# We rely on `PartialState` to yell if there's issues here (which it will)
self.distributed_state = PartialState(cpu=self.use_cpu)
Contributor:

This still doesn't account for the case where the user passed --fsdp but hasn't enabled it via the PartialState. In general, my comment was about the mismatch between the training arguments and the PartialState the user has set up.

Contributor (Author):

If we just have a PartialState, you can still initialize FSDP later, as that's done at the AcceleratorState level. I'll do a quick test to verify, but the PartialState doesn't care about FSDP.

Contributor (Author):

(The PartialState only initializes the distributed environment; things like the FSDP or DeepSpeed plugin are handled later, though DeepSpeed will still be called/activated if the environment is set properly with the PartialState.)
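
As a hedged illustration of that split (FullyShardedDataParallelPlugin and the fsdp_plugin argument are assumed accelerate APIs, not part of this PR, and actually running FSDP still requires a distributed launch):

from accelerate import Accelerator, FullyShardedDataParallelPlugin, PartialState

state = PartialState()  # only initializes the distributed environment
fsdp_plugin = FullyShardedDataParallelPlugin()  # FSDP configuration is defined separately
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)  # the plugin attaches here, after the PartialState exists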


Contributor:

Understood, thank you Zach for the clarifications.

@pacman100 (Contributor) left a comment


Hello Zach, I still think this has some gaps. Please refer to my comment.

@amyeroberts (Collaborator) left a comment


Thanks for adding this feature!

@muellerzr merged commit 92d1d97 into main on May 20, 2024 (20 checks passed)
@muellerzr deleted the muellerzr-reset-state branch on May 20, 2024 13:21
itazap pushed a commit that referenced this pull request May 24, 2024
* Introduce configured_state

* Include note on tuning

* Allow for users to have defined a state already

* Include tests

* Add note on hpam tune

* Guard a bit better

* Update src/transformers/training_args.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/training_args.py

Co-authored-by: amyeroberts <[email protected]>

* Finish rebase

* Finish rebase

* Guard carefully

* Fixup test

* Refactor

* Fin refactor

* Comment

* Update wrt feedback

---------

Co-authored-by: amyeroberts <[email protected]>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 11, 2024