
Introduce configured_state arg for accelerator_config #29781

Merged: 16 commits merged into main on May 20, 2024

Conversation

@muellerzr (Contributor)

What does this PR do?

This PR starts to allow users to bring in their own Accelerator. To start, users can define a PartialState or AcceleratorState outside of TrainingArguments and then have TrainingArguments reuse that state via a new use_configured_state arg.

For instance, a user can now do:

from accelerate import PartialState
from transformers import TrainingArguments

state = PartialState()
args = TrainingArguments(
    output_dir="tmp_trainer",  # required argument; placeholder value
    accelerator_config={"use_configured_state": True},
)

This will reuse the state that was already defined rather than creating a new one.

These states are singletons, so once a state has been defined, instantiating it again returns the same underlying state on subsequent calls.
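
A minimal sketch of that singleton behaviour, for illustration only (the variable names are made up; only PartialState comes from accelerate):

from accelerate import PartialState

state_a = PartialState()
state_b = PartialState()  # does not build a new state; it shares the one created above
assert state_a.device == state_b.device  # both instances read the same shared state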

This may lead to issues with hyperparameter tuning, which requires the state to be reset between runs, as noted in the documentation for the new argument.
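
For context, a hedged sketch of what resetting the state between tuning runs could look like; _reset_state is a private accelerate helper, so treat this as an assumption about the API rather than something this PR adds:

from accelerate.state import AcceleratorState, PartialState

# Clear the shared singleton state between trials (private API, use with care)
AcceleratorState._reset_state()
PartialState._reset_state()

# Re-create a fresh state for the next trial
state = PartialState()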

Fixes a related accelerate issue that occurs when an Accelerator is defined before the TrainingArguments: huggingface/accelerate#2564

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts @pacman100

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pacman100 (Contributor) left a comment


Thank you for adding use_configured_state to control whether or not to reset the AcceleratorState when creating the TrainingArguments object!

@muellerzr requested a review from ArthurZucker on March 25, 2024 15:52
@amyeroberts (Collaborator) left a comment


Thanks for adding this!

Could we run a check with a hyperparameter search (just the most common settings) to see whether this has any effect?

Review thread on src/transformers/training_args.py (outdated, resolved)
Review thread on src/transformers/training_args.py (outdated, resolved)
@muellerzr (Contributor, Author)

cc @amyeroberts

@amyeroberts (Collaborator) left a comment


Thanks for iterating on this!

I have a question about the state specification, but I suspect this is more to do with my knowledge of how PartialState is used in accelerate.

Have you experimented at all with hyperparameter searches to know whether this affects things?

Review thread on src/transformers/training_args.py (outdated, resolved)
Review thread on tests/trainer/test_trainer.py (resolved)
Review thread on src/transformers/training_args.py (outdated, resolved)
@amyeroberts (Collaborator) left a comment


Thanks for adding this! Looks good to me.

I'd like a second review and approval from @pacman100 before merging, as he's more familiar with the state handling and can catch anything I might have missed.

@pacman100 (Contributor) left a comment


Thank you @muellerzr for adding this!

@muellerzr force-pushed the muellerzr-reset-state branch from 51cd21c to 47fafa1 on April 25, 2024 13:06
@muellerzr requested a review from amyeroberts on April 25, 2024 15:13
@muellerzr (Contributor, Author)

@amyeroberts requesting a re-review after having to gut much of it on further testing 😅

@amyeroberts (Collaborator)

@pacman100 Can you do a re-review?

@pacman100 (Contributor) left a comment


Thank you, @muellerzr, for iterating on this. Left a couple of comments.

Review thread on src/transformers/trainer_pt_utils.py (resolved)
@@ -1615,6 +1619,39 @@ def __post_init__(self):
        if version.parse(version.parse(torch.__version__).base_version) == version.parse("2.0.0") and self.fp16:
            raise ValueError("--optim adamw_torch_fused with --fp16 requires PyTorch>2.0")

        # We need to setup the accelerator config here *before* the first call to `self.device`
Contributor:

ah, makes sense as the PartialState is usually created in the first call to self.device itself.
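
A hedged sketch of the ordering being discussed (output_dir is just a placeholder; this only illustrates that the state is created lazily on the first device access):

from transformers import TrainingArguments

args = TrainingArguments(output_dir="tmp_trainer")  # placeholder output dir
device = args.device  # the first access here is what triggers creation of the PartialState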

Review thread on src/transformers/training_args.py (outdated, resolved)
@ArthurZucker removed their request for review on April 30, 2024 07:47
@muellerzr force-pushed the muellerzr-reset-state branch from 063fdd3 to 257e47a on May 6, 2024 14:52
@muellerzr requested a review from pacman100 on May 6, 2024 15:07
"`AcceleratorState` or `PartialState` to be defined before calling `TrainingArguments`. "
)
# We rely on `PartialState` to yell if there's issues here (which it will)
self.distributed_state = PartialState(cpu=self.use_cpu)
Contributor:

This still doesn't account for the case where the user passed --fsdp but hasn't enabled it via the PartialState. In general, my comment was about the mismatch between the training arguments and the PartialState the user has set up.

Contributor (Author):

If we just have a PartialState, you can still initialize FSDP later, as that's done at the AcceleratorState level. I'll do a quick test to verify, but the PartialState doesn't care about FSDP.

Contributor (Author):

(The PartialState only initializes the distributed environment; things like the FSDP or DeepSpeed plugin are handled later, though DeepSpeed will still be called/activated if the environment is set properly with the PartialState.)
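
As a hedged illustration of that split (FullyShardedDataParallelPlugin and the fsdp_plugin argument are assumed accelerate APIs, not part of this PR, and actually running FSDP still requires a distributed launch):

from accelerate import Accelerator, FullyShardedDataParallelPlugin, PartialState

state = PartialState()  # only initializes the distributed environment
fsdp_plugin = FullyShardedDataParallelPlugin()  # FSDP configuration is defined separately
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)  # the plugin attaches here, after the PartialState exists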


Contributor:

Understood, thank you Zach for the clarifications.

@pacman100 (Contributor) left a comment


Hello Zach, I still think this has some gaps. Please refer to my comment.

@amyeroberts (Collaborator) left a comment


Thanks for adding this feature!

@muellerzr merged commit 92d1d97 into main on May 20, 2024 (20 checks passed)
@muellerzr deleted the muellerzr-reset-state branch on May 20, 2024 13:21
itazap pushed a commit that referenced this pull request May 24, 2024
* Introduce configured_state

* Include note on tuning

* Allow for users to have defined a state already

* Include tests

* Add note on hpam tune

* Guard a bit better

* Update src/transformers/training_args.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/training_args.py

Co-authored-by: amyeroberts <[email protected]>

* Finish rebase

* Finish rebase

* Guard carefully

* Fixup test

* Refactor

* Fin refactor

* Comment

* Update wrt feedback

---------

Co-authored-by: amyeroberts <[email protected]>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 11, 2024