
Early stopping patience does not work when resuming from checkpoint #28544

Closed · 2 of 4 tasks
Ubadub opened this issue Jan 17, 2024 · 5 comments · Fixed by #29666
Comments

Ubadub (Contributor) commented Jan 17, 2024

System Info

  • transformers version: 4.33.3
  • Platform: Linux-4.18.0-348.23.1.el8_5.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.16.2
  • Safetensors version: 0.3.1
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0.post101 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@muellerzr and @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Fundamentally the issue is that the early_stopping_patience_counter is not persisted when checkpointing. Consequently, it is always (re)set to 0 when initializing Trainer, including when resuming from checkpoint. This means that if, for example, you never train your model for early_stopping_patience-many evaluation steps at once before stopping and resuming from checkpoint, early stopping will never happen.
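
For context, here is a simplified paraphrase of the patience logic (illustrative only, not the exact transformers source); the key point is that the counter is plain in-memory state on the callback instance, and nothing ever serializes it:

```python
import numpy as np
from transformers import TrainerCallback


class PatienceSketchCallback(TrainerCallback):
    """Paraphrase of EarlyStoppingCallback's counter handling (not the real code)."""

    def __init__(self, patience: int = 3, threshold: float = 0.0):
        self.patience = patience
        self.threshold = threshold
        # Lives only on this object; it is never written to trainer_state.json,
        # so a freshly constructed Trainer/callback always starts back at 0.
        self.patience_counter = 0

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        metric_name = args.metric_for_best_model
        if not metric_name:
            return
        if not metric_name.startswith("eval_"):
            metric_name = f"eval_{metric_name}"
        metric_value = (metrics or {}).get(metric_name)
        if metric_value is None:
            return
        better = np.greater if args.greater_is_better else np.less
        improved = state.best_metric is None or (
            better(metric_value, state.best_metric)
            and abs(metric_value - state.best_metric) > self.threshold
        )
        self.patience_counter = 0 if improved else self.patience_counter + 1
        if self.patience_counter >= self.patience:
            control.should_training_stop = True
```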

An auxiliary issue is that even if you train your model for longer than early_stopping_patience-many evaluation steps, and training correctly stops, if you happen to then re-initiate training from a checkpoint, training will resume even though the run ended with self.control.should_training_stop == True. This is because this variable is also not persisted to the trainer_state.json file when checkpointing. This issue was reported in #10290, but was never resolved before the issue was closed as stale.
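
One quick way to see both gaps (the checkpoint path below is a placeholder): dump the keys of trainer_state.json from any checkpoint and note that neither a patience counter nor should_training_stop appears among them.

```python
import json

# Placeholder path; point this at any checkpoint directory produced by Trainer.
with open("output_dir/checkpoint-500/trainer_state.json") as f:
    trainer_state = json.load(f)

# Keys such as best_metric, best_model_checkpoint, epoch, global_step and
# log_history are present, but nothing records the early-stopping patience
# counter or whether the run ended because should_training_stop was set.
print(sorted(trainer_state.keys()))
```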

To reproduce the main issue, simply initiate a training run with early_stopping_patience set to a value of your choice, then interrupt training before early stopping can trigger. Reinitiate training with resume_from_checkpoint=True. Rinse and repeat: even once the evaluation metric has failed to improve on best_metric for early_stopping_patience-many consecutive evaluation calls, training never stops, because the counter is reset to 0 on every resume. A minimal script is sketched below.
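
A minimal reproduction sketch (the model, dataset, and hyperparameters are arbitrary placeholders; the relevant parts are the EarlyStoppingCallback and resume_from_checkpoint):

```python
import os

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "sst2")  # placeholder dataset
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["sentence"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="output_dir",
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# Resume automatically if a previous (interrupted) run left a checkpoint behind.
has_checkpoint = os.path.isdir("output_dir") and any(
    d.startswith("checkpoint-") for d in os.listdir("output_dir")
)

# Run the script, interrupt it (Ctrl+C) before 3 consecutive non-improving
# evaluations have accumulated, then run it again: the patience counter starts
# over at 0 on every resume, so early stopping never fires.
trainer.train(resume_from_checkpoint=has_checkpoint)
```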

To reproduce the auxiliary issue, don't interrupt your run until it stops due to early stopping. When it is complete, reinitiate training with resume_from_checkpoint=True.

Expected behavior

Early stopping patience should work exactly the same when stopping and resuming runs from a checkpoint as when training continuously without interruption.

tanmay17061 (Contributor) commented Jan 17, 2024

IMO, the problem can be generalised through the TrainingArguments parameter save_only_model:

Its definition states: "...when this is true, you won't be able to resume training from checkpoint.".
As the current TrainerState implementation stands, we are not truly able to resume training from a checkpoint even on setting save_only_model=False.

As pointed out by @Ubadub here, Trainer's callbacks' states are not persisted along with the model (even when save_only_model=False). To fix this issue (and the auxiliary issue #10290 mentioned above), we need the ability to persist callback state and reload it when using resume_from_checkpoint.

In short, Trainer's callbacks should be a part of the TrainerState object.
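
As a rough illustration of the idea (every name below is invented for the sketch, not an existing transformers API): stateful callbacks could expose a small dict that the Trainer merges into trainer_state.json on save and feeds back into the callbacks on resume_from_checkpoint.

```python
import json


class StatefulCallbackMixin:
    """Hypothetical opt-in interface for callbacks that carry state."""

    def state(self) -> dict:
        raise NotImplementedError

    def load_state(self, payload: dict) -> None:
        raise NotImplementedError


def save_callback_states(trainer_state_path: str, callbacks: list) -> None:
    # Merge each stateful callback's dict into the serialized TrainerState.
    with open(trainer_state_path) as f:
        payload = json.load(f)
    payload["stateful_callbacks"] = {
        type(cb).__name__: cb.state()
        for cb in callbacks
        if isinstance(cb, StatefulCallbackMixin)
    }
    with open(trainer_state_path, "w") as f:
        json.dump(payload, f, indent=2)


def restore_callback_states(trainer_state_path: str, callbacks: list) -> None:
    # Hand each stateful callback its saved dict when resuming from checkpoint.
    with open(trainer_state_path) as f:
        payload = json.load(f)
    saved = payload.get("stateful_callbacks", {})
    for cb in callbacks:
        if isinstance(cb, StatefulCallbackMixin) and type(cb).__name__ in saved:
            cb.load_state(saved[type(cb).__name__])
```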

I can help in this implementation if this analysis seems reasonable.

Cheers!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Mar 7, 2024
Ubadub (Contributor, Author) commented Mar 14, 2024

I forgot to reply to this in time, but I am confirming that this issue is still present and should not be closed. @muellerzr and @pacman100, could one of you reopen this issue?

@tanmay17061's suggested solution makes a lot of sense to me. I can have a go at solving this myself, but likely not for a few more weeks at the earliest.

muellerzr (Contributor) commented

Hi all, #29666 should solve the issue. I chose to write a relatively scalable way for us to save callback data inside the TrainerState :)

You can test it with pip install git+https://github.com/huggingface/transformers@muellerzr-checkpoint-callbacks
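
For example, one way to sanity-check it (illustrative; this reuses the trainer object from the reproduction sketch earlier in the thread and only touches attributes the callback already has):

```python
from transformers import EarlyStoppingCallback

# Assumes `trainer` is built as in the reproduction sketch above and that a
# checkpoint with some non-improving evaluations already exists on disk.
trainer.train(resume_from_checkpoint=True)

for cb in trainer.callback_handler.callbacks:
    if isinstance(cb, EarlyStoppingCallback):
        # With the fix, this should continue from the value saved in the
        # checkpoint instead of restarting at 0.
        print("patience counter after resume:", cb.early_stopping_patience_counter)
```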

Thanks for your patience!

github-actions bot commented Apr 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
