Early stopping patience does not work when resuming from checkpoint #28544
Comments
IMO, the problem can be generalised through the `TrainingArguments` parameter whose definition states: "...when this is true, you won't be able to resume training from checkpoint." As pointed out by @Ubadub here, the `Trainer`'s callbacks' states are not persisted along with the model, regardless of that setting. In short, the `Trainer`'s callbacks should be part of the `TrainerState` object. I can help with this implementation if this analysis seems reasonable. Cheers!
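A rough sketch of what this could look like (the `state()` / `load_state()` methods below are hypothetical names for illustration, not an existing transformers API): the callback exposes its internal counters as a dict, which the `Trainer` could write into `trainer_state.json` when checkpointing and feed back when resuming.

```python
from transformers import EarlyStoppingCallback


class StatefulEarlyStoppingCallback(EarlyStoppingCallback):
    """Illustration only: expose the callback's internal state as a plain dict."""

    def state(self) -> dict:
        # Everything needed to reconstruct the callback after a restart.
        return {
            "early_stopping_patience": self.early_stopping_patience,
            "early_stopping_threshold": self.early_stopping_threshold,
            "early_stopping_patience_counter": self.early_stopping_patience_counter,
        }

    def load_state(self, state: dict) -> None:
        self.early_stopping_patience_counter = state["early_stopping_patience_counter"]


# The Trainer could then store {type(cb).__name__: cb.state()} in TrainerState
# on save and call cb.load_state(...) on resume, so the patience counter
# survives a restart.
```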
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I forgot to reply to this in time, but I am confirming that this issue is still present and should not be closed. @muellerzr and @pacman100, could one of you reopen this issue? @tanmay17061's suggested solution makes a lot of sense to me. I can attempt to solve this myself, but likely not for a few more weeks at the earliest.
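In the meantime, one user-side workaround is to persist the counter yourself. A minimal sketch, assuming you always resume from the latest checkpoint in `output_dir` (the file name `early_stopping_state.json` is made up for this example, not a library convention):

```python
import json
import os

from transformers import EarlyStoppingCallback
from transformers.trainer_utils import get_last_checkpoint


class PersistentEarlyStoppingCallback(EarlyStoppingCallback):
    """EarlyStoppingCallback that saves and restores its patience counter."""

    STATE_FILE = "early_stopping_state.json"  # assumed name, not a library convention

    def on_save(self, args, state, control, **kwargs):
        # Trainer saves checkpoints under {output_dir}/checkpoint-{global_step}.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if state.is_world_process_zero and os.path.isdir(ckpt_dir):
            with open(os.path.join(ckpt_dir, self.STATE_FILE), "w") as f:
                json.dump(
                    {"early_stopping_patience_counter": self.early_stopping_patience_counter}, f
                )

    def on_train_begin(self, args, state, control, **kwargs):
        super().on_train_begin(args, state, control, **kwargs)
        # Restore the counter from the latest checkpoint, if one exists.
        last_ckpt = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
        if last_ckpt is not None:
            path = os.path.join(last_ckpt, self.STATE_FILE)
            if os.path.isfile(path):
                with open(path) as f:
                    self.early_stopping_patience_counter = json.load(f)[
                        "early_stopping_patience_counter"
                    ]
```

Passing this subclass to `Trainer(callbacks=[...])` in place of `EarlyStoppingCallback` keeps the patience counter across interrupted runs; it does not address the `should_training_stop` issue.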
Hi all, #29666 should solve the issue. I chose to write a relatively scalable way for us to save callback data inside the `TrainerState`. You can test it by installing from that PR branch. Thanks for your patience!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.33.3

Who can help?
@muellerzr and @pacman100

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Fundamentally, the issue is that the `early_stopping_patience_counter` is not persisted when checkpointing. Consequently, it is always (re)set to 0 when initializing `Trainer`, including when resuming from checkpoint. This means that if, for example, you never train your model for `early_stopping_patience`-many evaluation steps at once before stopping and resuming from checkpoint, early stopping will never happen.

An auxiliary issue is that even if you do train your model for longer than `early_stopping_patience`-many evaluation steps and training correctly stops, re-initiating training from a checkpoint will resume the run even though it ended with `self.control.should_training_stop == True`. This is because this variable is also not persisted to the `trainer_state.json` file when checkpointing. This issue was reported in #10290, but was never resolved before the issue was closed as stale.

To reproduce the main issue, initiate a training run with `early_stopping_patience` set to a value of your choice, then interrupt training before the run gets there. Reinitiate training with `resume_from_checkpoint=True`. Rinse and repeat until `best_metric` has failed to improve for `early_stopping_patience`-many evaluation calls; early stopping should fire at that point, but never does.

To reproduce the auxiliary issue, don't interrupt your run until it stops due to early stopping. When it is complete, reinitiate training with `resume_from_checkpoint=True`.
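For concreteness, a minimal sketch of that setup. `model`, `train_dataset`, and `eval_dataset` are placeholders assumed to be defined elsewhere; the arguments shown are the pieces relevant to early stopping and checkpointing.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                        # placeholder: your model
    args=args,
    train_dataset=train_dataset,        # placeholder: your datasets
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# First run: interrupt (e.g. Ctrl-C) before 3 consecutive evaluations without
# improvement have occurred.
trainer.train()

# Later run: resume from the last checkpoint. early_stopping_patience_counter
# is re-initialised to 0 because it is not stored in trainer_state.json, so
# the patience window silently restarts here.
trainer.train(resume_from_checkpoint=True)
```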
Expected behavior
Early stopping patience should work exactly the same when stopping and resuming runs from a checkpoint as when training continuously without interruption.