
Early stopping patience does not work when resuming from checkpoint #28544

Closed · 2 of 4 tasks
Ubadub opened this issue Jan 17, 2024 · 5 comments · Fixed by #29666
Comments

Ubadub (Contributor) commented Jan 17, 2024

System Info

  • transformers version: 4.33.3
  • Platform: Linux-4.18.0-348.23.1.el8_5.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.16.2
  • Safetensors version: 0.3.1
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0.post101 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@muellerzr and @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Fundamentally the issue is that the early_stopping_patience_counter is not persisted when checkpointing. Consequently, it is always (re)set to 0 when initializing Trainer, including when resuming from checkpoint. This means that if, for example, you never train your model for early_stopping_patience-many evaluation steps at once before stopping and resuming from checkpoint, early stopping will never happen.
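
For context, here is a simplified paraphrase of the patience logic (illustrative only, not the exact transformers source); the key point is that the counter is plain in-memory state on the callback instance, and nothing ever serializes it:

```python
import numpy as np
from transformers import TrainerCallback


class PatienceSketchCallback(TrainerCallback):
    """Paraphrase of EarlyStoppingCallback's counter handling (not the real code)."""

    def __init__(self, patience: int = 3, threshold: float = 0.0):
        self.patience = patience
        self.threshold = threshold
        # Lives only on this object; it is never written to trainer_state.json,
        # so a freshly constructed Trainer/callback always starts back at 0.
        self.patience_counter = 0

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        metric_name = args.metric_for_best_model
        if not metric_name:
            return
        if not metric_name.startswith("eval_"):
            metric_name = f"eval_{metric_name}"
        metric_value = (metrics or {}).get(metric_name)
        if metric_value is None:
            return
        better = np.greater if args.greater_is_better else np.less
        improved = state.best_metric is None or (
            better(metric_value, state.best_metric)
            and abs(metric_value - state.best_metric) > self.threshold
        )
        self.patience_counter = 0 if improved else self.patience_counter + 1
        if self.patience_counter >= self.patience:
            control.should_training_stop = True
```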

An auxiliary issue is that even if you train your model for longer than early_stopping_patience-many evaluation steps, and training correctly stops, if you happen to then re-initiate training from a checkpoint, training will resume even though the run ended with self.control.should_training_stop == True. This is because this variable is also not persisted to the trainer_state.json file when checkpointing. This issue was reported in #10290, but was never resolved before the issue was closed as stale.
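
One quick way to see both gaps (the checkpoint path below is a placeholder): dump the keys of trainer_state.json from any checkpoint and note that neither a patience counter nor should_training_stop appears among them.

```python
import json

# Placeholder path; point this at any checkpoint directory produced by Trainer.
with open("output_dir/checkpoint-500/trainer_state.json") as f:
    trainer_state = json.load(f)

# Keys such as best_metric, best_model_checkpoint, epoch, global_step and
# log_history are present, but nothing records the early-stopping patience
# counter or whether the run ended because should_training_stop was set.
print(sorted(trainer_state.keys()))
```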

To reproduce the main issue, simply initiate a training run with early_stopping_patience set to a value of your choice, then interrupt training before early stopping can trigger. Reinitiate training with resume_from_checkpoint=True. Rinse and repeat: even once the evaluation metric has failed to improve on best_metric for early_stopping_patience-many consecutive evaluation calls, training never stops, because the counter is reset to 0 on every resume. A minimal script is sketched below.
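
A minimal reproduction sketch (the model, dataset, and hyperparameters are arbitrary placeholders; the relevant parts are the EarlyStoppingCallback and resume_from_checkpoint):

```python
import os

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "sst2")  # placeholder dataset
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["sentence"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="output_dir",
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# Resume automatically if a previous (interrupted) run left a checkpoint behind.
has_checkpoint = os.path.isdir("output_dir") and any(
    d.startswith("checkpoint-") for d in os.listdir("output_dir")
)

# Run the script, interrupt it (Ctrl+C) before 3 consecutive non-improving
# evaluations have accumulated, then run it again: the patience counter starts
# over at 0 on every resume, so early stopping never fires.
trainer.train(resume_from_checkpoint=has_checkpoint)
```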

To reproduce the auxiliary issue, don't interrupt your run until it stops due to early stopping. When it is complete, reinitiate training with resume_from_checkpoint=True.

Expected behavior

Early stopping patience should work exactly the same when stopping and resuming runs from a checkpoint as when training continuously without interruption.

tanmay17061 (Contributor) commented Jan 17, 2024

IMO, the problem can be generalised through the TrainingArguments parameter save_only_model:

Its definition states: "...when this is true, you won't be able to resume training from checkpoint.".
As the current TrainerState implementation stands, we are not truly able to resume training from a checkpoint even on setting save_only_model=False.

As pointed out by @Ubadub here, Trainer's callbacks' states are not persisted along with the model (even when save_only_model=False). To fix this issue (and the auxiliary issue #10290 mentioned above), we need the ability to persist callback state and reload it when using resume_from_checkpoint.

In short, Trainer's callbacks should be a part of the TrainerState object.
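
As a rough illustration of the idea (every name below is invented for the sketch, not an existing transformers API): stateful callbacks could expose a small dict that the Trainer merges into trainer_state.json on save and feeds back into the callbacks on resume_from_checkpoint.

```python
import json


class StatefulCallbackMixin:
    """Hypothetical opt-in interface for callbacks that carry state."""

    def state(self) -> dict:
        raise NotImplementedError

    def load_state(self, payload: dict) -> None:
        raise NotImplementedError


def save_callback_states(trainer_state_path: str, callbacks: list) -> None:
    # Merge each stateful callback's dict into the serialized TrainerState.
    with open(trainer_state_path) as f:
        payload = json.load(f)
    payload["stateful_callbacks"] = {
        type(cb).__name__: cb.state()
        for cb in callbacks
        if isinstance(cb, StatefulCallbackMixin)
    }
    with open(trainer_state_path, "w") as f:
        json.dump(payload, f, indent=2)


def restore_callback_states(trainer_state_path: str, callbacks: list) -> None:
    # Hand each stateful callback its saved dict when resuming from checkpoint.
    with open(trainer_state_path) as f:
        payload = json.load(f)
    saved = payload.get("stateful_callbacks", {})
    for cb in callbacks:
        if isinstance(cb, StatefulCallbackMixin) and type(cb).__name__ in saved:
            cb.load_state(saved[type(cb).__name__])
```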

I can help in this implementation if this analysis seems reasonable.

Cheers!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Mar 7, 2024
Ubadub (Contributor, Author) commented Mar 14, 2024

I forgot to reply to this in time, but I am confirming that this issue is still present and should not be closed. @muellerzr and @pacman100, could one of you reopen this issue?

@tanmay17061's suggested solution makes a lot of sense to me. I can have a go at solving this myself, but likely not for a few more weeks at the earliest.

muellerzr (Contributor) commented

Hi all, #29666 should solve the issue. I chose to write a relatively scalable way for us to save callback data inside the TrainerState :)

You can test it with pip install git+https://github.com/huggingface/transformers@muellerzr-checkpoint-callbacks
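
For example, one way to sanity-check it (illustrative; this reuses the trainer object from the reproduction sketch earlier in the thread and only touches attributes the callback already has):

```python
from transformers import EarlyStoppingCallback

# Assumes `trainer` is built as in the reproduction sketch above and that a
# checkpoint with some non-improving evaluations already exists on disk.
trainer.train(resume_from_checkpoint=True)

for cb in trainer.callback_handler.callbacks:
    if isinstance(cb, EarlyStoppingCallback):
        # With the fix, this should continue from the value saved in the
        # checkpoint instead of restarting at 0.
        print("patience counter after resume:", cb.early_stopping_patience_counter)
```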

Thanks for your patience!

github-actions bot commented Apr 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
