# `load_best_model_at_end` is inconsistent with evaluation (and save) logic at end of training (#28539)
## Comments
Notes: I realize that this issue probably doesn't arise if the strategy is `epoch`. It seems that using N + epsilon as the …

Edit: Ok, digging a bit more, it seems that the proper way of fixing this problem would be to add a callback to the trainer which would enforce saving at the end of training.
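A minimal sketch of such a callback, using the public `TrainerCallback` hooks (an illustration of the suggestion above, not the fix that eventually landed):

```python
from transformers import TrainerCallback


class ForceFinalEvalSaveCallback(TrainerCallback):
    """Force one last evaluation and checkpoint on the final training step,
    even when global_step is not a multiple of eval_steps/save_steps."""

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step >= state.max_steps:
            control.should_evaluate = True
            control.should_save = True
        return control
```

It would be registered with `trainer.add_callback(ForceFinalEvalSaveCallback())` before calling `trainer.train()`.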
Gentle ping @muellerzr @pacman100

Another ping @pacman100 @muellerzr
Hello, Zach will be looking into this.

Done, #30160 will address this by making it default to save the model at the end of training, always.
Hello! I have a relevant question, please. If both …

UPDATE: …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
## System Info
`transformers` version: 4.36.2

## Who can help?
@muellerzr @pacman100 @sgugger
## Information

## Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

## Reproduction
Shortened script below:
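A minimal sketch of the kind of configuration that triggers this (the tiny model, toy dataset, and step counts below are illustrative assumptions, not the original values):

```python
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


class ToyDataset(Dataset):
    """Tiny constant dataset, just enough to drive the Trainer loop."""

    def __init__(self, tokenizer, n=64):
        enc = tokenizer("hello world", padding="max_length", max_length=16)
        self.items = [
            {**{k: torch.tensor(v) for k, v in enc.items()},
             "labels": torch.tensor(i % 2)}
            for i in range(n)
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


name = "prajjwal1/bert-tiny"  # any small checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

args = TrainingArguments(
    output_dir="out",
    max_steps=105,                # NOT a multiple of eval_steps
    evaluation_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ToyDataset(tokenizer),
    eval_dataset=ToyDataset(tokenizer, n=16),
)
trainer.train()
# The last eval/checkpoint happens at step 100; steps 101-105 are
# silently discarded when the "best" checkpoint is reloaded.
```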
## Expected behavior
I would expect that my model is evaluated (and saved!) at the last step.
It is not, and in most example scripts we see `trainer.evaluate()` after the `trainer.train()`.

As a result, when we set `load_best_model_at_end=True`, we concretely discard any training that happened after the last checkpoint, which seems wrong. In my case, the last 10% of training is discarded.
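Concretely, the common pattern looks like this (illustrative, assuming a `trainer` configured as above):

```python
trainer.train()               # with load_best_model_at_end=True, the best
                              # checkpoint is reloaded at the end of train()
metrics = trainer.evaluate()  # so this scores the reloaded best checkpoint,
                              # not the weights from the final steps
```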
My understanding of what's happening:

- Evaluation (and saving) is only triggered when `global_step` is a multiple of the `eval_steps`. If the total number of steps is not a multiple of it, this condition is not met at the last step.
- With `load_best_model_at_end`, the last accessible evaluation therefore does not include the performance of the latest stages of training.
- Calling `trainer.evaluate()` by hand after the training only re-evaluates the past checkpoint that was selected as the best.
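For reference, the gating lives in the Trainer's `DefaultFlowCallback`; a paraphrased sketch of the 4.36-era logic (simplified, not a verbatim copy):

```python
# Inside transformers' DefaultFlowCallback (paraphrased and simplified)
def on_step_end(self, args, state, control, **kwargs):
    # Evaluation fires only when global_step lands exactly on a multiple of
    # eval_steps, so a final step such as 105 with eval_steps=10 never fires.
    if args.evaluation_strategy == "steps" and state.global_step % args.eval_steps == 0:
        control.should_evaluate = True
    if args.save_strategy == "steps" and state.global_step % args.save_steps == 0:
        control.should_save = True
    if state.global_step >= state.max_steps:
        control.should_training_stop = True  # stops without a final eval/save
    return control
```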