
load_best_model_at_end is inconsistent with evaluation (and save) logic at end of training #28539

Closed · antoine-lizee opened this issue Jan 16, 2024 · 7 comments · Fixed by #30160
System Info

  • transformers version: 4.36.2
  • Platform: Linux-5.10.201-191.748.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.2
  • Safetensors version: 0.3.3
  • Accelerate version: 0.26.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@muellerzr @pacman100 @sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Shortened script below:


from transformers import (
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
    XLMRobertaForTokenClassification,
)

# label_list, tokenizer, train_ds, test_ds, compute_metrics, output_dir_name,
# run_dir and epochs are all defined earlier in the full (unshortened) script.

model_checkpoint = "xlm-roberta-large"
model_name = model_checkpoint.split("/")[-1]
model = XLMRobertaForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

batch_size = 32
learning_rate = 2e-5
eval_steps = 0.1  # a float in (0, 1) is interpreted as a ratio of the total training steps

# The data + batch size lead to 11277 training steps in total

training_args = TrainingArguments(
    output_dir_name,
    logging_dir=run_dir,
    logging_strategy="steps",
    logging_steps=eval_steps / 5,
    evaluation_strategy="steps",
    eval_steps=eval_steps,
    save_strategy="steps",
    save_steps=eval_steps,
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.01,
    push_to_hub=False,
    save_total_limit=4,
    load_best_model_at_end=True
)

data_collator = DataCollatorForTokenClassification(tokenizer)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

Expected behavior

I would expect my model to be evaluated (and saved!) at the last step.

It is not; in most example scripts we see trainer.evaluate() called after trainer.train().

As a result, when we set load_best_model_at_end=True we effectively discard any training that happened after the last checkpoint, which seems wrong. In my case, the last 10% of training is discarded.

My understanding of what's happening:

  • In the trainer callback, we check (here) whether global_step is a multiple of eval_steps. If the total number of steps is not a multiple of it, this condition is not met at the last step (see the sketch after this list).
  • If we load_best_model_at_end, the last accessible evaluation does not include the performance of the latest stages of training.
  • As a side note, running trainer.evaluate() by hand after training only re-evaluates the past checkpoint that was selected as the best.
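
For reference, the step-based trigger lives in DefaultFlowCallback and boils down to a modulo check roughly like the following (a paraphrase, not the exact transformers source):

# Rough paraphrase of the checks in DefaultFlowCallback.on_step_end (transformers 4.36):
if args.evaluation_strategy == "steps" and state.global_step % state.eval_steps == 0:
    control.should_evaluate = True
if args.save_strategy == "steps" and state.global_step % state.save_steps == 0:
    control.should_save = True

With 11277 total steps and an eval/save interval of roughly 10% of that, the final step is not a multiple of the interval, so neither flag is set and the last stretch of training is never evaluated or checkpointed.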
antoine-lizee (Author) commented Jan 16, 2024

Notes

I realize that this issue probably doesn't arise if the strategy is epoch.

It seems that using N + epsilon as num_train_epochs would work around this problem in a very hacky way (and evaluate / save the model corresponding to the first step after the desired epoch count that is a multiple of eval_steps). Would that be your recommendation?

edit: OK, digging a bit more, it seems the proper way to fix this would be to add a callback to the trainer that enforces saving at the end of training; a sketch of such a callback is below.
I will do this, but the default behaviour is still "wrong", I believe (and would warrant at least a clear disclaimer in the docs?).
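
A minimal sketch of such a callback (my own workaround, not something shipped with transformers), forcing an evaluation and a checkpoint on the final step so that load_best_model_at_end also considers the fully trained model:

from transformers import TrainerCallback

class EvalAndSaveAtEndCallback(TrainerCallback):
    # Force one last evaluation and checkpoint on the final training step,
    # so the best-model selection also sees the fully trained weights.
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step >= state.max_steps:
            control.should_evaluate = True
            control.should_save = True
        return control

# usage:
# trainer.add_callback(EvalAndSaveAtEndCallback())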

amyeroberts (Collaborator)

Gentle ping @muellerzr @pacman100

amyeroberts (Collaborator)

Another ping @pacman100 @muellerzr

pacman100 (Contributor)

Hello, Zach will be looking into this.

muellerzr (Contributor)

Done, #30160 will address this by making it the default to always save the model at the end of training.

ymoslem (Contributor) commented Apr 10, 2024

Hello! I have a relevant question, please. If both load_best_model_at_end and push_to_hub are True, is it the best model or the last model that gets uploaded, and how can I verify which one? I am asking because when the model card is updated, it shows the "results on the evaluation set" of the last model, not the best model. Thanks for clarifying!

UPDATE
Answering my own question: I downloaded the model from the Hub and compared the checksum of its model file with that of the local one (sha256sum model.safetensors). They match, so it appears the best model is indeed the one uploaded; this is just not reflected when the evaluation on the model card is updated automatically. I can manually edit the card.
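
For reference, a quick way to do that comparison in Python (the repo id and local path below are placeholders for my own):

import hashlib
from huggingface_hub import hf_hub_download

def sha256(path):
    # Hash the file in chunks to avoid loading large model files into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

hub_file = hf_hub_download("my-username/my-model", "model.safetensors")  # placeholder repo id
print(sha256(hub_file) == sha256("output_dir/model.safetensors"))  # placeholder local path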

github-actions bot commented May 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
