Hi, I'm running into an issue pushing checkpoints to Hub during training. I think the issue arises in these lines.
Specifically, the staging directory is set in trainer.py, and then passed to self.save_model():
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty. "
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
self.save_model(staging_output_dir, _internal_call=True)
If the output directory does not exist yet (e.g. in a new training run), the checkpoint folder name is prefixed with tmp-.
With push_to_hub=True, save_model launches a model push job in a new thread. This job is looking for the tmp-checkpoint-... directory.
Before that job can finish executing (sometimes even before it starts), L2538 is run:
os.rename(staging_output_dir, output_dir)
This removes the tmp- prefix, so the directory the push job was given no longer exists under that name.
If I inspect the trainer's push jobs afterwards, I can see the exception:
ipdb> trainer.push_in_progress.jobs
[<Future at 0x7f04be9a3b50 state=finished returned CommitInfo>, <Future at 0x7f04be9a27a0 state=finished raised ValueError>]
ipdb> trainer.push_in_progress.jobs[-1].result()
*** ValueError: Provided path: '/tmp/test_trainer/tmp-checkpoint-13' is not a directory
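To illustrate the ordering problem in isolation, here is a minimal sketch that uses only the standard library. It is not the Trainer's or the Hub client's actual code: `fake_push` and `run_race` are hypothetical stand-ins, and the `threading.Event` simply forces the "job starts after the rename" ordering that happens nondeterministically in a real run.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def fake_push(folder: str, start: threading.Event) -> str:
    """Hypothetical stand-in for the upload job (assumption, not the Hub API)."""
    start.wait()  # simulate the job being scheduled late
    if not os.path.isdir(folder):
        raise ValueError(f"Provided path: '{folder}' is not a directory")
    return f"pushed {folder}"


def run_race() -> Optional[BaseException]:
    """Submit a push job for the staging dir, then rename the dir before it runs."""
    with tempfile.TemporaryDirectory() as run_dir:
        output_dir = os.path.join(run_dir, "checkpoint-13")
        staging_output_dir = os.path.join(run_dir, "tmp-checkpoint-13")
        os.makedirs(staging_output_dir)

        start = threading.Event()
        with ThreadPoolExecutor(max_workers=1) as pool:
            # The job captures the staging path, as described above.
            job = pool.submit(fake_push, staging_output_dir, start)
            # The main thread renames the directory before the job inspects it.
            os.rename(staging_output_dir, output_dir)
            start.set()
            # Blocks until the job finishes; returns the raised exception, if any.
            return job.exception()


if __name__ == "__main__":
    print(repr(run_race()))
```

In this sketch the job always fails with a `ValueError` because the path it captured has been renamed away. In the real Trainer the outcome would depend on timing, which would be consistent with one job in the list above succeeding and the other failing.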
System Info
transformers version: 4.38.2
Who can help?
@muellerz @pacman100
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Here's a script that reproduces this problem:
Expected behavior