Fix bug for checkpoint saving in multi-node training setting (#28078)
* add multi-node training setting

* fix style
dumpmemory authored Dec 15, 2023
1 parent dec84b3 commit 1c286be
Showing 1 changed file with 3 additions and 1 deletion.
src/transformers/trainer.py: 3 additions & 1 deletion
@@ -2386,7 +2386,9 @@ def _save_checkpoint(self, model, trial, metrics=None):
         self.args.distributed_state.wait_for_everyone()
 
         # Then go through the rewriting process starting on process 0
         if staging_output_dir != output_dir:
-            with self.args.main_process_first(desc="Renaming model checkpoint folder to true location"):
+            with self.args.main_process_first(
+                desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
+            ):
                 if os.path.exists(staging_output_dir):
                     os.rename(staging_output_dir, output_dir)
