Fix for checkpoint rename race condition #28364
Conversation
One thing to note: I attempted to reuse the existing with block and include the fsync. Unfortunately, fsync did not flush buffers related to the staging directory, so it still failed on other processes. This raises some concerns about the behavior of fsync on network-attached storage using NFS.
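For context, fsync only persists the object behind the file descriptor it is given, so syncing a file inside the staging directory does not necessarily persist the directory entry created by the rename. A minimal sketch of syncing the directory itself, assuming a POSIX filesystem (not the exact code in this PR; on NFS or Windows this may be a no-op or fail outright):

```python
import os

def fsync_directory(path: str) -> None:
    """Best-effort flush of a directory's metadata, e.g. after os.rename.

    Minimal sketch assuming a POSIX filesystem: opening the directory
    read-only and calling fsync persists its entries on most local
    filesystems, but may be a no-op on NFS and fails on Windows.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```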
Oops, missed this. I looked yesterday :) but I guess you posted after I looked. This is my version: I don't see why the existence check is required if it is happening only once per node.
There could be a race if the output directory for the checkpoint is used sometime later in the code. If that is not the case, then it shouldn't be an issue.
Hi all, I think the best solution here is a mix of both. @siddartha-RE, your approach of using existing functions in the state is better than Accelerate here for readability and for keeping the focus on the Trainer state, and @tblattner, using fsync here is, I think, more robust.
I propose: @siddartha-RE, can you make your PR cover just the modifications to trainer_callback.py, and we can handle the OS issue in trainer in this PR?
Thank you both so much for your wonderful solutions and PRs.
src/transformers/trainer.py (Outdated)

    if self.args.should_save:
        self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
Let's also move this under the if/else so that it's done on just a single process as needed.
(Above the wait_for_everyone call)
I actually have a related question. I noticed that the current code leaves push_to_hub unguarded by any is_local / is_world process-zero check, which seems incorrect. However, I don't use that option, so I didn't want to touch it without understanding the implications.
My guess is that it would have been best to call push_to_hub after wait_for_everyone, from the final output_dir. Otherwise it seems like the push could end up shipping partially written state.
I reworked the logic for this so that it includes the rotate_checkpoints call. Also curious about push_to_hub and the save_to_json call above line 2384.
Agreed; thinking more on it, it should come after, for the reasons you mentioned.
Also, push_to_hub checks is_world_process_zero(), so it's fine :)
…o only operate with the main process. Added fsync functionality to attempt to flush the write changes in case os.rename is not atomic.
Co-authored-by: Zach Mueller <[email protected]>
Added with open usage to ensure better file closing as suggested from PR. Added rotate_checkpoints into main process logic
…to work for directories.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the great fix!
LGTM. We no longer get a warning saying that the model checkpoint was renamed, but that's not an issue IMO.
Thanks @tblattner 🤗
* Changed logic for renaming staging directory when saving checkpoint to only operate with the main process. Added fsync functionality to attempt to flush the write changes in case os.rename is not atomic.
* Updated styling using make fixup
* Updated check for main process to use built-in versions from trainer
Co-authored-by: Zach Mueller <[email protected]>
* Fixed incorrect usage of trainer main process checks. Added with open usage to ensure better file closing as suggested from PR. Added rotate_checkpoints into main process logic
* Removed "with open" due to not working with directory. os.open seems to work for directories.
---------
Co-authored-by: Zach Mueller <[email protected]>
The definition of should_save suggests the two checks would be equivalent, and the second one is a little clearer. It would also allow the extra should_save check below to be removed.
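For reference, TrainingArguments.should_save boils down to a main-process check that depends on save_on_each_node; the snippet below paraphrases that intent for illustration and is not the actual library source:

```python
def should_save(save_on_each_node: bool, local_rank: int, global_rank: int) -> bool:
    # Paraphrase of the intent behind TrainingArguments.should_save (illustrative,
    # not the library code): save on each node's local main process when
    # save_on_each_node is set, otherwise only on the global main process.
    if save_on_each_node:
        return local_rank == 0
    return global_rank == 0
```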
I think this is critical when your training fails at the first checkpoint. Maybe as a workaround we can add a condition for non-Windows systems?

    if platform.system() != 'Windows':
        fd = os.open(output_dir, os.O_RDONLY)
        os.fsync(fd)
        os.close(fd)
Yes, quick PR going in a moment.
@Vechtomov @tblattner #28637 fixed it. Unsure how it affects multi-node on Windows, but if a user has this situation and hits it, we can deal with it then, as there's not really a clean solution for doing so in Python :(
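One possible alternative to branching on platform.system() is to attempt the directory sync and tolerate platforms or filesystems where it fails; this is only a sketch of that design choice, not necessarily what the follow-up PR did:

```python
import os

def try_fsync_directory(path: str) -> None:
    # Sketch: instead of checking platform.system(), attempt the sync and
    # ignore OSError, which covers Windows (where a directory cannot be
    # opened this way) and filesystems where fsync on a directory is unsupported.
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return
    try:
        os.fsync(fd)
    except OSError:
        pass
    finally:
        os.close(fd)
```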
What does this PR do?
When running distributed training with DeepSpeed, I encountered a race condition because os.rename is not atomic on network filesystems. This rework changes the renaming logic to run only on the main processes, or on a single main process, depending on the save_on_each_node flag. It also adds fsync to try to flush buffers, hopefully ensuring the rename is completed. fsync may have no effect on some filesystems, so a better mechanism may be required to guarantee that the rename completed.
Fixes #27925
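To make the approach concrete, here is a self-contained sketch of the overall idea (write to a staging directory, have one process rename and sync, then wait at a barrier); is_main_process and barrier stand in for the Trainer's and the distributed backend's primitives and are not its actual API:

```python
import os

def finalize_checkpoint(staging_dir: str, output_dir: str,
                        is_main_process: bool, barrier=lambda: None) -> None:
    # Illustrative sketch of the idea in this PR, not the Trainer's code.
    if is_main_process:
        os.rename(staging_dir, output_dir)  # atomic on local POSIX filesystems
        # Best-effort flush of the parent directory so the rename becomes
        # visible to other processes; may be a no-op on NFS or fail on Windows.
        try:
            fd = os.open(os.path.dirname(output_dir) or ".", os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        except OSError:
            pass
    barrier()  # every rank waits until the renamed checkpoint exists
```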
Who can review?
@muellerzr
@pacman100