Improve parallel process of universal checkpoint conversion #5343
Merged: tohtana merged 7 commits into microsoft:master from tohtana:tohtana/ds_to_univ_parallel_pool on Apr 22, 2024
tohtana changed the title from "improve parallel process of universal checkpoint conversion" to "Improve parallel process of universal checkpoint conversion" on Apr 1, 2024
tjruwase approved these changes on Apr 1, 2024
@tohtana, this is an amazing improvement. Did you observe any conversion speedups that you can share in this PR?
@tjruwase I converted data on blob storage. The conversion was entirely IO-bound, but I still observed a 10-20% speedup overall.
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request on May 9, 2024
…t#5343)

The conversion script from a regular checkpoint to the universal one runs the followings in parallel.

1. extracts zero sharded optimizer states
2. merge the shards

However, it passes `map()` a set of only a few tasks (the number specified as workers). Thus it needs to wait for the slowest tasks to finish for every set. This PR submits all the tasks to the pool and wait until the futures get ready. We can keep all workers running.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request on May 20, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request on Jun 11, 2024
The script that converts a regular checkpoint to a universal one runs the following steps in parallel:

1. Extract the ZeRO-sharded optimizer states
2. Merge the shards

However, it passes `map()` only a small set of tasks at a time (as many as the specified number of workers), so each set must wait for its slowest task to finish before the next set starts. This PR submits all tasks to the pool and waits until the futures are ready, which keeps all workers running.
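
The difference between the two scheduling patterns can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `fake_task`, `run_in_batches`, and `run_all_submitted` are hypothetical names, the tasks are simulated with `time.sleep`, and a thread pool is assumed here only to keep the example self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(duration):
    """Hypothetical stand-in for extracting or merging one shard."""
    time.sleep(duration)
    return duration


def run_in_batches(durations, num_workers):
    """Earlier pattern: call map() on num_workers tasks at a time."""
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for i in range(0, len(durations), num_workers):
            batch = durations[i:i + num_workers]
            # Consuming map() blocks until the slowest task in this
            # batch finishes, so faster workers sit idle meanwhile.
            results.extend(pool.map(fake_task, batch))
    return results


def run_all_submitted(durations, num_workers):
    """Pattern described in this PR: submit everything, then wait."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fake_task, d) for d in durations]
        # A worker that finishes early picks up the next queued task.
        # Results are collected in submission order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    durations = [0.2, 0.01, 0.01, 0.2]
    for runner in (run_in_batches, run_all_submitted):
        start = time.perf_counter()
        runner(durations, num_workers=2)
        print(runner.__name__, round(time.perf_counter() - start, 2))
```

With these simulated durations and two workers, the batched version has to wait for a slow task in each of its two batches, while the all-submitted version lets one worker run both short tasks and the second slow task while the other worker is still busy with the first slow one.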