-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix loading a universal checkpoint #5263
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tjruwase
reviewed
Mar 12, 2024
tjruwase
reviewed
Mar 12, 2024
@@ -2785,7 +2785,7 @@ def load_checkpoint(self, | |||
if self.load_universal_checkpoint(): | |||
self.optimizer.update_lp_params() | |||
if load_zero_checkpoint: | |||
self.update_optimizer_step(step=client_states['iteration'] + 1) | |||
self.update_optimizer_step(step=client_states['iteration']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mosheisland, FYI. Based on #4588, is there a potential off-by-one issue here?
tjruwase
approved these changes
Mar 13, 2024
…osoft/DeepSpeed into tohtana/fix_univ_chkpt_load
github-merge-queue bot
pushed a commit
that referenced
this pull request
Mar 28, 2024
This PR includes the following improvement regarding universal checkpoint. - Restoring step A universal checkpoint saves the training step count taken from the engine. In #5263, we fixed to always set this count to restore training step count to optimizer's states per-param (`optimizer_state['state`][param]['step']`) and a param_group. However, this approach does not restore the optimizer's state and param groups precisely due to different behaviors of optimizers. Torch's Adam doesn't make `step` in a param groups and only uses `optimizer_state['state'][param]['step']`. Apex's fused adam only uses `step` in a param groups. DeepSpeed's fused adam creates `step` in a param groups and never updates. It only uses `optimizer_state['state'][param]['step']`. Consequently, this leads to discrepancies between the restored and original states of the optimizer and param groups. This PR modifies the restoration process to ensure that the step number in the optimizer's state and param groups matches those in the original setup, effectively aligning the restored and original optimizer states and param groups. - Unit tests of DP size scaling This PR also adds unit tests to verify universal checkpointing. They run training with DP, save a checkpoint, and converts in to a universal checkpoint. Then they load the checkpoint with a different DP size and validate that parameters and the all-gathered (ZeRO 1/2) optimizer states match. - Fix bug of loading with `load_optimizer_states=False` The loader doesn't load parameters from a universal checkpoint when `load_optimizer_states=False`. c8c0498 fixes this issue.
rraminen
pushed a commit
to ROCm/DeepSpeed
that referenced
this pull request
May 9, 2024
This PR fixes the following two points regarding checkpoint loading. - Load optimizer states With [this PR](microsoft#5104), we removed optimizer's `step()` on initialization. This made the DS's parameter update match with PyTorch's normal behavior. However, we don't have keys in optimizer states any more when we load a checkpoint. For legacy/elastic checkpoints, the PR changed the checkpoint loaders to create keys and buffers on loading. However, the loader for universal checkpoints still relies on keys in optimizer states. As the result, loading a universal checkpoint fails. This PR fixes the loader to find optimizer state keys from a given checkpoint. - Resume step count microsoft@2943e6a The checkpoint loader for a universal checkpoint resumes step count for optimizer only when the param group already has `step`. But some optimizers creates the key `step` in a param group at the first call of `step()` (e.g. Apex [Fused Adam](https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_adam.py#L154). In this case, the step count is not restored. This PR changes this behavior to always set step count in a param group. This PR also stop incrementing the step count when loading. I didn't see why we need to increment the step count for my small example, but we may need a discussion to consider various cases.
rraminen
pushed a commit
to ROCm/DeepSpeed
that referenced
this pull request
May 9, 2024
This PR includes the following improvement regarding universal checkpoint. - Restoring step A universal checkpoint saves the training step count taken from the engine. In microsoft#5263, we fixed to always set this count to restore training step count to optimizer's states per-param (`optimizer_state['state`][param]['step']`) and a param_group. However, this approach does not restore the optimizer's state and param groups precisely due to different behaviors of optimizers. Torch's Adam doesn't make `step` in a param groups and only uses `optimizer_state['state'][param]['step']`. Apex's fused adam only uses `step` in a param groups. DeepSpeed's fused adam creates `step` in a param groups and never updates. It only uses `optimizer_state['state'][param]['step']`. Consequently, this leads to discrepancies between the restored and original states of the optimizer and param groups. This PR modifies the restoration process to ensure that the step number in the optimizer's state and param groups matches those in the original setup, effectively aligning the restored and original optimizer states and param groups. - Unit tests of DP size scaling This PR also adds unit tests to verify universal checkpointing. They run training with DP, save a checkpoint, and converts in to a universal checkpoint. Then they load the checkpoint with a different DP size and validate that parameters and the all-gathered (ZeRO 1/2) optimizer states match. - Fix bug of loading with `load_optimizer_states=False` The loader doesn't load parameters from a universal checkpoint when `load_optimizer_states=False`. microsoft@c8c0498 fixes this issue.
dbyoung18
pushed a commit
to dbyoung18/DeepSpeed
that referenced
this pull request
Jun 11, 2024
This PR fixes the following two points regarding checkpoint loading. - Load optimizer states With [this PR](microsoft#5104), we removed optimizer's `step()` on initialization. This made the DS's parameter update match with PyTorch's normal behavior. However, we don't have keys in optimizer states any more when we load a checkpoint. For legacy/elastic checkpoints, the PR changed the checkpoint loaders to create keys and buffers on loading. However, the loader for universal checkpoints still relies on keys in optimizer states. As the result, loading a universal checkpoint fails. This PR fixes the loader to find optimizer state keys from a given checkpoint. - Resume step count microsoft@2943e6a The checkpoint loader for a universal checkpoint resumes step count for optimizer only when the param group already has `step`. But some optimizers creates the key `step` in a param group at the first call of `step()` (e.g. Apex [Fused Adam](https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_adam.py#L154). In this case, the step count is not restored. This PR changes this behavior to always set step count in a param group. This PR also stop incrementing the step count when loading. I didn't see why we need to increment the step count for my small example, but we may need a discussion to consider various cases.
dbyoung18
pushed a commit
to dbyoung18/DeepSpeed
that referenced
this pull request
Jun 11, 2024
This PR includes the following improvement regarding universal checkpoint. - Restoring step A universal checkpoint saves the training step count taken from the engine. In microsoft#5263, we fixed to always set this count to restore training step count to optimizer's states per-param (`optimizer_state['state`][param]['step']`) and a param_group. However, this approach does not restore the optimizer's state and param groups precisely due to different behaviors of optimizers. Torch's Adam doesn't make `step` in a param groups and only uses `optimizer_state['state'][param]['step']`. Apex's fused adam only uses `step` in a param groups. DeepSpeed's fused adam creates `step` in a param groups and never updates. It only uses `optimizer_state['state'][param]['step']`. Consequently, this leads to discrepancies between the restored and original states of the optimizer and param groups. This PR modifies the restoration process to ensure that the step number in the optimizer's state and param groups matches those in the original setup, effectively aligning the restored and original optimizer states and param groups. - Unit tests of DP size scaling This PR also adds unit tests to verify universal checkpointing. They run training with DP, save a checkpoint, and converts in to a universal checkpoint. Then they load the checkpoint with a different DP size and validate that parameters and the all-gathered (ZeRO 1/2) optimizer states match. - Fix bug of loading with `load_optimizer_states=False` The loader doesn't load parameters from a universal checkpoint when `load_optimizer_states=False`. microsoft@c8c0498 fixes this issue.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR fixes the following two points regarding checkpoint loading.
Load optimizer states
With this PR, we removed optimizer's
step()
on initialization. This made the DS's parameter update match with PyTorch's normal behavior. However, we don't have keys in optimizer states any more when we load a checkpoint.For legacy/elastic checkpoints, the PR changed the checkpoint loaders to create keys and buffers on loading. However, the loader for universal checkpoints still relies on keys in optimizer states. As the result, loading a universal checkpoint fails.
This PR fixes the loader to find optimizer state keys from a given checkpoint.
Resume step count 2943e6a
The checkpoint loader for a universal checkpoint resumes step count for optimizer only when the param group already has
step
. But some optimizers creates the keystep
in a param group at the first call ofstep()
(e.g. Apex Fused Adam. In this case, the step count is not restored. This PR changes this behavior to always set step count in a param group.This PR also stop incrementing the step count when loading. I didn't see why we need to increment the step count for my small example, but we may need a discussion to consider various cases.