Fix loading a universal checkpoint #5263

Merged: 6 commits into master, Mar 13, 2024
Conversation

tohtana (Contributor) commented Mar 12, 2024

This PR fixes the following two points regarding checkpoint loading.

  • Load optimizer states
    With #5104, we removed the optimizer's `step()` call on initialization. This made DeepSpeed's parameter updates match PyTorch's normal behavior. However, the optimizer states no longer have keys when we load a checkpoint.
    For legacy/elastic checkpoints, that PR changed the checkpoint loaders to create keys and buffers on loading. However, the loader for universal checkpoints still relies on keys in the optimizer states. As a result, loading a universal checkpoint fails.
    This PR fixes the loader to find optimizer state keys from the given checkpoint.

  • Resume step count (2943e6a)
    The checkpoint loader for a universal checkpoint resumes the optimizer's step count only when the param group already has `step`. But some optimizers create the key `step` in a param group at the first call of `step()` (e.g. Apex [Fused Adam](https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_adam.py#L154)). In this case, the step count is not restored. This PR changes the behavior to always set the step count in a param group, as sketched below.
    This PR also stops incrementing the step count when loading. I didn't see why we need to increment the step count in my small example, but we may need a discussion to consider various cases.
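A minimal sketch of the intended behavior (an illustrative helper, not DeepSpeed's actual `update_optimizer_step` implementation), assuming a torch-style optimizer:

```python
def restore_step_count(optimizer, step):
    """Illustrative fix: always set the step count when restoring."""
    for group in optimizer.param_groups:
        # The old loader only updated "step" when the key already existed,
        # which skips optimizers (e.g. Apex FusedAdam) that create "step"
        # lazily on the first step() call. Setting it unconditionally
        # ensures the resumed run continues from the right count.
        group["step"] = step
        for param in group["params"]:
            state = optimizer.state.get(param)
            if state is not None and "step" in state:
                state["step"] = step
```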

```diff
@@ -2785,7 +2785,7 @@ def load_checkpoint(self,
         if self.load_universal_checkpoint():
             self.optimizer.update_lp_params()
             if load_zero_checkpoint:
-                self.update_optimizer_step(step=client_states['iteration'] + 1)
+                self.update_optimizer_step(step=client_states['iteration'])
```
A contributor commented on this diff:
@mosheisland, FYI. Based on #4588, is there a potential off-by-one issue here?

@tohtana tohtana marked this pull request as ready for review March 12, 2024 23:50
@tohtana tohtana added this pull request to the merge queue Mar 13, 2024
@tohtana tohtana removed this pull request from the merge queue due to a manual request Mar 13, 2024
@tjruwase tjruwase added this pull request to the merge queue Mar 13, 2024
Merged via the queue into master with commit b112c99 Mar 13, 2024
12 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Mar 28, 2024
This PR includes the following improvements regarding universal
checkpoints.

- Restoring step

A universal checkpoint saves the training step count taken from the
engine. In #5263, we changed the loader to always set this count,
restoring the training step count to the optimizer's per-parameter
states (`optimizer_state['state'][param]['step']`) and to each param
group. However, this approach does not restore the optimizer's state and
param groups precisely, because optimizers handle `step` differently.

Torch's Adam doesn't create `step` in its param groups and only uses
`optimizer_state['state'][param]['step']`. Apex's FusedAdam only uses
`step` in its param groups. DeepSpeed's FusedAdam creates `step` in its
param groups but never updates it; it only uses
`optimizer_state['state'][param]['step']`.
Consequently, setting `step` everywhere leads to discrepancies between
the restored and original states of the optimizer and param groups.

This PR modifies the restoration process to ensure that the step number
in the optimizer's state and param groups matches the original setup,
effectively aligning the restored and original optimizer states and
param groups. The snippet below shows where torch's Adam, for example,
keeps its step count.
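As a quick, self-contained illustration of these differences (stock
PyTorch only; Apex's and DeepSpeed's FusedAdam behave as described
above):

```python
import torch

# After one update, torch's Adam keeps "step" only in the per-parameter
# state, not in the param group. This is what makes a one-size-fits-all
# restore of the step count incorrect across optimizers.
model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

model(torch.randn(8, 4)).sum().backward()
opt.step()

param = next(model.parameters())
print("step in param group:", "step" in opt.param_groups[0])  # False
print("per-param step:", opt.state[param]["step"])  # 1 (a tensor in newer torch)
```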

- Unit tests of DP size scaling

This PR also adds unit tests to verify universal checkpointing. They run
training with data parallelism (DP), save a checkpoint, and convert it
into a universal checkpoint. Then they load the checkpoint with a
different DP size and validate that the parameters and the all-gathered
(ZeRO stage 1/2) optimizer states match. An outline of this flow appears
below.
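A hypothetical outline of such a test (all helper names here are
illustrative placeholders, not the actual DeepSpeed test utilities):

```python
def test_universal_checkpoint_dp_resize(tmpdir):
    # Train with one DP size and save a regular ZeRO checkpoint.
    ckpt_dir = train_and_save(dp_size=4, steps=10, save_dir=tmpdir)  # assumed helper

    # Convert it into a universal checkpoint (ds_to_universal-style step).
    univ_dir = convert_to_universal(ckpt_dir)  # assumed helper

    # Reload with a different DP size and compare against the original size.
    resumed = resume_training(univ_dir, dp_size=2)   # assumed helper
    baseline = resume_training(univ_dir, dp_size=4)  # assumed helper

    assert params_allclose(resumed, baseline)  # assumed helper
    # For ZeRO stage 1/2, all-gather the partitioned optimizer states
    # before comparing.
    assert gathered_optim_states_allclose(resumed, baseline)  # assumed helper
```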

- Fix bug when loading with `load_optimizer_states=False`

The loader didn't load parameters from a universal checkpoint when
`load_optimizer_states=False`. Commit c8c0498 fixes this issue.
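For reference, this is the affected call pattern, assuming `engine` is
an initialized DeepSpeed engine and the checkpoint path is a
placeholder:

```python
# Load module weights only; optimizer state is skipped. Before commit
# c8c0498, this path also failed to load the parameters themselves from
# a universal checkpoint.
load_path, client_state = engine.load_checkpoint(
    "checkpoints/universal_step1000",  # assumed path
    load_optimizer_states=False,
)
```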
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024