Fix a convergence issue in TP topology caused by incorrect grad_norm. #5411
Conversation
@inkcherry, amazing contribution!!! @tohtana, @conglongli FYI
Overall this looks good to me; I left a few comments. Still needs a second review from @tohtana.
deepspeed/runtime/utils.py
Outdated
# Use grad_norm_mask to avoid redundant computation of flattened gradient norm
# # including, Pipeline parallelism may replicate parameters.
# # replicated tensors from tensor model parallelism

# A loop-free implementation to create a mask tensor based on a range list,
# which is logically equivalent to the following implementation.

# # mask_tensor_ = torch.zeros_like(p, device=p.device, dtype=bool)
# # for mask_idx in grad_norm_mask[idx]:
# #     mask_tensor_[mask_idx[0]:mask_idx[1]] = True
Please clean up this block of comments
cleaned up and kept some for readability
deepspeed/runtime/utils.py
Outdated
mask_tensor = torch.zeros(p.shape[0] + 1, device=get_accelerator().current_device(), dtype=p.dtype)
mask_tensor = mask_tensor.scatter_(0, grad_norm_mask[idx].view(-1),
                                   cum_sum_pairs.view(-1)).cumsum(0).bool()[:-1]
# assert torch.equal(mask_tensor_, mask_tensor)
Please delete this if no longer needed
deleted
This looks good to me, thank you @inkcherry!
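For reference, a minimal sketch of the loop-free range-to-mask construction discussed above, assuming illustrative names (`range_list_to_mask`, `ranges`, `numel`) that are not the identifiers in deepspeed/runtime/utils.py; it uses `scatter_add_` plus `cumsum` rather than the exact `scatter_` call in the diff:

```python
import torch

# Hypothetical helper: build a boolean mask from a list of [start, end) ranges
# without a Python loop. Scatter +1 at each start and -1 at each end, then a
# cumulative sum marks every position covered by some range.
def range_list_to_mask(ranges: torch.Tensor, numel: int) -> torch.Tensor:
    pm = torch.tensor([1, -1], dtype=torch.int64).repeat(ranges.shape[0])
    buf = torch.zeros(numel + 1, dtype=torch.int64)  # extra slot so end == numel is valid
    buf.scatter_add_(0, ranges.reshape(-1), pm)
    return buf.cumsum(0).bool()[:-1]                 # drop the extra slot

# Loop-based reference, equivalent to the commented-out implementation above.
def range_list_to_mask_loop(ranges: torch.Tensor, numel: int) -> torch.Tensor:
    mask = torch.zeros(numel, dtype=torch.bool)
    for start, end in ranges.tolist():
        mask[start:end] = True
    return mask

ranges = torch.tensor([[2, 5], [7, 9]])
assert torch.equal(range_list_to_mask(ranges, 10), range_list_to_mask_loop(ranges, 10))
```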
Some users, having noticed similar issues, are concerned that changes in TP topology during MoE training may interfere with their experiments:
microsoft/Megatron-DeepSpeed#151
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files
We found a grad_norm calculation error after enabling TP. The error occurs because the flattened gradient of a params group is used, and the group contains both non-TP and TP parameters, so a single attribute cannot determine whether the flattened gradient should contribute to the norm. In the current code logic, all params are assumed to be non-TP, so only the tp_rank 0 gradient participates in the grad_norm computation; the gradients on other tp_ranks contribute a grad_norm sum of 0. We tested with TP=1 and TP=4 and found the grad_norm differs by approximately a factor of two (sqrt(4)), which aligns with the issues above. This problem should also affect dense models.
In bf16 the params-group gradients are not flattened, so this problem is avoided there.
We tested the loss curve on the 1.3B model. The inconsistency gap should grow as the TP size increases.
With this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)

Without this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)
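To make the miscount concrete, here is a minimal single-process sketch of the behavior described above and of what a per-element mask enables. All names (`tp_size`, `shard_grads`, `repl_grad`) are illustrative and the TP all-reduce is simulated by a plain Python sum; this is not DeepSpeed's actual implementation:

```python
import torch

# Each TP rank holds a flattened group containing its shard of a TP-partitioned
# weight plus a full copy of a replicated (non-TP) parameter.
tp_size = 4
shard_grads = [torch.randn(256) for _ in range(tp_size)]  # distinct shard grads per TP rank
repl_grad = torch.randn(128)                              # identical on every TP rank

# Ground truth: every shard counts once, the replicated grad counts once.
true_sq = sum(g.pow(2).sum() for g in shard_grads) + repl_grad.pow(2).sum()

# Buggy logic: the whole flattened group is treated as non-TP, so only tp_rank 0
# contributes before the (simulated) all-reduce; the other ranks' shards are dropped.
buggy_sq = shard_grads[0].pow(2).sum() + repl_grad.pow(2).sum()

# Fixed logic (what a grad_norm mask enables): each rank counts its own shard,
# and only rank 0 counts the replicated elements.
fixed_sq = sum(shard_grads[r].pow(2).sum() + (repl_grad.pow(2).sum() if r == 0 else 0.0)
               for r in range(tp_size))

# The buggy norm is roughly sqrt(tp_size) times too small for the partitioned parameters.
print(true_sq.sqrt(), buggy_sq.sqrt(), fixed_sq.sqrt())
```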