Fix a convergence issue in TP topology caused by incorrect grad_norm. #5411
Conversation
@inkcherry, amazing contribution!!! @tohtana, @conglongli FYI
Overall this looks good to me; I left a few comments. Still needs a second review from @tohtana.
deepspeed/runtime/utils.py
Outdated
# Use grad_norm_mask to avoid redundant computation of flattened gradient norm
# # including, Pipeline parallelism may replicate parameters.
# # replicated tensors from tensor model parallelism

# A loop-free implementation to create a mask tensor based on a range list,
# which is logically equivalent to the following implementation.

# # mask_tensor_ = torch.zeros_like(p, device=p.device, dtype=bool)
# # for mask_idx in grad_norm_mask[idx]:
# #     mask_tensor_[mask_idx[0]:mask_idx[1]] = True
Please clean up this block of comments
cleaned up and kept some for readability
deepspeed/runtime/utils.py
Outdated
mask_tensor = torch.zeros(p.shape[0] + 1, device=get_accelerator().current_device(), dtype=p.dtype)
mask_tensor = mask_tensor.scatter_(0, grad_norm_mask[idx].view(-1),
                                   cum_sum_pairs.view(-1)).cumsum(0).bool()[:-1]
# assert torch.equal(mask_tensor_, mask_tensor)
Please delete this if no longer needed
deleted
This looks good to me, thank you @inkcherry!
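For reference, a minimal sketch of the loop-free range-to-mask construction discussed above, assuming illustrative names (`range_list_to_mask`, `ranges`, `numel`) that are not the identifiers in deepspeed/runtime/utils.py; it uses `scatter_add_` plus `cumsum` rather than the exact `scatter_` call in the diff:

```python
import torch

# Hypothetical helper: build a boolean mask from a list of [start, end) ranges
# without a Python loop. Scatter +1 at each start and -1 at each end, then a
# cumulative sum marks every position covered by some range.
def range_list_to_mask(ranges: torch.Tensor, numel: int) -> torch.Tensor:
    pm = torch.tensor([1, -1], dtype=torch.int64).repeat(ranges.shape[0])
    buf = torch.zeros(numel + 1, dtype=torch.int64)  # extra slot so end == numel is valid
    buf.scatter_add_(0, ranges.reshape(-1), pm)
    return buf.cumsum(0).bool()[:-1]                 # drop the extra slot

# Loop-based reference, equivalent to the commented-out implementation above.
def range_list_to_mask_loop(ranges: torch.Tensor, numel: int) -> torch.Tensor:
    mask = torch.zeros(numel, dtype=torch.bool)
    for start, end in ranges.tolist():
        mask[start:end] = True
    return mask

ranges = torch.tensor([[2, 5], [7, 9]])
assert torch.equal(range_list_to_mask(ranges, 10), range_list_to_mask_loop(ranges, 10))
```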
Some users, having noticed similar issues, are concerned that changes in TP topology during MoE training may interfere with their experiments:
microsoft/Megatron-DeepSpeed#151
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files
We found a grad_norm calculation error after enabling TP. The error occurs because the flattened gradient of a params group is used, and the group contains both non-TP and TP parameters, so a single attribute cannot determine whether the flattened gradient should contribute to the norm. In the current code logic, all params are assumed to be non-TP, so only the tp_rank 0 gradient participates in the grad_norm computation; the gradients on other tp_ranks contribute a grad_norm sum of 0. We tested with TP=1 and TP=4 and found the grad_norm differs by approximately a factor of two (sqrt(4)), which aligns with the issues above. This problem should also affect dense models.
In bf16 the params-group gradients are not flattened, so this problem is avoided there.
We tested the loss curve on the 1.3B model. The inconsistency gap should grow as the TP size increases.
With this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)

Without this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)
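To make the miscount concrete, here is a minimal single-process sketch of the behavior described above and of what a per-element mask enables. All names (`tp_size`, `shard_grads`, `repl_grad`) are illustrative and the TP all-reduce is simulated by a plain Python sum; this is not DeepSpeed's actual implementation:

```python
import torch

# Each TP rank holds a flattened group containing its shard of a TP-partitioned
# weight plus a full copy of a replicated (non-TP) parameter.
tp_size = 4
shard_grads = [torch.randn(256) for _ in range(tp_size)]  # distinct shard grads per TP rank
repl_grad = torch.randn(128)                              # identical on every TP rank

# Ground truth: every shard counts once, the replicated grad counts once.
true_sq = sum(g.pow(2).sum() for g in shard_grads) + repl_grad.pow(2).sum()

# Buggy logic: the whole flattened group is treated as non-TP, so only tp_rank 0
# contributes before the (simulated) all-reduce; the other ranks' shards are dropped.
buggy_sq = shard_grads[0].pow(2).sum() + repl_grad.pow(2).sum()

# Fixed logic (what a grad_norm mask enables): each rank counts its own shard,
# and only rank 0 counts the replicated elements.
fixed_sq = sum(shard_grads[r].pow(2).sum() + (repl_grad.pow(2).sum() if r == 0 else 0.0)
               for r in range(tp_size))

# The buggy norm is roughly sqrt(tp_size) times too small for the partitioned parameters.
print(true_sq.sqrt(), buggy_sq.sqrt(), fixed_sq.sqrt())
```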