Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a convergence issues in TP topology caused by incorrect grad_norm. #5411

Merged
merged 9 commits into from
Apr 16, 2024

Conversation

inkcherry
Copy link
Contributor

@inkcherry inkcherry commented Apr 15, 2024

Some users are concerned that changes in TP topology during MOE training may potentially cause interference with experiments when noticing similar issues
microsoft/Megatron-DeepSpeed#151
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. This error occurs because flattened grad of a params group is used, where the group contains both non-TP and TP parameters. Therefore, it is not possible to use a single attribute to determine whether flattened grad needs to compute the norm. In the current code logic, all params are assumed to be non-TP, resulting in only tp_rank0 grad participating in grad_norm computation. Other tp_rank grads have grad_norm_sum equal to 0. We tested and found that with TP=1 and TP=4, the difference in grad_norm is approximately twice (sqrt(4)). This aligns with the aforementioned issue. This problem should also affect dense models.

Due to the absence of flattening params_group grad in bf16, this problem is avoided.

We tested the loss curve on the 1.3B model. In cases where TP size increases the inconsistent gap should be larger.

with this change 1.3B with EP=4 TP=4 &1 , fp16,mbs=1,gbs=16
image

without this change 1.3B with EP=4 TP=4&1 ,fp16,mbs=1,gbs=16
image

@tjruwase
Copy link
Contributor

tjruwase commented Apr 15, 2024

@inkcherry, amazing contribution!!!

@tohtana, @conglongli FYI

@tjruwase tjruwase requested review from tohtana and conglongli and removed request for mrwyattii April 15, 2024 14:01
Copy link
Contributor

@conglongli conglongli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me and left a few comments. Still need a second review from @tohtana

Comment on lines 420 to 429
# Use grad_norm_mask to avoid redundant computation of flattened gradient norm
# # including, Pipeline parallelism may replicate parameters.
# # replicated tensors from tensor model parallelism

# A loop-free implementation to create a mask tensor based on a range list,
# which is logically equivalent to the following implementation.

# # mask_tensor_ = torch.zeros_like(p, device=p.device, dtype=bool)
# #for mask_idx in grad_norm_mask[idx]:
# # mask_tensor_[mask_idx[0]:mask_idx[1]] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clean up this block of comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaned up and kept some for readability

mask_tensor = torch.zeros(p.shape[0] + 1, device=get_accelerator().current_device(), dtype=p.dtype)
mask_tensor = mask_tensor.scatter_(0, grad_norm_mask[idx].view(-1),
cum_sum_pairs.view(-1)).cumsum(0).bool()[:-1]
# assert torch.equal(mask_tensor_, mask_tensor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete this if no longer needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

deepspeed/runtime/utils.py Show resolved Hide resolved
Copy link
Contributor

@tohtana tohtana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thank you @inkcherry!

@conglongli conglongli enabled auto-merge April 16, 2024 18:20
@conglongli conglongli added this pull request to the merge queue Apr 16, 2024
Merged via the queue into microsoft:master with commit 0896503 Apr 16, 2024
14 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
microsoft#5411)

Some users are concerned that changes in TP topology during MOE training
may potentially cause interference with experiments when noticing
similar issues
microsoft/Megatron-DeepSpeed#151 
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. This error
occurs because flattened grad of a params group is used, where the group
contains both non-TP and TP parameters. Therefore, it is not possible to
use a single attribute to determine whether flattened grad needs to
compute the norm. In the current code logic, all params are assumed to
be non-TP, resulting in only tp_rank0 grad participating in grad_norm
computation. Other tp_rank grads have grad_norm_sum equal to 0. We
tested and found that with TP=1 and TP=4, the difference in grad_norm is
approximately twice (sqrt(4)). This aligns with the aforementioned
issue. This problem should also affect dense models.

Due to the absence of flattening params_group grad in bf16, this problem
is avoided.

We tested the loss curve on the 1.3B model. In cases where TP size
increases the inconsistent gap should be larger.

with this change 1.3B with EP=4 TP=4 &1 , fp16,mbs=1,gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)


without this change  1.3B with EP=4 TP=4&1  ,fp16,mbs=1,gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)

---------

Co-authored-by: Conglong Li <[email protected]>
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
microsoft#5411)

Some users are concerned that changes in TP topology during MOE training
may potentially cause interference with experiments when noticing
similar issues
microsoft/Megatron-DeepSpeed#151 
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. This error
occurs because flattened grad of a params group is used, where the group
contains both non-TP and TP parameters. Therefore, it is not possible to
use a single attribute to determine whether flattened grad needs to
compute the norm. In the current code logic, all params are assumed to
be non-TP, resulting in only tp_rank0 grad participating in grad_norm
computation. Other tp_rank grads have grad_norm_sum equal to 0. We
tested and found that with TP=1 and TP=4, the difference in grad_norm is
approximately twice (sqrt(4)). This aligns with the aforementioned
issue. This problem should also affect dense models.

Due to the absence of flattening params_group grad in bf16, this problem
is avoided.

We tested the loss curve on the 1.3B model. In cases where TP size
increases the inconsistent gap should be larger.

with this change 1.3B with EP=4 TP=4 &1 , fp16,mbs=1,gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)


without this change  1.3B with EP=4 TP=4&1  ,fp16,mbs=1,gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)

---------

Co-authored-by: Conglong Li <[email protected]>
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
microsoft#5411)

Some users are concerned that changes in TP topology during MOE training
may potentially cause interference with experiments when noticing
similar issues
microsoft/Megatron-DeepSpeed#151 
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. This error
occurs because flattened grad of a params group is used, where the group
contains both non-TP and TP parameters. Therefore, it is not possible to
use a single attribute to determine whether flattened grad needs to
compute the norm. In the current code logic, all params are assumed to
be non-TP, resulting in only tp_rank0 grad participating in grad_norm
computation. Other tp_rank grads have grad_norm_sum equal to 0. We
tested and found that with TP=1 and TP=4, the difference in grad_norm is
approximately twice (sqrt(4)). This aligns with the aforementioned
issue. This problem should also affect dense models.

Due to the absence of flattening params_group grad in bf16, this problem
is avoided.

We tested the loss curve on the 1.3B model. In cases where TP size
increases the inconsistent gap should be larger.

with this change 1.3B with EP=4 TP=4 &1 , fp16,mbs=1,gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)


without this change  1.3B with EP=4 TP=4&1  ,fp16,mbs=1,gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)

---------

Co-authored-by: Conglong Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants