[SwitchTransformer] Significant performance improvement on MoE blocks #31173
Conversation
Thanks a lot! Looks very good! Can you make sure the styling checks pass by running `make fixup && make fix-copies`?
Thanks @younesbelkada! How do I do this correctly? Would you please let me know?
Hi @ranggihwang
Shouldn't the styling check be done for the …
since gpt_san_japanese uses blocks that are copied from switch transformers, running …
@younesbelkada @ranggihwang gpt san has been deprecated, so we don't really want these changes to be propagated. I've just merged in #31153 which removes the …
Perfect, thanks for the heads up @amyeroberts!
Force-pushed from 52b6c57 to 4965da6
@amyeroberts @younesbelkada I've just rebased it onto main and committed. Would you please check if it is correct?
Thanks @ranggihwang! Now the styling checks are failing, can you run …
Okay, now …
Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I'll review this as I reviewed the previous PR; I want to make sure the suggestions are all applied!
Could you apply the suggestion I made in the previous PR
```python
router_mask = router_mask.bool()
idx_mask = router_mask.transpose(1, 2)  # Batch * experts * tokens
idx_mask = torch.cat(torch.split(idx_mask, 1, dim=0), dim=2)  # 1 * experts * (batch * tokens)
idx_mask = idx_mask.sum(dim=2)
idx_mask = idx_mask.squeeze()  # length: number of experts / value: number of tokens
idx_mask = torch.nonzero(idx_mask, as_tuple=True)[
    0
].tolist()  # length: number of "activated" experts / value: index
```
Suggested change:

```diff
-router_mask = router_mask.bool()
-idx_mask = router_mask.transpose(1, 2)  # Batch * experts * tokens
-idx_mask = torch.cat(torch.split(idx_mask, 1, dim=0), dim=2)  # 1 * experts * (batch * tokens)
-idx_mask = idx_mask.sum(dim=2)
-idx_mask = idx_mask.squeeze()  # length: number of experts / value: number of tokens
-idx_mask = torch.nonzero(idx_mask, as_tuple=True)[
-    0
-].tolist()  # length: number of "activated" experts / value: index
+idx_mask = router_mask.reshape(batch*seq_len, num_experts).transpose(0,1).sum(dim=1)
+idx_mask = torch.nonzero(idx_mask, as_tuple=True)[0].tolist()
```
and the comment about shapes! 🤗
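For readers following along, here is a minimal, self-contained sketch (with made-up tensor sizes; not the actual model code) checking that the suggested two-liner produces the same list of activated experts as the original multi-step version:

```python
import torch

# Hypothetical sizes, purely for illustration.
batch, seq_len, num_experts = 2, 4, 8

# One-hot routing mask as a top-1 router would produce: (batch, seq_len, num_experts).
expert_index = torch.randint(0, num_experts, (batch, seq_len))
router_mask = torch.nn.functional.one_hot(expert_index, num_classes=num_experts)

# Original multi-step computation.
mask = router_mask.bool()
idx = mask.transpose(1, 2)                          # batch * experts * tokens
idx = torch.cat(torch.split(idx, 1, dim=0), dim=2)  # 1 * experts * (batch * tokens)
idx = idx.sum(dim=2).squeeze()                      # tokens routed to each expert
original = torch.nonzero(idx, as_tuple=True)[0].tolist()

# Suggested computation: flatten the tokens, count hits per expert.
counts = router_mask.reshape(batch * seq_len, num_experts).transpose(0, 1).sum(dim=1)
suggested = torch.nonzero(counts, as_tuple=True)[0].tolist()

assert original == suggested  # both are the indices of experts that received tokens
```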
`batch_size`, `seq_len`, and `num_experts` are not defined in the function,
so I've defined them from the `router_mask` and applied your suggestions.
Thank you @ArthurZucker!
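For context, a short sketch of what that could look like; the helper name and its exact placement are assumptions, not the final code in the PR:

```python
import torch

def activated_expert_indices(router_mask: torch.Tensor) -> list:
    """Return the indices of experts that received at least one token."""
    # router_mask: (batch_size, seq_len, num_experts) one-hot routing mask,
    # so the missing sizes can be read directly off the tensor itself.
    batch_size, seq_len, num_experts = router_mask.shape
    counts = router_mask.reshape(batch_size * seq_len, num_experts).transpose(0, 1).sum(dim=1)
    return torch.nonzero(counts, as_tuple=True)[0].tolist()
```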
thanks!
@ArthurZucker @younesbelkada
Changed the title from "[SwitchTransformer] Significant performance improvement on MoE blocks of SwitchTransformer" to "[SwitchTransformer] Significant performance improvement on MoE blocks".
Still LGTM! Let's wait for @ArthurZucker's final review!
Thanks a lot! 🤗
Could this be propagated to the Qwen code @ranggihwang? I know that they have some variants with lots of experts!
@ArthurZucker I think it can be adopted for many MoE models in HuggingFace, not only Qwen-MoE but also NLLB-MoE, Mixtral, etc.
Awesome! Then if you are interested, feel free to open a PR and ping me! 🤗
…ks (huggingface#31173)
* SwitchTransformer MoE layer performance improvement
* make fixup
* comments about shapes
* make fixup
What does this PR do?
This is an edited version of the previously closed PR (#30490).
This PR includes a performant implementation of `SwitchTransformersSparseMLP` in the Google SwitchTransformer.

In the current implementation of the SwitchTransformer, the MoE block iterates over all possible experts, including the inactive ones. This results in serious performance degradation of the SwitchTransformer. As shown in the figure, the current implementation visits inactive experts, unnecessarily increasing latency. This issue can be particularly severe in models with a larger number of experts, as even more experts are visited needlessly.

However, in my custom implementation of `SwitchTransformersSparseMLP`, it only accesses and computes the active experts.
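To make the description concrete, below is a minimal sketch of the idea; the class and variable names are hypothetical and this is not the exact `SwitchTransformersSparseMLP` code. The point is that the expert loop runs only over experts that actually received tokens, instead of over all of them:

```python
import torch
import torch.nn as nn

class SparseMLPSketch(nn.Module):
    """Illustrative top-1 MoE block that skips inactive experts."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # Stand-in experts; the real model uses full FFN experts.
        self.experts = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_experts))

    def forward(self, hidden_states: torch.Tensor, router_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); router_mask: (batch, seq_len, num_experts), one-hot.
        batch_size, seq_len, num_experts = router_mask.shape
        next_states = hidden_states.clone()

        # Indices of experts that received at least one token in this batch.
        counts = router_mask.reshape(batch_size * seq_len, num_experts).sum(dim=0)
        active_experts = torch.nonzero(counts, as_tuple=True)[0].tolist()

        # Only the activated experts are ever called.
        for expert_idx in active_experts:
            token_mask = router_mask[..., expert_idx].bool()  # (batch, seq_len)
            next_states[token_mask] = self.experts[expert_idx](hidden_states[token_mask])
        return next_states
```

With many experts and top-1 routing, most experts receive no tokens for a given batch, so skipping them removes a large amount of wasted computation, which is where the latency improvement comes from.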
Advantages
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@ArthurZucker and @younesbelkada