
[SwitchTransformer] Significant performance improvement on MoE blocks #31173

Merged: 4 commits, Jun 6, 2024

Conversation

ranggihwang
Contributor

What does this PR do?

This is an edited version of the previously closed PR (#30490)

This PR includes a performant implementation of SwitchTransformersSparseMLP in the Google SwitchTransformer.
In the current implementation of the SwitchTransformer, the MoE block iterates over all possible experts, including the inactive ones:

for idx, expert in enumerate(self.experts.values()):
    token_indices = router_mask[:, :, idx].bool()
    next_states[token_indices] = expert(hidden_states[token_indices]).to(next_states.dtype)

This results in serious performance degradation of the SwitchTransformer.

[Screenshot 2024-04-26 2:16 AM] As shown in this figure, the current implementation of the SwitchTransformer iterates over inactive experts, unnecessarily increasing latency. [Screenshot 2024-04-26 2:17 AM] This issue can be particularly severe in models with a larger number of experts, since even more inactive experts are visited needlessly.

My custom implementation of SwitchTransformersSparseMLP, by contrast, accesses and computes only the active experts.
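
For illustration, here is a minimal, self-contained sketch of the idea, not the exact code in this PR (the name sparse_mlp_forward and the experts argument are assumptions): first find the experts that actually received tokens, then run only those.

import torch

def sparse_mlp_forward(hidden_states, router_mask, experts):
    # Illustrative sketch; names and signature are assumptions, not the PR's exact code.
    # hidden_states: (batch, seq_len, d_model)
    # router_mask: one-hot routing mask of shape (batch, seq_len, num_experts)
    # experts: sequence of expert modules, indexed by expert id
    next_states = hidden_states.clone()
    router_mask = router_mask.bool()
    batch_size, seq_len, num_experts = router_mask.shape

    # Indices of experts that received at least one token
    token_counts = router_mask.reshape(batch_size * seq_len, num_experts).sum(dim=0)
    active_experts = torch.nonzero(token_counts, as_tuple=True)[0].tolist()

    # Dispatch tokens only to the active experts; inactive experts are never touched
    for idx in active_experts:
        token_indices = router_mask[:, :, idx]
        next_states[token_indices] = experts[idx](hidden_states[token_indices]).to(next_states.dtype)
    return next_states

The per-expert dispatch inside the loop matches the original code; the only change is that the loop now runs over the active experts instead of all of them.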

Advantages

  • This can significantly reduce the latency of the SwitchTransformer and make the model more accessible to a broader range of users.
  • This change achieves greater latency reductions when expert parameters are offloaded to the CPU or SSD.
  • This change addresses the problem of latency increasing in proportion to the total number of experts.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker and @younesbelkada

Contributor

@younesbelkada younesbelkada left a comment


Thanks a lot! Looks very good! Can you make sure the styling checks pass: make fixup && make fix-copies

@ranggihwang
Contributor Author

Thanks @younesbelkada
I ran make fixup && make fix-copies before in #30490 (review), but I was asked to revert it in the end.

How do I do this correctly? Would you please let me know?

@younesbelkada
Contributor

Hi @ranggihwang
Thanks! Hmm, I think there was a misunderstanding on my side at that time. If you could run the styling checks and push the results here (it should only change two files: switch_transformers and the gpt_san_japanese file), that would be great!

@ranggihwang
Contributor Author

Shouldn't the styling check only affect src/transformers/models/switch_transformers/modeling_switch_transformers.py?
I haven't changed anything except that file, and there doesn't seem to be any file named gpt_san_japanese in my repo.

@younesbelkada
Contributor

Since gpt_san_japanese uses blocks that are copied from switch_transformers, running make fix-copies will propagate the changes you introduced into that file as well; see: https://app.circleci.com/pipelines/github/huggingface/transformers/94615/workflows/e1e9d110-614a-411a-a0f5-b7d4146e4db8/jobs/1241937

@amyeroberts
Collaborator

amyeroberts commented Jun 3, 2024

@younesbelkada @ranggihwang gpt san has been deprecated, so we don't really want these changes to be propagated. I've just merged #31153, which removes the # Copied from headers for this model. Rebasing on main will include this and remove the need to run make fix-copies here. Thanks!

@younesbelkada
Contributor

Perfect, thanks for the heads up @amyeroberts!
@ranggihwang feel free to proceed as suggested by Amy 🙏

@ranggihwang
Contributor Author

@amyeroberts @younesbelkada
Thank you for your advice, Amy and Younes.

I've just rebased it onto main and committed. Would you please check whether it is correct?

@younesbelkada
Contributor

Thanks @ranggihwang! The styling checks are now failing; can you run make fixup and commit the changes?

@ranggihwang
Contributor Author

Okay, now make fixup is done!

Contributor

@younesbelkada younesbelkada left a comment


Thanks !

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@younesbelkada younesbelkada requested a review from amyeroberts June 3, 2024 10:22
@ArthurZucker
Collaborator

I'll review this since I reviewed the previous PR; I want to make sure the suggestions are all applied!

Collaborator

@ArthurZucker ArthurZucker left a comment


Could you apply the suggestion I made in the previous PR?

Comment on lines 298 to 305
router_mask = router_mask.bool()
idx_mask = router_mask.transpose(1, 2)  # Batch * experts * tokens
idx_mask = torch.cat(torch.split(idx_mask, 1, dim=0), dim=2)  # 1 * experts * (batch * tokens)
idx_mask = idx_mask.sum(dim=2)
idx_mask = idx_mask.squeeze()  # length: number of experts / value: number of tokens
idx_mask = torch.nonzero(idx_mask, as_tuple=True)[0].tolist()  # length: number of "activated" experts / value: index

Suggested change
idx_mask = router_mask.reshape(batch * seq_len, num_experts).transpose(0, 1).sum(dim=1)
idx_mask = torch.nonzero(idx_mask, as_tuple=True)[0].tolist()
  • and the comment about the shapes! 🤗

Contributor Author


The batch_size, seq_len, and num_experts are not defined in the function,
so I've derived them from router_mask and applied your suggestion.

Thank you @ArthurZucker !
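
For reference, a minimal runnable sketch of deriving those values from router_mask, in line with the suggestion above (the toy router_mask below is an assumption for illustration, not the PR's actual inputs):

import torch

# Toy one-hot router_mask: batch of 2, sequence length 3, 4 experts (illustrative values)
router_mask = torch.nn.functional.one_hot(torch.randint(0, 4, (2, 3)), num_classes=4)

batch_size, seq_len, num_experts = router_mask.shape
idx_mask = router_mask.reshape(batch_size * seq_len, num_experts).transpose(0, 1).sum(dim=1)
idx_mask = torch.nonzero(idx_mask, as_tuple=True)[0].tolist()  # indices of the activated experts
print(idx_mask)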

Collaborator


thanks!

@ranggihwang
Contributor Author

@ArthurZucker @younesbelkada
Please let me know if I need to rebase it again :)

@ranggihwang ranggihwang changed the title Significant performance improvement on MoE blocks of SwitchTransformer [SwitchTransformer] Significant performance improvement on MoE blocks of SwitchTransformer Jun 4, 2024
@ranggihwang ranggihwang changed the title [SwitchTransformer] Significant performance improvement on MoE blocks of SwitchTransformer [SwitchTransformer] Significant performance improvement on MoE blocks Jun 4, 2024
Contributor

@younesbelkada younesbelkada left a comment


Still LGTM! Let's wait for @ArthurZucker's final review!

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks a lot! 🤗

@ArthurZucker ArthurZucker merged commit 9b85e40 into huggingface:main Jun 6, 2024
21 checks passed
@ArthurZucker
Collaborator

Could this be propagated to the Qwen code, @ranggihwang? I know they have some variants with lots of experts!

@ranggihwang
Contributor Author

@ArthurZucker I think it can be adopted for many MoE models in Hugging Face, not only qwen-moe but also NLLB-MoE, Mixtral, etc.

@ArthurZucker
Collaborator

ArthurZucker commented Jun 6, 2024

Awesome! Then if you are interested, feel free to open a PR and ping me! 🤗
Some models need compile support, which might be a little bit tricky; we'll see.

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 11, 2024
…ks (huggingface#31173)

* SwitchTransformer MoE layer performance improvement

* make fixup

* comments about shapes

* make fixup
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 14, 2024
…ks (huggingface#31173)

* SwitchTransformer MoE layer performance improvement

* make fixup

* comments about shapes

* make fixup
@ArthurZucker ArthurZucker mentioned this pull request Dec 13, 2024
1 task