Add torch compile for mixtral #30793
Conversation
fix style
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
# the `top_x` tensor here. this will give `skipping cudagraphs due to index put with accumulate`
# in compile
# final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))

# still suffers from `skipping cudagraphs due to ['incompatible ops']`
final_hidden_states[top_x] += current_hidden_states.to(hidden_states.dtype)
I am kind of stuck on this: it gives cudagraph-skipped warnings no matter which equivalent form I use, so for now cudagraphs can only be applied partially because of this. I have tried the following forms (a standalone sketch of all three is below):

- `final_hidden_states.index_add_`: gives `skipping cudagraphs due to index put with accumulate`
- `final_hidden_states[top_x] += ...`: gives `skipping cudagraphs due to ['incompatible ops']`
- `final_hidden_states.scatter_add_...`: disables fullgraph tracing because of data-dependent ops on `top_x`

I think the root cause is still the dynamic nature of MoE, where different experts compute different sets of tokens, and it seems we cannot avoid index-put unless every expert does a full forward over all tokens. @ArthurZucker @gante do you have any thoughts or suggestions on this?
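For reference, a standalone sketch (toy shapes, not the PR's actual tensors) of the three accumulation forms listed above; each hits a different compile/cudagraphs limitation when `top_x` is data-dependent:

```python
import torch

# Toy shapes; the real tensors come from the MoE forward in the diff above.
hidden_dim = 8
final_hidden_states = torch.zeros(16, hidden_dim)              # (num_tokens, hidden_dim)
top_x = torch.tensor([0, 3, 5])                                 # data-dependent token indices
current_hidden_states = torch.randn(top_x.numel(), hidden_dim)

# 1) index_add_ -> reported as `skipping cudagraphs due to index put with accumulate`
final_hidden_states.index_add_(0, top_x, current_hidden_states)

# 2) advanced-index accumulate -> reported as `skipping cudagraphs due to ['incompatible ops']`
final_hidden_states[top_x] += current_hidden_states

# 3) scatter_add_ -> needs the index broadcast to hidden_dim, and the
#    data-dependent size of `top_x` breaks fullgraph tracing
final_hidden_states.scatter_add_(
    0, top_x[:, None].expand(-1, hidden_dim), current_hidden_states
)
```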
I think this is expected, pretty much yes!
If we use a megablock-like implementation (with sparse topology and matrix representation), as was done in JetMoE, we might be able to get over this, but I'm not sure we can go further with the current version!
Yes, the root cause is that the `top_x` we use here becomes unbacked free symbols in torch.compile and is data-dependent because of `torch.where`. This causes cudagraphs to be skipped, but we still benefit from partial cudagraphs even if we do not rewrite it into sparse forms.
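A minimal illustration, with made-up routing, of why `top_x` ends up with an unbacked size: `torch.where` returns index tensors whose length depends on the values in the expert mask, which `torch.compile` cannot know at trace time:

```python
import torch

# Two tokens, top_k=2, four experts: mirrors the expert_mask construction in the MoE block.
selected_experts = torch.tensor([[0, 1], [2, 1]])
expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=4).permute(2, 1, 0)

for expert_idx in range(4):
    # The number of hits varies per expert and per input, so the sizes of
    # `idx` and `top_x` are data-dependent (unbacked symbols under compile).
    idx, top_x = torch.where(expert_mask[expert_idx])
    print(expert_idx, top_x.numel())
```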
Unfortunately, torch.compile currently produces wrong results when setting fullgraph=True. I believe it has something to do with the `torch.where` used here (when I ignore the expert mask and compute the whole token set for every expert, the results align with the eager forward); the traced FX graph is not correct. I think if we want to support torch.compile in fullgraph mode, we have to rewrite the MoE layer in a whole different way, maybe computing experts for tokens rather than tokens for experts @ArthurZucker
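For context, a rough sketch of the kind of eager-vs-compiled comparison described above; the tiny config here is made up for illustration and is not the test used in this PR:

```python
import torch
from transformers import MixtralConfig, MixtralForCausalLM

# Hypothetical small config, just to make the comparison cheap to run.
config = MixtralConfig(
    vocab_size=128, hidden_size=64, intermediate_size=128, num_hidden_layers=2,
    num_attention_heads=4, num_key_value_heads=2, num_local_experts=4, num_experts_per_tok=2,
)
model = MixtralForCausalLM(config).eval()
input_ids = torch.randint(0, config.vocab_size, (1, 8))

compiled_model = torch.compile(model, fullgraph=True)
with torch.no_grad():
    eager_logits = model(input_ids).logits
    compiled_logits = compiled_model(input_ids).logits

# The comment above reports that these can diverge when fullgraph=True.
torch.testing.assert_close(compiled_logits, eager_logits, rtol=1e-3, atol=1e-3)
```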
class MixtralBlockTop2MLP(nn.Module):
    def __init__(self, config: MixtralConfig):
        super().__init__()
        self.num_experts = config.num_local_experts
        self.ffn_dim = config.intermediate_size
        self.hidden_dim = config.hidden_size

        self.w1 = nn.Parameter(torch.empty(self.num_experts, self.ffn_dim, self.hidden_dim))
        self.w2 = nn.Parameter(torch.empty(self.num_experts, self.hidden_dim, self.ffn_dim))
        self.w3 = nn.Parameter(torch.empty(self.num_experts, self.ffn_dim, self.hidden_dim))

        self.act_fn = ACT2FN[config.hidden_act]

    def forward(
        self, hidden_states: torch.Tensor, selected_experts: torch.Tensor, routing_weights: torch.Tensor
    ) -> torch.Tensor:
        """Run the top-k expert MLPs for every token by gathering the expert weights per token.

        Args:
            hidden_states (torch.Tensor): (batch_size * token_num, hidden_dim)
            selected_experts (torch.Tensor): (batch_size * token_num, top_k)
            routing_weights (torch.Tensor): (batch_size * token_num, top_k)

        Returns:
            torch.Tensor: (batch_size * token_num, hidden_dim)
        """
        ts, tk = hidden_states.size(0), selected_experts.size(-1)

        w1 = self.w1[selected_experts]  # (batch_size * token_num, top_k, ffn_dim, hidden_dim)
        w2 = self.w2[selected_experts]  # (batch_size * token_num, top_k, hidden_dim, ffn_dim)
        w3 = self.w3[selected_experts]  # (batch_size * token_num, top_k, ffn_dim, hidden_dim)

        x1 = torch.matmul(w1, hidden_states[:, None, :, None])
        x3 = torch.matmul(w3, hidden_states[:, None, :, None])
        x1 = self.act_fn(x1)
        final_hidden_states = torch.matmul(w2, x1 * x3).reshape(ts, tk, self.hidden_dim)
        final_hidden_states = final_hidden_states * routing_weights[:, :, None]
        final_hidden_states = final_hidden_states.sum(dim=1)
        return final_hidden_states


class MixtralMoeBlock(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.hidden_dim = config.hidden_size
        self.ffn_dim = config.intermediate_size
        self.num_experts = config.num_local_experts
        self.top_k = config.num_experts_per_tok

        # gating
        self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)
        self.experts = MixtralBlockTop2MLP(config)
        # Jitter parameters
        self.jitter_noise = config.router_jitter_noise

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        if self.training and self.jitter_noise > 0:
            hidden_states *= torch.empty_like(hidden_states).uniform_(1.0 - self.jitter_noise, 1.0 + self.jitter_noise)
        hidden_states = hidden_states.view(-1, hidden_dim)
        # router_logits: (batch * sequence_length, n_experts)
        router_logits = self.gate(hidden_states)

        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # we cast back to the input dtype
        routing_weights = routing_weights.to(hidden_states.dtype)
        final_hidden_states = self.experts(hidden_states, selected_experts, routing_weights)
        final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
        return final_hidden_states, router_logits
This gathers experts for tokens, and it actually works with torch.compile with fullgraph and cudagraphs support. I think it works best in the decoding phase, where the batch size is small, but it uses more memory because we need to gather expert weights for every token. It also requires changes to the model weight structure at load time (from expert-wise scattered MLPs to a centralized MLP); a sketch of that repacking is below.
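To illustrate the weight-structure change mentioned above, a hedged sketch (the helper name and the per-expert attribute layout are assumptions, not this PR's loading code) of repacking per-expert `nn.Linear` weights into the stacked parameters of `MixtralBlockTop2MLP`:

```python
import torch

@torch.no_grad()
def repack_expert_weights(per_expert_mlps, fused_mlp):
    # per_expert_mlps: iterable of expert MLPs, each exposing w1/w2/w3 nn.Linear layers
    # with weights of shape (ffn_dim, hidden_dim), (hidden_dim, ffn_dim), (ffn_dim, hidden_dim).
    # fused_mlp: a MixtralBlockTop2MLP whose w1/w2/w3 are (num_experts, ...) parameters.
    fused_mlp.w1.copy_(torch.stack([e.w1.weight for e in per_expert_mlps]))
    fused_mlp.w2.copy_(torch.stack([e.w2.weight for e in per_expert_mlps]))
    fused_mlp.w3.copy_(torch.stack([e.w3.weight for e in per_expert_mlps]))
```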
If we are aiming to support fast generation, then I think it's better to have this than the current version, because we will definitely gain more speedup, especially when decoding @ArthurZucker
Yeah GPTFast has similar changes! I think it's super interesting but too breaking as you mention
Thanks a lot for working on this. Seems like it would be too breaking to merge as is 😢
But I'll ping you if we have a new MoE model to use this as default...
A related PR / implementation is #31173! Does this version support compile?
Yes, it does, with fullgraph and cudagraphs enabled
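For reference, a minimal sketch of the flags in question (here `model` is just a placeholder for a Mixtral model built from this branch); cudagraphs come from inductor's `reduce-overhead` mode:

```python
import torch

# fullgraph=True asserts there are no graph breaks;
# mode="reduce-overhead" lets inductor apply cudagraphs where it can.
compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
```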
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This PR is a work in progress that tries to add torch.compile support for Mixtral. It currently also contains changes from #30642 because there is some common ground shared between these two models, and there are several issues regarding Mixtral:
I believe it's inevitable because `MixtralSparseMoeBlock` uses `torch.where` to extract the tokens that each expert cares about, and the number and indices of tokens that each expert attends to are variable. Even if we force a static shape (which means we zero out the non-selected tokens for each expert), we add extra computation cost because the zeroed-out values still participate in the computation and each expert has to run over all tokens, which defeats the whole computation-saving point of MoE. A minimal sketch of that static-shape alternative follows below.
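To make the trade-off concrete, here is a minimal sketch with toy modules (not this PR's code) of the static-shape alternative described above, where every expert runs over all tokens and non-routed tokens are zeroed through the routing weights:

```python
import torch
import torch.nn as nn

num_tokens, hidden_dim, num_experts, top_k = 16, 8, 4, 2
hidden_states = torch.randn(num_tokens, hidden_dim)
experts = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts))  # toy experts

router_logits = torch.randn(num_tokens, num_experts)
routing_weights = router_logits.softmax(dim=-1)
topk_weights, topk_idx = routing_weights.topk(top_k, dim=-1)

# Dense routing weights: zero for experts a token was not routed to, so all shapes
# stay static and no data-dependent indexing (torch.where / index_add_) is needed.
dense_weights = torch.zeros_like(routing_weights).scatter(-1, topk_idx, topk_weights)

final_hidden_states = torch.zeros_like(hidden_states)
for expert_idx, expert in enumerate(experts):
    # Every expert processes every token; the zeroed weights simply discard the result,
    # which is exactly the wasted computation described above.
    final_hidden_states += dense_weights[:, expert_idx, None] * expert(hidden_states)
```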