Try to add Qwen-moe into mixtral_moe.py #117
Hi, sorry to bother you again. Could you tell me why mixtral-moe only accepts the Llama or Mistral structure? Why are other models inappropriate?
No worries, happy to help! The reason the script needs a Llama or Mistral model is that it's written to take advantage of the Mixtral architecture. Because Mixtral is essentially just Mistral with multiple MLP sections and a gate, the tensors from a Mistral model can be used without any training. (Llama works as well because they're almost exactly the same architecture.) It's definitely possible to combine other architectures in a similar fashion, but the result won't be compatible with the Mixtral architecture.

There are two basic ways to make it work. You can get creative with how you use the weights of your models, throwing some out, and doing a bunch of training afterwards to rehabilitate them in the new architecture (CausalLM is a success story of this approach). Mergekit can't really support this method, as there's no easy way to automatically map the weights of an arbitrary language model architecture onto another - it really needs a human to decide that correspondence.

The other approach is to not use the Mixtral architecture, and instead write your own custom code to inference the resulting model. Maxime Labonne's Phixtral models are examples of this approach. Similarly, this can't really be automated.

I can look at integrating new architectures as they are implemented - for example, now that Phixtral is getting some traction I'm considering extending the script to also be able to output Phixtral models. But the actual inference code I can't really help with - I'm only one person, and if I start writing custom MoE architectures for every type of model out there I'd never have time to do anything else. :)
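To make the weight reuse concrete, here is a rough sketch of copying a Mistral model's MLP weights into the expert slots of a freshly initialized Mixtral model. It assumes the Hugging Face `transformers` module layout (`mlp.gate_proj/up_proj/down_proj` on the Mistral side, `block_sparse_moe.experts[i].w1/w3/w2` on the Mixtral side) and uses example model IDs; it's an illustration of the idea, not mergekit's actual implementation.

```python
import torch
from transformers import MistralForCausalLM, MixtralConfig, MixtralForCausalLM

# Donor dense model whose MLP weights will populate every expert slot.
donor = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Randomly initialized Mixtral shell with matching hidden/intermediate sizes.
config = MixtralConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1", num_local_experts=2)
moe = MixtralForCausalLM(config)

with torch.no_grad():
    for src_layer, dst_layer in zip(donor.model.layers, moe.model.layers):
        for expert in dst_layer.block_sparse_moe.experts:
            # Mixtral's w1/w3/w2 play the roles of Mistral's gate_proj/up_proj/down_proj.
            expert.w1.weight.copy_(src_layer.mlp.gate_proj.weight)
            expert.w3.weight.copy_(src_layer.mlp.up_proj.weight)
            expert.w2.weight.copy_(src_layer.mlp.down_proj.weight)
        # Attention, norms, and embeddings would also need to be copied over,
        # and block_sparse_moe.gate still has to be initialized by some strategy.
```

This is exactly why an arbitrary pair of architectures can't be mapped automatically: the tensor names, shapes, and roles only line up this cleanly because Mixtral was designed as "Mistral plus experts".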
Oh, you are truly amazing! Your answer has been a great help to me, and I feel I now have a deeper understanding of mergekit and MoE. If you expand to the Phixtral architecture, I believe it would require some code specific to the Phixtral model's features (which I would also need for Qwen-moe). Currently, I have made some very simple modifications to mixtral_moe.py, but it doesn't give me a Mixtral-MoE architecture, probably because my changes are too simplistic. I will think further about how to incorporate Qwen. Thank you for your response, and I'm looking forward to the Phixtral expansion!
Hey bro, good morning! I have an idea now: a dedicated Qwen-moe.py file may be necessary, just as the Qwen model has its own Qwen.py file to help load the pretrained model correctly. Do you think my idea is right?
Expands the `mergekit-moe` script to support two new output architectures, Deepseek MoE and Qwen 2 MoE.

Both architectures include support for "shared" experts. Currently the script supports adding a single shared expert. The Deepseek architecture uses the shared experts ungated and unweighted, so you probably want to set the new `residual_scale` option on the shared expert to a relatively low value (think 0.1ish) to keep the model from being completely overcooked. Qwen 2 MoE has a gate parameter associated with the shared expert, so this is less necessary, but still advisable.

Deepseek MoE supports either Llama or Mistral based models as inputs. Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.

Addresses #117, #244, and #134.
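To make the shared-expert remark concrete, here is a simplified sketch of how the shared expert's output joins the routed experts' output in the two target architectures, as I understand their public modeling code. The function names are mine and this is not mergekit's implementation; the point is just that Deepseek MoE adds the shared expert raw, while Qwen 2 MoE gates it.

```python
import torch


def deepseek_style_combine(routed_out: torch.Tensor, shared_out: torch.Tensor) -> torch.Tensor:
    # Deepseek MoE adds the shared expert's output ungated and unweighted, which is
    # why scaling it down (residual_scale around 0.1) helps when the "shared expert"
    # is really just another donor model that wasn't trained for this role.
    return routed_out + shared_out


def qwen2_moe_style_combine(
    routed_out: torch.Tensor, shared_out: torch.Tensor, shared_gate_logit: torch.Tensor
) -> torch.Tensor:
    # Qwen 2 MoE passes the shared expert's output through a learned sigmoid gate,
    # so its contribution is already modulated and heavy downscaling is less critical.
    return routed_out + torch.sigmoid(shared_gate_logit) * shared_out
```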
Hi,
I am trying to add Qwen-moe into mixtral_moe.py, and I have made some modifications. But now I am running into some problems there.
I think it is wrong, because `auto_map` should not appear in `MixtralForCausalLM`. When I delete it, the model outputs NaN.
Do you know the reason?
I am looking forward to your reply.