
Try to add Qwen-moe into mixtral_moe.py #117

Open
ZhangEnmao opened this issue Jan 17, 2024 · 4 comments
@ZhangEnmao

Hi,
I tried to add Qwen-MoE into mixtral_moe.py and made some modifications, but now I'm running into a problem.
I think it is wrong, because auto_map should not appear in "MixtralForCausalLM". When I delete it, the model outputs NaN.
Do you know the reason?
I'm looking forward to your reply.

@ZhangEnmao
Author

Hi, sorry to bother you again. Could you tell me why mixtral_moe.py only accepts the Llama or Mistral architecture? Why are other models inappropriate?

@cg123
Collaborator

cg123 commented Jan 17, 2024

No worries, happy to help! The script needs a Llama or Mistral model because it's written to take advantage of the Mixtral architecture. Since Mixtral is essentially just Mistral with multiple MLP sections and a gate, the tensors from a Mistral model can be used without any training. (Llama works as well because the two architectures are almost exactly the same.)

It's definitely possible to combine other architectures in a similar fashion, but the result won't be compatible with the Mixtral architecture. There are two basic ways to make it work. One is to get creative with how you use the weights of your models, throwing some out and doing a bunch of training afterwards to rehabilitate them in the new architecture (CausalLM is a success story of this approach). Mergekit can't really support this method, as there's no easy way to automatically map the weights of an arbitrary language model architecture onto another - a human really needs to decide that mapping.

The other approach is to not use the Mixtral architecture, and instead write your own custom code to inference the resulting model. Maxime Labonne's Phixtral models are examples of this approach. Similarly, this can't really be automated. I can look at integrating new architectures as they are implemented - for example, now that Phixtral is getting some traction I'm considering extending the script to also be able to output Phixtral models. But the actual inference code I can't really help with - I'm only one person, and if I start writing custom MoE architectures for every type of model out there I'd never have time to do anything else. :)
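
For reference, here is a minimal sketch of what a mergekit-moe config for the supported case looks like: Mistral- or Llama-based inputs assembled into a Mixtral-architecture output. The model names are placeholders, and the exact schema should be checked against the current mergekit-moe documentation.

```yaml
# Sketch of a mergekit-moe config that produces a Mixtral-architecture model.
# All source models must be Mistral- or Llama-based; model names are placeholders.
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden        # "hidden", "cheap_embed", or "random"
dtype: bfloat16
experts:
  - source_model: example-org/mistral-7b-code-finetune
    positive_prompts:
      - "Write a Python function"
  - source_model: example-org/mistral-7b-chat-finetune
    positive_prompts:
      - "Tell me a story"
```

The output model would then be built with something like `mergekit-moe config.yaml ./output-model` (check `mergekit-moe --help` for the exact invocation).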

@ZhangEnmao
Author

Oh, you are truly amazing! Your answer has been a great help, and I feel I have gained a deeper understanding of mergekit and MoE. If you expand to the Phixtral architecture, I believe it will require some special code tied to Phixtral's model features (which I would also need for Qwen-MoE). Currently, I have made some very simple modifications to mixtral_moe.py, but they don't give me a Mixtral-MoE architecture, probably because they are too simplistic. I will think further about how to incorporate Qwen. Thank you for your response, and I'm looking forward to the Phixtral expansion!

@ZhangEnmao
Author

Hey, bro. Good morning! I now have an idea: a Qwen-moe.py file may be necessary, just as the Qwen model has its own Qwen.py file to help load the pretrained model correctly. Do you think my idea is right?

@cg123 cg123 mentioned this issue Apr 12, 2024
cg123 added a commit that referenced this issue Apr 16, 2024
Expands the script `mergekit-moe` to support two new output
architectures, Deepseek MoE and Qwen 2 MoE.

Both architectures include support for "shared" experts. Currently the
script supports adding a single shared expert. The Deepseek architecture
uses the shared experts ungated and unweighted, so you probably want to
set the new `residual_scale` option on the shared expert to a relatively
low value (think 0.1ish) to keep the model from being completely
overcooked. Qwen 2 MoE has a gate parameter associated with the shared
expert so this is less necessary, but still advisable.

Deepseek MoE supports either Llama or Mistral based models as inputs.
Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.

Addresses #117, #244, and #134.
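
As a rough illustration of these options, a config targeting the Qwen 2 MoE output with one shared expert might look like the sketch below. The `residual_scale` option is the one named in the commit above; the `architecture` and `shared_experts` field names and the model names are assumptions to be verified against the current mergekit-moe documentation.

```yaml
# Sketch of a mergekit-moe config for the Qwen 2 MoE output architecture
# with a single shared expert. Field names other than residual_scale are
# assumptions; model names are placeholders.
base_model: Qwen/Qwen1.5-7B           # inputs may be Llama, Mistral, or Qwen2 based
architecture: qwen                    # assumed selector for the Qwen 2 MoE output
gate_mode: hidden
dtype: bfloat16
experts:
  - source_model: example-org/qwen-math-finetune
    positive_prompts:
      - "Solve this equation"
  - source_model: example-org/qwen-code-finetune
    positive_prompts:
      - "Write a Python function"
shared_experts:
  - source_model: Qwen/Qwen1.5-7B
    residual_scale: 0.1               # keeps the always-on shared expert from dominating
```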