Try to add Qwen-moe into mixtral_moe.py #117
Hi, sorry to bother you again. Could you tell me why mixtral-moe only accepts the Llama or Mistral structure? Why are other models inappropriate?
No worries, happy to help! The reason the script needs a Llama or Mistral model is that it's written to take advantage of the Mixtral architecture. Because Mixtral is essentially just Mistral with multiple MLP sections and a gate, the tensors from a Mistral model can be used without any training. (Llama works as well because they're almost exactly the same architecture.) It's definitely possible to combine other architectures in a similar fashion, but the result won't be compatible with the Mixtral architecture.

There are two basic ways to make it work. You can get creative with how you use the weights of your models, throwing some out, and doing a bunch of training afterwards to rehabilitate them in the new architecture (CausalLM is a success story of this approach). Mergekit can't really support this method, as there's no easy way to automatically map the weights of an arbitrary language model architecture onto another - it really needs a human to decide that correspondence.

The other approach is to not use the Mixtral architecture, and instead write your own custom code to inference the resulting model. Maxime Labonne's Phixtral models are examples of this approach. Similarly, this can't really be automated.

I can look at integrating new architectures as they are implemented - for example, now that Phixtral is getting some traction I'm considering extending the script to also be able to output Phixtral models. But the actual inference code I can't really help with - I'm only one person, and if I start writing custom MoE architectures for every type of model out there I'd never have time to do anything else. :)
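To make the weight reuse concrete, here is a rough sketch of copying a Mistral model's MLP weights into the expert slots of a freshly initialized Mixtral model. It assumes the Hugging Face `transformers` module layout (`mlp.gate_proj/up_proj/down_proj` on the Mistral side, `block_sparse_moe.experts[i].w1/w3/w2` on the Mixtral side) and uses example model IDs; it's an illustration of the idea, not mergekit's actual implementation.

```python
import torch
from transformers import MistralForCausalLM, MixtralConfig, MixtralForCausalLM

# Donor dense model whose MLP weights will populate every expert slot.
donor = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Randomly initialized Mixtral shell with matching hidden/intermediate sizes.
config = MixtralConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1", num_local_experts=2)
moe = MixtralForCausalLM(config)

with torch.no_grad():
    for src_layer, dst_layer in zip(donor.model.layers, moe.model.layers):
        for expert in dst_layer.block_sparse_moe.experts:
            # Mixtral's w1/w3/w2 play the roles of Mistral's gate_proj/up_proj/down_proj.
            expert.w1.weight.copy_(src_layer.mlp.gate_proj.weight)
            expert.w3.weight.copy_(src_layer.mlp.up_proj.weight)
            expert.w2.weight.copy_(src_layer.mlp.down_proj.weight)
        # Attention, norms, and embeddings would also need to be copied over,
        # and block_sparse_moe.gate still has to be initialized by some strategy.
```

This is exactly why an arbitrary pair of architectures can't be mapped automatically: the tensor names, shapes, and roles only line up this cleanly because Mixtral was designed as "Mistral plus experts".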
Oh, you are truly amazing! Your answer has been a great help to me, and I feel I now have a deeper understanding of mergekit and MoE. If you expand to the Phixtral architecture, I believe it would require some code specific to the Phixtral model's features (which I would also need for Qwen-moe). Currently, I have made some very simple modifications to mixtral_moe.py, but it doesn't give me a Mixtral-MoE architecture, probably because my changes are too simplistic. I will think further about how to incorporate Qwen. Thank you for your response, and I'm looking forward to the Phixtral expansion!
Hey bro, good morning! I have an idea now: a dedicated Qwen-moe.py file may be necessary, just as the Qwen model has its own Qwen.py file to help load the pretrained model correctly. Do you think my idea is right?
Expands the `mergekit-moe` script to support two new output architectures, Deepseek MoE and Qwen 2 MoE.

Both architectures include support for "shared" experts. Currently the script supports adding a single shared expert. The Deepseek architecture uses the shared experts ungated and unweighted, so you probably want to set the new `residual_scale` option on the shared expert to a relatively low value (think 0.1ish) to keep the model from being completely overcooked. Qwen 2 MoE has a gate parameter associated with the shared expert, so this is less necessary, but still advisable.

Deepseek MoE supports either Llama or Mistral based models as inputs. Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.

Addresses #117, #244, and #134.
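To make the shared-expert remark concrete, here is a simplified sketch of how the shared expert's output joins the routed experts' output in the two target architectures, as I understand their public modeling code. The function names are mine and this is not mergekit's implementation; the point is just that Deepseek MoE adds the shared expert raw, while Qwen 2 MoE gates it.

```python
import torch


def deepseek_style_combine(routed_out: torch.Tensor, shared_out: torch.Tensor) -> torch.Tensor:
    # Deepseek MoE adds the shared expert's output ungated and unweighted, which is
    # why scaling it down (residual_scale around 0.1) helps when the "shared expert"
    # is really just another donor model that wasn't trained for this role.
    return routed_out + shared_out


def qwen2_moe_style_combine(
    routed_out: torch.Tensor, shared_out: torch.Tensor, shared_gate_logit: torch.Tensor
) -> torch.Tensor:
    # Qwen 2 MoE passes the shared expert's output through a learned sigmoid gate,
    # so its contribution is already modulated and heavy downscaling is less critical.
    return routed_out + torch.sigmoid(shared_gate_logit) * shared_out
```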
Hi,
I am trying to add Qwen-moe into mixtral_moe.py, and I have made some modifications. But now I am running into some problems there.
I think it is wrong, because `auto_map` should not appear in `MixtralForCausalLM`. When I delete it, the model outputs NaN.
Do you know the reason?
I am looking forward to your reply.