Expands the `mergekit-moe` script to support two new output architectures: Deepseek MoE and Qwen 2 MoE.

Both architectures include support for "shared" experts. Currently the script supports adding a single shared expert. The Deepseek architecture uses the shared experts ungated and unweighted, so you probably want to set the new `residual_scale` option on the shared expert to a relatively low value (think 0.1-ish) to keep the model from being completely overcooked. Qwen 2 MoE has a gate parameter associated with the shared expert, so this is less necessary, but still advisable.

Deepseek MoE supports either Llama or Mistral based models as inputs. Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.
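As a rough sketch of how this might be used (the `architecture`, `shared_experts`, and `residual_scale` keys reflect the new options described above layered onto the existing `mergekit-moe` config format; model paths and prompts are placeholders), a Deepseek MoE output with a down-weighted shared expert could look something like:

```yaml
base_model: mistralai/Mistral-7B-v0.1     # placeholder; any Llama or Mistral based model
gate_mode: hidden
dtype: bfloat16
architecture: deepseek                    # assumed key for selecting the new output architecture
experts:
  - source_model: path/to/expert-1        # placeholder expert
    positive_prompts:
      - "write a python function"
  - source_model: path/to/expert-2        # placeholder expert
    positive_prompts:
      - "summarize this article"
shared_experts:
  - source_model: mistralai/Mistral-7B-v0.1   # single shared expert (ungated in Deepseek MoE)
    residual_scale: 0.1                       # keep the ungated shared expert from overwhelming the routed experts
```

A Qwen 2 MoE config would be analogous, with the shared expert's learned gate making the low `residual_scale` less critical.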
Addresses #117, #244, and #134.