MoE Merge Rework (#263)
Expands the `mergekit-moe` script to support two new output
architectures: Deepseek MoE and Qwen 2 MoE.

Both architectures include support for "shared" experts. Currently the
script supports adding a single shared expert. The Deepseek architecture
uses the shared expert ungated and unweighted, so you probably want to
set the new `residual_scale` option on the shared expert to a relatively
low value (think 0.1ish) to keep the model from being completely
overcooked. Qwen 2 MoE has a gate parameter associated with the shared
expert, so this is less necessary but still advisable.
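
For intuition, here is a minimal sketch (illustrative only, not mergekit's actual code) of how an ungated shared expert contributes to a Deepseek-style MoE layer, and why scaling it down helps:

```python
import torch

def moe_layer_output(routed_output: torch.Tensor,
                     shared_expert_output: torch.Tensor,
                     residual_scale: float = 0.1) -> torch.Tensor:
    # The routed experts are already gated and weighted by the router; the
    # shared expert's output is added directly, so a small residual_scale
    # keeps it from overwhelming the routed contribution.
    return routed_output + residual_scale * shared_expert_output
```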

Deepseek MoE supports either Llama or Mistral based models as inputs.
Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.

Addresses #117, #244, and #134.
cg123 authored Apr 16, 2024
1 parent 846eb3a commit 215f767
Showing 15 changed files with 1,327 additions and 492 deletions.
11 changes: 8 additions & 3 deletions README.md
@@ -10,8 +10,9 @@ Features:
- Lazy loading of tensors for low memory use
- Interpolated gradients for parameter values (inspired by Gryphe's [BlockMerge_Gradient](https://github.com/Gryphe/BlockMerge_Gradient) script)
- Piecewise assembly of language models from layers ("Frankenmerging")
- [Mixture of Experts merging](#mixture-of-experts-merging)

🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see https://github.com/arcee-ai/mergekit/issues/207.
🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see <https://github.com/arcee-ai/mergekit/issues/207>.

🌐 GUI Launch Alert 🤗 - We are excited to announce the launch of a graphical user interface for mergekit in Hugging Face Spaces! This GUI simplifies the merging process, making it more accessible to a broader audience. Check it out and contribute at [Hugging Face Spaces - mergekit-community](https://huggingface.co/mergekit-community).

@@ -179,13 +180,17 @@ Parameters:

Mergekit allows extracting PEFT-compatible low-rank approximations of finetuned models.

### Usage:
### Usage

```sh
mergekit-extract-lora finetuned_model_id_or_path base_model_id_or_path output_path [--no-lazy-unpickle] --rank=desired_rank
```

# Citation
## Mixture of Experts merging

The `mergekit-moe` script supports merging multiple dense models into a mixture of experts, either for direct use or for further training. For more details see the [`mergekit-moe` documentation](docs/moe.md).

## Citation

We now have a [paper](https://arxiv.org/abs/2403.13257) you can cite for the MergeKit library:

87 changes: 82 additions & 5 deletions docs/moe.md
@@ -1,6 +1,12 @@
# mergekit-moe

`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models. `mergekit-moe` uses its own YML configuration syntax, which looks like so:
`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models.

If using the `hidden` or `cheap_embed` gate mode, the output model will be usable without any further training. If you are initializing a model for further training, such as for sparse upcycling, use the `random` gate mode to get a model ready for training.

## Configuration

`mergekit-moe` uses its own YML configuration syntax, which looks like so:

```yml
base_model: path/to/self_attn_donor
@@ -21,18 +27,89 @@ experts:

The script takes two arguments, an input config and an output path: `mergekit-moe ./config.yml ./my-clowncar-moe-12x180B`

## Gate Modes
Currently the script can output models that use the Mixtral, Deepseek MoE, or Qwen MoE architectures. Some output architectures support a shared expert that is activated for all tokens, which can be configured like this:

```yml
base_model: path/to/self_attn_donor
gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
experts:
...
shared_experts:
- source_model: model_name
positive_prompts: # required by Qwen MoE for "hidden" gate mode, otherwise not allowed
- "blah blah"
# (optional, but recommended:)
residual_scale: 0.1 # downweight output from shared expert to prevent overcooking the model
```

Currently at most one shared expert is supported.

An appropriate output architecture will be inferred based on the input models and the presence or absence of shared experts in your configuration. Alternatively, you can explicitly specify an output architecture by setting the `architecture:` field in your config. For example:

```yml
base_model: path/to/self_attn_donor
architecture: qwen
# ... and so on
```

### Gate Modes

There are three methods implemented for populating the MoE gates.

### "hidden"
#### "hidden"

Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model, so you might not be able to use this on constrained hardware (depending on the model). You can use `--load-in-8bit` or `--load-in-4bit` to reduce VRAM usage.
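
As a rough illustration of the idea (an assumed simplification, not the exact routine `mergekit-moe` uses), the router row for each expert at a given layer can be taken as the average of the base model's hidden states for that expert's positive prompts:

```python
from typing import List

import torch

def hidden_gate_for_layer(prompt_hiddens: List[List[torch.Tensor]]) -> torch.Tensor:
    # prompt_hiddens[e][p]: the (hidden_dim,) hidden state of expert e's prompt p
    # at this layer, computed by running the prompt through the base model.
    rows = [torch.stack(per_expert).mean(dim=0) for per_expert in prompt_hiddens]
    return torch.stack(rows)  # (num_experts, hidden_dim) gate weight for this layer
```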

### "cheap_embed"
#### "cheap_embed"

Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower-end hardware.
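
A comparable sketch for `cheap_embed` (again an assumed simplification): average the raw token embeddings of each expert's prompts and reuse the resulting vector as that expert's gate row at every layer:

```python
from typing import List

import torch

def cheap_embed_gate(embed_tokens: torch.nn.Embedding,
                     prompt_token_ids: List[List[torch.Tensor]]) -> torch.Tensor:
    rows = []
    for expert_prompts in prompt_token_ids:
        # Average over tokens within a prompt, then over the expert's prompts.
        vecs = [embed_tokens(ids).mean(dim=0) for ids in expert_prompts]
        rows.append(torch.stack(vecs).mean(dim=0))
    return torch.stack(rows)  # (num_experts, hidden_dim), reused for every layer
```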

### "random"
#### "random"

Randomly initializes the MoE gates. Good if you are going to fine-tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.
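
And a sketch of `random` initialization (assumed shapes): one small-magnitude Gaussian gate matrix per MoE layer, intended to be trained afterwards:

```python
from typing import List

import torch

def random_gates(num_layers: int, num_experts: int, hidden_dim: int) -> List[torch.Tensor]:
    # One (num_experts, hidden_dim) router weight matrix per layer.
    return [torch.randn(num_experts, hidden_dim) * 0.02 for _ in range(num_layers)]
```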

## Example Configurations

Sparse upcycling of smol_llama into an 8x220M MoE:

```yml
base_model: BEE-spoke-data/smol_llama-220M-GQA
gate_mode: random
dtype: bfloat16
experts:
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
# and then train the sucker!
```

Shove some Mistral models in a clown car:

```yml
base_model: NousResearch/Hermes-2-Pro-Mistral-7B
gate_mode: hidden
dtype: bfloat16
experts:
- source_model: NousResearch/Hermes-2-Pro-Mistral-7B
positive_prompts:
- "<|im_start|>user\nHello, who are you?<|im_end|>"
- "<|im_start|>user\nI need help with"
- source_model: BioMistral/BioMistral-7B-DARE
positive_prompts:
- "As a doctor of medicine,"
- source_model: PocketDoc/Dans-AdventurousWinds-7b
positive_prompts:
- "[Genres: Science Fiction]\n[Tags: humor, old school, sci fi]"
- "> get ye flask"
- "[Mode: Interactive Storyteller]"
- source_model: VAGOsolutions/SauerkrautLM-7b-HerO
positive_prompts:
- "<|im_start|>user\nWie geht es dir?<|im_end|>"
- "Das ist ein Satz auf Deutsch."
```
1 change: 1 addition & 0 deletions mergekit/architecture.py
@@ -350,6 +350,7 @@ def _load_all_architectures() -> (

JSON_ARCHITECTURES, NAME_TO_ARCH = _load_all_architectures()
MISTRAL_INFO = _load_json_arch("mistral.json")
QWEN2_INFO = _load_json_arch("qwen2.json")


def get_architecture_info(config: PretrainedConfig) -> ArchitectureInfo:
5 changes: 4 additions & 1 deletion mergekit/common.py
@@ -184,7 +184,10 @@ def __str__(self) -> str:
return str(self.model)


def dtype_from_name(name: Optional[str]) -> torch.dtype:
def dtype_from_name(name: Optional[str]) -> Optional[torch.dtype]:
if not name:
return None

if name.startswith("torch."):
name = name[len("torch.") :]

19 changes: 19 additions & 0 deletions mergekit/moe/__init__.py
@@ -0,0 +1,19 @@
from typing import List

from mergekit.moe.arch import MoEOutputArchitecture
from mergekit.moe.deepseek import DeepseekMoE
from mergekit.moe.mixtral import MixtralMoE

ALL_OUTPUT_ARCHITECTURES: List[MoEOutputArchitecture] = [MixtralMoE(), DeepseekMoE()]

try:
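    # Qwen MoE output support is optional: register it only if its imports are available.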
from mergekit.moe.qwen import QwenMoE
except ImportError:
pass
else:
ALL_OUTPUT_ARCHITECTURES.append(QwenMoE())

__all__ = [
"ALL_OUTPUT_ARCHITECTURES",
"MoEOutputArchitecture",
]
53 changes: 53 additions & 0 deletions mergekit/moe/arch.py
@@ -0,0 +1,53 @@
# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

from abc import ABC, abstractmethod
from typing import List, Optional

import torch

from mergekit.moe.config import MoEMergeConfig
from mergekit.options import MergeOptions


class MoEOutputArchitecture(ABC):
@abstractmethod
def name(self) -> str:
"""Return a human-readable name for the architecture."""
pass

@abstractmethod
def supports_config(
self,
config: MoEMergeConfig,
explain: bool = False,
trust_remote_code: bool = False,
) -> bool:
"""Return whether this architecture supports the given config.
If `explain` is True, log an explanation of why the config is not supported."""
pass

@abstractmethod
def write_model(
self,
out_path: str,
config: MoEMergeConfig,
merge_options: MergeOptions,
router_weights: List[torch.Tensor],
shared_router_weights: Optional[List[torch.Tensor]] = None,
):
"""Write the config and tensors for the output MoE to the given path."""
pass
75 changes: 75 additions & 0 deletions mergekit/moe/common.py
@@ -0,0 +1,75 @@
# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

from typing import Dict, Optional

import torch
import tqdm
import transformers

from mergekit.common import ModelReference, dtype_from_name
from mergekit.io import LazyTensorLoader, TensorWriter
from mergekit.merge import MergeOptions
from mergekit.moe.config import Expert, MoEMergeConfig


def initialize_io(
config: MoEMergeConfig,
out_path: str,
merge_options: MergeOptions,
) -> tuple[Dict[ModelReference, LazyTensorLoader], LazyTensorLoader, TensorWriter]:
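    """Create lazy tensor loaders for the base model and each expert, plus a TensorWriter for the output path."""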
base_model = config.base_model
loaders: Dict[ModelReference, LazyTensorLoader] = {}
for model in tqdm.tqdm(
[base_model] + [e.source_model for e in config.experts], desc="Warm up loaders"
):
loaders[model] = model.lazy_loader(
cache_dir=merge_options.transformers_cache,
lazy_unpickle=merge_options.lazy_unpickle,
)

base_loader = loaders.get(base_model)
writer = TensorWriter(
out_path=out_path,
max_shard_size=merge_options.out_shard_size,
safe_serialization=merge_options.safe_serialization,
)

return loaders, base_loader, writer


def select_dtype(
config: MoEMergeConfig, base_cfg: transformers.PretrainedConfig
) -> Optional[torch.dtype]:
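    """Resolve the output dtype: the config's `dtype` if set, otherwise fall back to the base model's `torch_dtype`."""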
out_dtype = None
if config.dtype:
out_dtype = dtype_from_name(config.dtype)

if out_dtype is None and base_cfg.torch_dtype:
out_dtype = base_cfg.torch_dtype
if isinstance(out_dtype, str):
out_dtype = dtype_from_name(out_dtype)
return out_dtype


def noise_and_scale(
tensor: torch.Tensor, expert: Expert, is_residual: bool = False
) -> torch.Tensor:
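    """Optionally add Gaussian noise (`noise_scale`) to an expert tensor and apply `residual_scale` to residual tensors."""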
if expert.noise_scale is not None:
noise = torch.randn_like(tensor) * expert.noise_scale
tensor = tensor + noise
if is_residual and expert.residual_scale is not None:
tensor = tensor * expert.residual_scale
return tensor