MoE Merge Rework (#263)
Expands the `mergekit-moe` script to support two new output
architectures: Deepseek MoE and Qwen 2 MoE.

Both architectures include support for "shared" experts. Currently the
script supports adding a single shared expert. The Deepseek architecture
uses the shared expert ungated and unweighted, so you probably want to
set the new `residual_scale` option on the shared expert to a relatively
low value (think 0.1ish) to keep the model from being completely
overcooked. Qwen 2 MoE has a gate parameter associated with the shared
expert, so this is less necessary but still advisable.
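
For intuition, here is a minimal sketch (illustrative only, not mergekit's actual code) of how an ungated shared expert contributes to a Deepseek-style MoE layer, and why scaling it down helps:

```python
import torch

def moe_layer_output(routed_output: torch.Tensor,
                     shared_expert_output: torch.Tensor,
                     residual_scale: float = 0.1) -> torch.Tensor:
    # The routed experts are already gated and weighted by the router; the
    # shared expert's output is added directly, so a small residual_scale
    # keeps it from overwhelming the routed contribution.
    return routed_output + residual_scale * shared_expert_output
```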

Deepseek MoE supports either Llama or Mistral based models as inputs.
Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.

Addresses #117, #244, and #134.
cg123 authored Apr 16, 2024
1 parent 846eb3a commit 215f767
Showing 15 changed files with 1,327 additions and 492 deletions.
11 changes: 8 additions & 3 deletions README.md
@@ -10,8 +10,9 @@ Features:
- Lazy loading of tensors for low memory use
- Interpolated gradients for parameter values (inspired by Gryphe's [BlockMerge_Gradient](https://github.com/Gryphe/BlockMerge_Gradient) script)
- Piecewise assembly of language models from layers ("Frankenmerging")
- [Mixture of Experts merging](#mixture-of-experts-merging)

🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see https://github.com/arcee-ai/mergekit/issues/207.
🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see <https://github.com/arcee-ai/mergekit/issues/207>.

🌐 GUI Launch Alert 🤗 - We are excited to announce the launch of a graphical user interface for mergekit in Hugging Face Spaces! This GUI simplifies the merging process, making it more accessible to a broader audience. Check it out and contribute at [Hugging Face Spaces - mergekit-community](https://huggingface.co/mergekit-community).

@@ -179,13 +180,17 @@ Parameters:

Mergekit allows extracting PEFT-compatible low-rank approximations of finetuned models.

### Usage:
### Usage

```sh
mergekit-extract-lora finetuned_model_id_or_path base_model_id_or_path output_path [--no-lazy-unpickle] --rank=desired_rank
```

# Citation
## Mixture of Experts merging

The `mergekit-moe` script supports merging multiple dense models into a mixture of experts, either for direct use or for further training. For more details see the [`mergekit-moe` documentation](docs/moe.md).

## Citation

We now have a [paper](https://arxiv.org/abs/2403.13257) you can cite for the MergeKit library:

87 changes: 82 additions & 5 deletions docs/moe.md
@@ -1,6 +1,12 @@
# mergekit-moe

`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models. `mergekit-moe` uses its own YML configuration syntax, which looks like so:
`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models.

If using the `hidden` or `cheap_embed` gate mode, the output model will be usable without any further training. If you are initializing a model for further training, such as for sparse upcycling, use the `random` gate mode to get a model ready for training.

## Configuration

`mergekit-moe` uses its own YML configuration syntax, which looks like so:

```yml
base_model: path/to/self_attn_donor
@@ -21,18 +27,89 @@ experts:

The script takes two arguments, an input config and an output path: `mergekit-moe ./config.yml ./my-clowncar-moe-12x180B`

## Gate Modes
Currently the script can output models that use the Mixtral, Deepseek MoE, or Qwen MoE architectures. Some output architectures support a shared expert that is activated for all tokens, which can be configured like this:

```yml
base_model: path/to/self_attn_donor
gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
experts:
...
shared_experts:
- source_model: model_name
positive_prompts: # required by Qwen MoE for "hidden" gate mode, otherwise not allowed
- "blah blah"
# (optional, but recommended:)
residual_scale: 0.1 # downweight output from shared expert to prevent overcooking the model
```

Currently at most one shared expert is supported.

An appropriate output architecture will be inferred based on the input models and the presence or absence of shared experts in your configuration. Alternatively, you can explicitly specify an output architecture by setting the `architecture:` field in your config. For example:

```yml
base_model: path/to/self_attn_donor
architecture: qwen
# ... and so on
```

### Gate Modes

There are three methods implemented for populating the MoE gates.

### "hidden"
#### "hidden"

Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model, so you might not be able to use this on constrained hardware (depending on the model). You can use `--load-in-8bit` or `--load-in-4bit` to reduce VRAM usage.
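
As a rough illustration of the idea (an assumed simplification, not the exact routine `mergekit-moe` uses), the router row for each expert at a given layer can be taken as the average of the base model's hidden states for that expert's positive prompts:

```python
from typing import List

import torch

def hidden_gate_for_layer(prompt_hiddens: List[List[torch.Tensor]]) -> torch.Tensor:
    # prompt_hiddens[e][p]: the (hidden_dim,) hidden state of expert e's prompt p
    # at this layer, computed by running the prompt through the base model.
    rows = [torch.stack(per_expert).mean(dim=0) for per_expert in prompt_hiddens]
    return torch.stack(rows)  # (num_experts, hidden_dim) gate weight for this layer
```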

### "cheap_embed"
#### "cheap_embed"

Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower-end hardware.
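
A comparable sketch for `cheap_embed` (again an assumed simplification): average the raw token embeddings of each expert's prompts and reuse the resulting vector as that expert's gate row at every layer:

```python
from typing import List

import torch

def cheap_embed_gate(embed_tokens: torch.nn.Embedding,
                     prompt_token_ids: List[List[torch.Tensor]]) -> torch.Tensor:
    rows = []
    for expert_prompts in prompt_token_ids:
        # Average over tokens within a prompt, then over the expert's prompts.
        vecs = [embed_tokens(ids).mean(dim=0) for ids in expert_prompts]
        rows.append(torch.stack(vecs).mean(dim=0))
    return torch.stack(rows)  # (num_experts, hidden_dim), reused for every layer
```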

### "random"
#### "random"

Randomly initializes the MoE gates. Good if you are going to fine-tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.
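
And a sketch of `random` initialization (assumed shapes): one small-magnitude Gaussian gate matrix per MoE layer, intended to be trained afterwards:

```python
from typing import List

import torch

def random_gates(num_layers: int, num_experts: int, hidden_dim: int) -> List[torch.Tensor]:
    # One (num_experts, hidden_dim) router weight matrix per layer.
    return [torch.randn(num_experts, hidden_dim) * 0.02 for _ in range(num_layers)]
```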

## Example Configurations

Sparse upcycling of smol_llama into an 8x220M MoE:

```yml
base_model: BEE-spoke-data/smol_llama-220M-GQA
gate_mode: random
dtype: bfloat16
experts:
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
- source_model: BEE-spoke-data/smol_llama-220M-GQA
# and then train the sucker!
```

Shove some Mistral models in a clown car:

```yml
base_model: NousResearch/Hermes-2-Pro-Mistral-7B
gate_mode: hidden
dtype: bfloat16
experts:
- source_model: NousResearch/Hermes-2-Pro-Mistral-7B
positive_prompts:
- "<|im_start|>user\nHello, who are you?<|im_end|>"
- "<|im_start|>user\nI need help with"
- source_model: BioMistral/BioMistral-7B-DARE
positive_prompts:
- "As a doctor of medicine,"
- source_model: PocketDoc/Dans-AdventurousWinds-7b
positive_prompts:
- "[Genres: Science Fiction]\n[Tags: humor, old school, sci fi]"
- "> get ye flask"
- "[Mode: Interactive Storyteller]"
- source_model: VAGOsolutions/SauerkrautLM-7b-HerO
positive_prompts:
- "<|im_start|>user\nWie geht es dir?<|im_end|>"
- "Das ist ein Satz auf Deutsch."
```
1 change: 1 addition & 0 deletions mergekit/architecture.py
@@ -350,6 +350,7 @@ def _load_all_architectures() -> (

JSON_ARCHITECTURES, NAME_TO_ARCH = _load_all_architectures()
MISTRAL_INFO = _load_json_arch("mistral.json")
QWEN2_INFO = _load_json_arch("qwen2.json")


def get_architecture_info(config: PretrainedConfig) -> ArchitectureInfo:
5 changes: 4 additions & 1 deletion mergekit/common.py
@@ -184,7 +184,10 @@ def __str__(self) -> str:
return str(self.model)


def dtype_from_name(name: Optional[str]) -> torch.dtype:
def dtype_from_name(name: Optional[str]) -> Optional[torch.dtype]:
if not name:
return None

if name.startswith("torch."):
name = name[len("torch.") :]

19 changes: 19 additions & 0 deletions mergekit/moe/__init__.py
@@ -0,0 +1,19 @@
from typing import List

from mergekit.moe.arch import MoEOutputArchitecture
from mergekit.moe.deepseek import DeepseekMoE
from mergekit.moe.mixtral import MixtralMoE

ALL_OUTPUT_ARCHITECTURES: List[MoEOutputArchitecture] = [MixtralMoE(), DeepseekMoE()]

try:
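    # Qwen MoE output support is optional: register it only if its imports are available.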
from mergekit.moe.qwen import QwenMoE
except ImportError:
pass
else:
ALL_OUTPUT_ARCHITECTURES.append(QwenMoE())

__all__ = [
"ALL_OUTPUT_ARCHITECTURES",
"MoEOutputArchitecture",
]
53 changes: 53 additions & 0 deletions mergekit/moe/arch.py
@@ -0,0 +1,53 @@
# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

from abc import ABC, abstractmethod
from typing import List, Optional

import torch

from mergekit.moe.config import MoEMergeConfig
from mergekit.options import MergeOptions


class MoEOutputArchitecture(ABC):
@abstractmethod
def name(self) -> str:
"""Return a human-readable name for the architecture."""
pass

@abstractmethod
def supports_config(
self,
config: MoEMergeConfig,
explain: bool = False,
trust_remote_code: bool = False,
) -> bool:
"""Return whether this architecture supports the given config.
If `explain` is True, log an explanation of why the config is not supported."""
pass

@abstractmethod
def write_model(
self,
out_path: str,
config: MoEMergeConfig,
merge_options: MergeOptions,
router_weights: List[torch.Tensor],
shared_router_weights: Optional[List[torch.Tensor]] = None,
):
"""Write the config and tensors for the output MoE to the given path."""
pass
75 changes: 75 additions & 0 deletions mergekit/moe/common.py
@@ -0,0 +1,75 @@
# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

from typing import Dict, Optional

import torch
import tqdm
import transformers

from mergekit.common import ModelReference, dtype_from_name
from mergekit.io import LazyTensorLoader, TensorWriter
from mergekit.merge import MergeOptions
from mergekit.moe.config import Expert, MoEMergeConfig


def initialize_io(
config: MoEMergeConfig,
out_path: str,
merge_options: MergeOptions,
) -> tuple[Dict[ModelReference, LazyTensorLoader], LazyTensorLoader, TensorWriter]:
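    """Create lazy tensor loaders for the base model and each expert, plus a TensorWriter for the output path."""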
base_model = config.base_model
loaders: Dict[ModelReference, LazyTensorLoader] = {}
for model in tqdm.tqdm(
[base_model] + [e.source_model for e in config.experts], desc="Warm up loaders"
):
loaders[model] = model.lazy_loader(
cache_dir=merge_options.transformers_cache,
lazy_unpickle=merge_options.lazy_unpickle,
)

base_loader = loaders.get(base_model)
writer = TensorWriter(
out_path=out_path,
max_shard_size=merge_options.out_shard_size,
safe_serialization=merge_options.safe_serialization,
)

return loaders, base_loader, writer


def select_dtype(
config: MoEMergeConfig, base_cfg: transformers.PretrainedConfig
) -> Optional[torch.dtype]:
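    """Resolve the output dtype: the config's `dtype` if set, otherwise fall back to the base model's `torch_dtype`."""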
out_dtype = None
if config.dtype:
out_dtype = dtype_from_name(config.dtype)

if out_dtype is None and base_cfg.torch_dtype:
out_dtype = base_cfg.torch_dtype
if isinstance(out_dtype, str):
out_dtype = dtype_from_name(out_dtype)
return out_dtype


def noise_and_scale(
tensor: torch.Tensor, expert: Expert, is_residual: bool = False
) -> torch.Tensor:
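    """Optionally add Gaussian noise (`noise_scale`) to an expert tensor and apply `residual_scale` to residual tensors."""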
if expert.noise_scale is not None:
noise = torch.randn_like(tensor) * expert.noise_scale
tensor = tensor + noise
if is_residual and expert.residual_scale is not None:
tensor = tensor * expert.residual_scale
return tensor