
Merge branch 'main' into nuslerp
cg123 authored Dec 8, 2024
2 parents 8a57623 + 00f8bf4 commit 7ee1c40
Showing 57 changed files with 2,834 additions and 411 deletions.
225 changes: 202 additions & 23 deletions README.md
@@ -2,7 +2,39 @@

`mergekit` is a toolkit for merging pre-trained language models. `mergekit` uses an out-of-core approach to perform unreasonably elaborate merges in resource-constrained situations. Merges can be run entirely on CPU or accelerated with as little as 8 GB of VRAM. Many merging algorithms are supported, with more coming as they catch my attention.

## Contents

- [Why Merge Models?](#why-merge-models)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Merge Configuration](#merge-configuration)
- [Parameter Specification](#parameter-specification)
- [Tokenizer Configuration](#tokenizer-configuration)
- [Chat Template Configuration](#chat-template-configuration)
- [Examples](#examples)
- [Merge Methods](#merge-methods)
- [LoRA extraction](#lora-extraction)
- [Mixture of Experts merging](#mixture-of-experts-merging)
- [Evolutionary merge methods](#evolutionary-merge-methods)
- [Merge in the Cloud](#-merge-in-the-cloud-)
- [Citation](#citation)

## Why Merge Models?

Model merging is a powerful technique that allows combining the strengths of different models without the computational overhead of ensembling or the need for additional training. By operating directly in the weight space of models, merging can:

- Combine multiple specialized models into a single versatile model
- Transfer capabilities between models without access to training data
- Find optimal trade-offs between different model behaviors
- Improve performance while maintaining inference costs
- Create new capabilities through creative model combinations

Unlike traditional ensembling, which requires running multiple models, merged models maintain the same inference cost as a single model while often achieving comparable or superior performance.

## Features

Key features of `mergekit` include:

- Supports Llama, Mistral, GPT-NeoX, StableLM, and more
- Many [merge methods](#merge-methods)
@@ -11,10 +43,10 @@ Features:
- Interpolated gradients for parameter values (inspired by Gryphe's [BlockMerge_Gradient](https://github.com/Gryphe/BlockMerge_Gradient) script)
- Piecewise assembly of language models from layers ("Frankenmerging")
- [Mixture of Experts merging](#mixture-of-experts-merging)
- [LoRA extraction](#lora-extraction)
- [Evolutionary merge methods](#evolutionary-merge-methods)

🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see <https://github.com/arcee-ai/mergekit/issues/207>.

🌐 GUI Launch Alert 🤗 - We are excited to announce the launch of a mega-GPU-backed graphical user interface for mergekit in Arcee! This GUI simplifies the merging process, making it more accessible to a broader audience. Check it out and contribute at the [Arcee App](https://app.arcee.ai). There is also a [Hugging Face Space](https://huggingface.co/mergekit-community) with limited GPU resources.

## Installation

@@ -52,7 +84,7 @@ When you have a merged model you're happy with, you may want to share it on the

Once you're happy with your model card and merged model, you can upload it to the Hugging Face Hub using the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/index) Python library.

```sh
# log in to huggingface with an access token (must have write permission)
huggingface-cli login
# upload your model
@@ -72,7 +104,8 @@ Below are the primary elements of a configuration file:
- `base_model`: Specifies the base model used in some merging methods.
- `parameters`: Holds various parameters such as weights and densities, which can also be specified at different levels of the configuration.
- `dtype`: Specifies the data type used for the merging operation.
- `tokenizer` or `tokenizer_source`: Determines how to construct a tokenizer for the merged model.
- `chat_template`: Specifies a chat template for the merged model.
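
For illustration, a minimal configuration touching these elements might look like the sketch below (model names are placeholders, not real repositories):

```yaml
models:
  - model: org/model-a
    parameters:
      weight: 0.5
  - model: org/model-b
    parameters:
      weight: 0.5
merge_method: linear
dtype: float16
```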

### Parameter Specification

@@ -90,23 +123,112 @@ The parameters can be set at different levels, with decreasing precedence as follows:
3. `models.*.parameters` or `input_model_parameters` - applying to any tensors coming from specific input models
4. `parameters` - catchall

### Tokenizer Configuration

The tokenizer behavior can be configured in two ways: using the new `tokenizer` field (recommended) or the legacy `tokenizer_source` field (maintained for backward compatibility). These fields are mutually exclusive - you should use one or the other, not both.

#### Modern Configuration (tokenizer)

The `tokenizer` field provides fine-grained control over vocabulary and embeddings:

```yaml
tokenizer:
  source: "union"            # or "base" or a specific model path
  tokens:                    # Optional: configure specific tokens
    <token_name>:
      source: ...            # Specify embedding source
      force: false           # Optional: force this embedding for all models
  pad_to_multiple_of: null   # Optional: pad vocabulary size
```

##### Tokenizer Source

The `source` field determines the vocabulary of the output model:

- `union`: Combine vocabularies from all input models (default)
- `base`: Use vocabulary from the base model
- `"path/to/model"`: Use vocabulary from a specific model

##### Token Embedding Handling

When merging models with different vocabularies, mergekit uses smart defaults to handle token embeddings:

- If a token exists in the base model, its embedding is used as the default
- If only one model has the token, that model's embedding is used
- Otherwise, an average of all available embeddings is used

You can override these defaults for specific tokens:

```yaml
tokenizer:
  source: union
  tokens:
    # Use embedding from a specific model
    <|im_start|>:
      source: "path/to/chatml/model"
    # Force a specific embedding for all models
    <|special|>:
      source: "path/to/model"
      force: true
    # Map a token to another model's token embedding
    <|renamed_token|>:
      source:
        kind: "model_token"
        model: "path/to/model"
        token: "<|original_token|>"  # or use token_id: 1234
```

##### Practical Example

Here's how you might preserve both Llama 3 Instruct and ChatML prompt formats when merging models:

```yaml
tokenizer:
  source: union
  tokens:
    # ChatML tokens
    <|im_start|>:
      source: "chatml_model"
    <|im_end|>:
      source: "chatml_model"
    # Llama 3 tokens - force original embeddings
    <|start_header_id|>:
      source: "llama3_model"
      force: true
    <|end_header_id|>:
      source: "llama3_model"
      force: true
    <|eot_id|>:
      source: "llama3_model"
      force: true
```

#### Legacy Configuration (tokenizer_source)

For backward compatibility, the `tokenizer_source` field is still supported:

```yaml
tokenizer_source: "union" # or "base" or a model path
```

This provides basic tokenizer selection but lacks the fine-grained control of the modern `tokenizer` field.

### Chat Template Configuration

The optional `chat_template` field allows overriding the chat template used for the merged model.

```yaml
chat_template: "auto" # or a template name or Jinja2 template
```

Options include:

- `"auto"`: Automatically select the most common template among input models
- Built-in templates: `"alpaca"`, `"chatml"`, `"llama3"`, `"mistral"`, `"exaone"`
- A Jinja2 template string for custom formatting
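
As an illustration, a custom template can be supplied inline as a Jinja2 string; the tag format below is invented purely for the example:

```yaml
chat_template: |-
  {% for message in messages %}<|{{ message.role }}|>
  {{ message.content }}</s>
  {% endfor %}{% if add_generation_prompt %}<|assistant|>
  {% endif %}
```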

### Examples

Expand All @@ -129,6 +251,8 @@ A quick overview of the currently supported merge methods:
| [Model Breadcrumbs](https://arxiv.org/abs/2312.06795) + [TIES](https://arxiv.org/abs/2306.01708) | `breadcrumbs_ties` | ✅ | ✅ |
| [Model Stock](https://arxiv.org/abs/2403.19522) | `model_stock` | ✅ | ✅ |
| NuSLERP | `nuslerp` | ❌ | ✅ |
| [DELLA](https://arxiv.org/abs/2406.11617) | `della` | ✅ | ✅ |
| [DELLA](https://arxiv.org/abs/2406.11617) [Task Arithmetic](https://arxiv.org/abs/2212.04089) | `della_linear` | ✅ | ✅ |

### Linear

@@ -173,7 +297,7 @@ Parameters: same as [TIES](#ties) for `dare_ties`, or [Linear](#linear) for `dare_linear`

### [Model Breadcrumbs](https://arxiv.org/abs/2312.06795)

An extension of task arithmetic that discards both small and extremely large differences from the base model. As with DARE, the Model Breadcrumbs algorithm can be used with (`breadcrumbs_ties`) or without (`breadcrumbs`) the sign consensus algorithm of TIES.

Parameters: same as [Linear](#linear), plus:

@@ -202,6 +326,16 @@ Parameters:

To replicate the behavior of the original `slerp` method, set `weight` to `1-t` and `t` for your first and second model respectively.
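
As a sketch (placeholder model names), a two-model `nuslerp` merge approximating the original `slerp` behavior with `t = 0.3` might look like:

```yaml
models:
  - model: org/model-a
    parameters:
      weight: 0.7   # 1 - t
  - model: org/model-b
    parameters:
      weight: 0.3   # t
merge_method: nuslerp
dtype: bfloat16
```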

### [DELLA](https://arxiv.org/abs/2406.11617)

Building upon DARE, DELLA uses adaptive pruning based on parameter magnitudes. DELLA first ranks parameters in each row of delta parameters and assigns drop probabilities inversely proportional to their magnitudes. This allows it to retain more important changes while reducing interference. After pruning, it rescales the remaining parameters similarly to [DARE](#dare). DELLA can be used with (`della`) or without (`della_linear`) the sign-election step of TIES.

Parameters: same as [Linear](#linear), plus:

- `density` - fraction of weights in differences from the base model to retain
- `epsilon` - maximum change in drop probability based on magnitude. Drop probabilities assigned will range from `density - epsilon` to `density + epsilon`. (When selecting values for `density` and `epsilon`, ensure that the range of probabilities falls within 0 to 1)
- `lambda` - scaling factor for the final merged delta parameters before merging with the base parameters.
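
A sketch of a `della` merge with placeholder model names; note that `density ± epsilon` here stays within 0 to 1:

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 1.0
      density: 0.6
      epsilon: 0.15   # drop probabilities range from 0.45 to 0.75
  - model: org/finetune-b
    parameters:
      weight: 1.0
      density: 0.6
      epsilon: 0.15
merge_method: della
base_model: org/base-model
parameters:
  lambda: 1.0
dtype: bfloat16
```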

## LoRA extraction

Mergekit allows extracting PEFT-compatible low-rank approximations of finetuned models.
@@ -216,15 +350,60 @@ mergekit-extract-lora finetuned_model_id_or_path base_model_id_or_path output_path

The `mergekit-moe` script supports merging multiple dense models into a mixture of experts, either for direct use or for further training. For more details see the [`mergekit-moe` documentation](docs/moe.md).
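
As a rough sketch of what such a configuration can look like (model names and prompts are placeholders; see the linked documentation for the authoritative format):

```yaml
base_model: org/base-instruct-model
gate_mode: hidden          # "hidden", "cheap_embed", or "random"
dtype: bfloat16
experts:
  - source_model: org/math-finetune
    positive_prompts:
      - "solve this math problem"
  - source_model: org/code-finetune
    positive_prompts:
      - "write a Python function"
```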

## Evolutionary merge methods

See [`docs/evolve.md`](docs/evolve.md) for details.

## ✨ Merge in the Cloud ✨

We host merging on Arcee's cloud GPUs - you can launch a cloud merge in the [Arcee App](https://app.arcee.ai), or through Python. First, grab an `ARCEE_API_KEY`:

`export ARCEE_API_KEY=<your-api-key>`
`pip install -q arcee-py`

```python
import arcee
arcee.merge_yaml("bio-merge", "./examples/bio-merge.yml")
```

Check your merge status at the [Arcee App](https://app.arcee.ai).

When complete, either deploy your merge:

```python
arcee.start_deployment("bio-merge", merging="bio-merge")
```

Or download your merge:

`!arcee merging download bio-merge`

## Citation

If you find `mergekit` useful in your research, please consider citing the [paper](https://aclanthology.org/2024.emnlp-industry.36/):

```bibtex
@inproceedings{goddard-etal-2024-arcees,
    title = "Arcee{'}s {M}erge{K}it: A Toolkit for Merging Large Language Models",
    author = "Goddard, Charles  and
      Siriwardhana, Shamane  and
      Ehghaghi, Malikeh  and
      Meyers, Luke  and
      Karpukhin, Vladimir  and
      Benedict, Brian  and
      McQuade, Mark  and
      Solawetz, Jacob",
    editor = "Dernoncourt, Franck  and
      Preo{\c{t}}iuc-Pietro, Daniel  and
      Shimorina, Anastasia",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2024",
    address = "Miami, Florida, US",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-industry.36",
    doi = "10.18653/v1/2024.emnlp-industry.36",
    pages = "477--485",
    abstract = "The rapid growth of open-source language models provides the opportunity to merge model checkpoints, combining their parameters to improve performance and versatility. Advances in transfer learning have led to numerous task-specific models, which model merging can integrate into powerful multitask models without additional training. MergeKit is an open-source library designed to support this process with an efficient and extensible framework suitable for any hardware. It has facilitated the merging of thousands of models, contributing to some of the world{'}s most powerful open-source model checkpoints. The library is accessible at: https://github.com/arcee-ai/mergekit.",
}
```
2 changes: 1 addition & 1 deletion docs/evolve.md
@@ -121,7 +121,7 @@ Assigns an actor to each GPU in your cluster and guarantees merges and evaluations

#### `buffered`

Maintains a buffer of tasks scheduled to ensure that there is always a model merging or ready to evaluate for each GPU. Allows for concurrent merging and evaluation of models on the same GPU if enough VRAM is available. Only suitable for a single-node setup or when `--storage-path` points to a fast shared filesystem.

#### `serial`

Expand Down
15 changes: 15 additions & 0 deletions examples/bio-merge.yml
@@ -0,0 +1,15 @@
models:
  - model: mistralai/Mistral-7B-Instruct-v0.2
    parameters:
      density: 0.5
      weight: 0.5
  - model: BioMistral/BioMistral-7B
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: false
  int8_mask: true
dtype: float16
3 changes: 2 additions & 1 deletion mergekit/_data/architectures/bert-masked-lm.json
@@ -44,7 +44,8 @@
        },
        {
            "name": "cls.predictions.decoder.weight",
            "optional": true,
            "tied_names": [
                "bert.embeddings.word_embeddings.weight"
            ],
            "is_embed": true
4 changes: 1 addition & 3 deletions mergekit/_data/architectures/cohere.json
@@ -16,9 +16,7 @@
        {
            "name": "lm_head.weight",
            "is_embed": true,
            "optional": true
        }
    ],
    "num_layers_config_key": "num_hidden_layers",
3 changes: 2 additions & 1 deletion mergekit/_data/architectures/distilbert-masked-lm.json
@@ -40,7 +40,8 @@
        {
            "name": "vocab_projector.weight",
            "is_embed": true,
            "optional": true,
            "tied_names": [
                "distilbert.embeddings.word_embeddings.weight"
            ]
        },