[`Add Mamba`] Adds support for the `Mamba` models #28094

ArthurZucker · 2023-12-16T18:35:52Z

What does this PR do?

Implement cpu ops
Add integration tests
Implement fast path
Check training + PEFT
Convert all checkpoints: just need to make sure the config is correct

Feel free to try this:

from transformers import MambaForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and reuse the EOS token for padding
tokenizer = AutoTokenizer.from_pretrained("ArthurZ/mamba-130m")
tokenizer.pad_token = tokenizer.eos_token

# Load the 130m checkpoint in float32, overriding two config values
model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", vocab_size=50280, num_hidden_layers=24, torch_dtype=torch.float32)
model.config.use_cache = True
input_ids = tokenizer("Hey how are you doing?", return_tensors="pt")["input_ids"]

# Generate 10 new tokens and decode the whole batch
out = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.batch_decode(out))

PEFT training works, thanks @younesbelkada. Results: https://huggingface.co/ArthurZ/mamba-2.4b-english-quotes

from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments

model_id = "ArthurZ/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained(model_id, pad_token="<s>")
model = AutoModelForCausalLM.from_pretrained(model_id)
dataset = load_dataset("Abirate/english_quotes", split="train")

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_dir="./logs",
    logging_steps=10,
    learning_rate=2e-3,
)

# LoRA adapters on every linear layer
lora_config = LoraConfig(
    r=8,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    bias="none",
)

# Supervised fine-tuning on the "quote" text column
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=dataset,
    dataset_text_field="quote",
)
trainer.train()

pink: 360m, full fine-tune
blue: 2.8b peft
red: 2.8b peft

fixes #28086

ArthurZucker · 2024-01-16T08:06:06Z

Oops! Still planned, but KVCache will come first.

ArthurZucker · 2024-01-31T03:01:54Z

Alright I am picking this back up!

HuggingFaceDocBuilderDev · 2024-01-31T07:22:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

apoorvkh · 2024-02-01T22:49:00Z

Hey, it's great to see that mamba is being integrated in Transformers! Just wondering, is there a timeline or ETA for this PR? Thanks so much.

ArthurZucker · 2024-02-02T07:17:16Z

I want to merge it asap so probably max end of next week!

This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass.

This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations.

#### Helpful resources

[Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752)
https://github.com/johnma2006/mamba-minimal
https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs
huggingface/transformers#28094

Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768`

## Update / Current State

Integration tests have been added and basic functionality such as model loading is supported.

```bash
cd integration-tests
pytest -vv models/test_fused_kernel_mamba.py
```

- [x] add tests
- [x] load model
- [x] make simple request
- [ ] resolve warmup issue
- [ ] resolve output issues

fetching models tested during dev

```bash
text-generation-server download-weights state-spaces/mamba-130m
text-generation-server download-weights state-spaces/mamba-1.4b
text-generation-server download-weights state-spaces/mamba-2.8b
```

The server can be run

```bash
cd server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b
```

router

```bash
cargo run
```

make a request

```bash
curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq
```

response

```json
{
  "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data."
}
```

---------

Co-authored-by: Nicolas Patry <[email protected]>

ArthurZucker · 2024-02-12T14:09:41Z

Got sidetracked, done with caching issues!
I was mulling over the stateful vs stateless approach we want to take to support torch compile and graphs without the extra complexity, similarly to #27931.
It was advised that for Mamba, the cache should work in a stateless manner.
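
To make the stateful vs stateless distinction concrete, here is a minimal sketch on a toy scalar recurrence, not the actual Mamba cache. The names `StatefulCell`, `stateless_step`, `run_stateless` and the constants `A` and `B` are hypothetical. The only point is that the stateless variant receives the state as an argument and returns the updated one, instead of mutating an attribute on an object.

```python
A, B = 0.5, 1.0  # hypothetical decay and input gains for a toy recurrence


class StatefulCell:
    """Keeps the recurrent state inside the object and mutates it in place."""

    def __init__(self) -> None:
        self.state = 0.0

    def step(self, x: float) -> float:
        self.state = A * self.state + B * x
        return self.state


def stateless_step(state: float, x: float) -> tuple[float, float]:
    """Pure function: the caller owns the state and passes it in explicitly."""
    new_state = A * state + B * x
    return new_state, new_state  # (updated state, output)


def run_stateless(xs: list[float]) -> float:
    """Thread the state through successive calls and return the last output."""
    state, out = 0.0, 0.0
    for x in xs:
        state, out = stateless_step(state, x)
    return out
```

Both variants compute the same values; they differ only in who owns the state between calls.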

ArthurZucker · 2024-03-05T05:43:59Z

Done! 🤗

LysandreJik

This looks good to me, please add the example that you have in the PR description somewhere in the documentation as well. The current examples don't really show how to use the model imo.

docs/source/en/model_doc/mamba.md

Co-authored-by: Lysandre Debut <[email protected]>

* initial-commit * start cleaning * small nits * small nits * current updates * add kernels * small refactoring little step * add comments * styling * nit * nits * Style * Small changes * Push dummy mambda simple slow * nit * Use original names * Use original names and remove norm * Updates for inference params * Style nd updates * nits * Match logits * Add a test * Add expected generated text * nits doc, imports and styling * style * oups * dont install kernels, invite users to install the required kernels * let use use the original packages * styling * nits * fix some copieds * update doc * fix-copies * styling done * nits * fix import check * run but wrong cuda ress * mamba CUDA works :) * fix the fast path * config naming nits * conversion script is not required at this stage * finish fixing the fast path: generation make sense now! * nit * Let's start working on the CIs * style * better style * more nits * test nit * quick fix for now * nits * nit * nit * nit * nits * update test rest * fixup * update test * nit * some fixes * nits * update test values * fix styling * nit * support peft * integrations tests require torchg * also add slow markers * styling * chose forward wisely * nits * update tests * fix gradient checkpointing * fixup * nit * fix doc * check copies * fix the docstring * fix some more tests * style * fix beam search * add init schene * update * nit * fix * fixup the doc * fix the doc * fixup * tentative update but slow is no longer good * nit * should we always use float32? * nits * revert wrong changes * res in float32 * cleanup * skip fmt for now * update generation values * update test values running original model * fixup * update tests + rename inference_params to cache_params + make sure training does not use cache_params * small nits * more nits * fix final CIs * style * nit doc * I hope final doc nits * nit * 🫠 * final touch! 
* fix torch import * Apply suggestions from code review Co-authored-by: Lysandre Debut <[email protected]> * Apply suggestions from code review * fix fix and fix * fix base model prefix! * nit * Update src/transformers/models/mamba/__init__.py * Update docs/source/en/model_doc/mamba.md Co-authored-by: Lysandre Debut <[email protected]> * nit --------- Co-authored-by: Lysandre Debut <[email protected]>

abdulfatir · 2024-03-15T10:17:34Z

@ArthurZucker Thank you for this amazing addition. Are there any plans to add something equivalent to attention_mask for Mamba?

ArthurZucker · 2024-03-30T15:15:52Z

Not sure why you would need it?

abdulfatir · 2024-03-31T21:16:15Z

For batched inference with inputs of different length.
For pretraining with different masking schemes than a causal mask.
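
A toy sketch of why padding can matter when sequences of different lengths share a batch, using a scalar linear recurrence rather than Mamba itself. The function `final_state` and its constants are hypothetical, and the sketch assumes padded positions contribute exactly zero input, which a real model would have to enforce explicitly.

```python
def final_state(xs, a=0.5, b=1.0):
    """Run h_t = a * h_{t-1} + b * x_t from h_0 = 0 and return the last h."""
    h = 0.0
    for x in xs:
        h = a * h + b * x
    return h


seq = [1.0, 2.0]
left_padded = [0.0, 0.0] + seq   # zeros before the real tokens
right_padded = seq + [0.0, 0.0]  # zeros after the real tokens

print(final_state(seq))           # state after the real tokens only
print(final_state(left_padded))   # zero inputs on a zero state leave it at zero
print(final_state(right_padded))  # the state keeps decaying through the padding
```

Under these assumptions, left padding leaves the final state unchanged, while right padding does not, so some way of marking padded positions is still useful even without attention.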

ArthurZucker · 2024-04-01T06:59:03Z

There is no notion of a causal mask or masking in Mamba, as it is not based on attention. That's why I am not sure I follow.

lkurlandski · 2024-04-15T14:15:49Z

Hi.

There is a problem in the Trainer where Trainer.prediction_step returns the logits as a tuple[Tensor, MambaCache] object. This causes a host of issues when accelerate tries to move the logits to the same device, change datatypes, etc. The solution is to set the "keys_to_ignore_at_inference" field of the associated Config class to include "cache_params". The change is simple:

class MambaConfig:
    keys_to_ignore_at_inference = ["cache_params"]

Full disclosure, I encountered this "bug" in my own MambaForSequenceClassification class, not a module from transformers itself and I have not really tested this thoroughly to see if it is present in the classes from transformers.
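
A minimal sketch of the mechanism, written in plain Python so it runs without transformers. It mimics the filtering idea: keys listed in `keys_to_ignore_at_inference` (and the loss) are dropped before the remaining output values are returned as logits. `ToyConfig`, `NoIgnoreConfig`, `collect_logits`, and the dict-shaped outputs are hypothetical stand-ins, not the Trainer's actual code.

```python
class ToyConfig:
    # Hypothetical stand-in for a model config; mirrors the proposed one-line fix.
    keys_to_ignore_at_inference = ["cache_params"]


class NoIgnoreConfig:
    # Hypothetical config without the field, i.e. the behaviour before the fix.
    pass


def collect_logits(outputs: dict, config) -> tuple:
    """Keep every output value except the loss and the ignored keys."""
    ignore_keys = list(getattr(config, "keys_to_ignore_at_inference", []))
    return tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"])


# Stand-in for a model output that carries a cache object next to the logits
outputs = {"loss": 0.1, "logits": [[0.2, 0.8]], "cache_params": object()}

print(len(collect_logits(outputs, NoIgnoreConfig())))  # logits and the cache object
print(len(collect_logits(outputs, ToyConfig())))       # logits only
```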

@ArthurZucker tagging you :)

ArthurZucker · 2024-04-18T09:15:01Z

Feel free to open a PR for the fix! 🤗

ArthurZucker · 2024-04-18T09:15:15Z

Also, use_cache=False should prevent this as well, no?

initial-commit

81c642f

ArthurZucker linked an issue Dec 16, 2023 that may be closed by this pull request

Add [Mamba] model #28086

Closed

2 tasks

huggingface deleted a comment from github-actions bot Jan 16, 2024

drbh mentioned this pull request Jan 25, 2024

Impl simple mamba model huggingface/text-generation-inference#1480

Merged

5 tasks

Merge branch 'main' of github.com:huggingface/transformers into add-m…

c50602b

…amba

ArthurZucker added 2 commits February 1, 2024 08:09

start cleaning

00d3a6c

small nits

921bb24

ArthurZucker added 7 commits February 3, 2024 18:26

small nits

b3f216d

current updates

7235b57

add kernels

7a407a7

small refactoring little step

9f2a982

add comments

04c991a

styling

aa7e8d2

nit

26748c4

IggoOnCode mentioned this pull request Feb 10, 2024

Mamba-Ssm - Loader for Mamba State Space models oobabooga/text-generation-webui#5228

Closed

1 task

ArthurZucker added 8 commits February 14, 2024 01:25

nits

75e376a

Style

1c104b5

Merge

0e90dae

Small changes

a804466

Push dummy mambda simple slow

6b87ad2

nit

a7ec8d6

Use original names

5046451

Use original names and remove norm

b5831e3

Update src/transformers/models/mamba/__init__.py

28e5ef0

ArthurZucker requested a review from LysandreJik March 5, 2024 01:38

LysandreJik approved these changes Mar 5, 2024

docs/source/en/model_doc/mamba.md Outdated

Update docs/source/en/model_doc/mamba.md

f963e38

Co-authored-by: Lysandre Debut <[email protected]>

ArthurZucker force-pushed the add-mamba branch 2 times, most recently from ee6a9c2 to f963e38 Compare March 5, 2024 10:38

nit

095dabd

ArthurZucker merged commit fb1c62e into main Mar 5, 2024
23 checks passed

ArthurZucker deleted the add-mamba branch March 5, 2024 11:01

JLTastet mentioned this pull request Mar 5, 2024

Submit implementation to HuggingFace Transformers library state-spaces/mamba#60

Closed

compilade mentioned this pull request Mar 8, 2024

llama : support Mamba Selective State Space Models ggerganov/llama.cpp#5328

Merged

8 tasks

khipp mentioned this pull request Mar 10, 2024

Add missing localized READMEs to the copies check #29575

Merged

1 task

haileyschoelkopf mentioned this pull request Mar 13, 2024

Conversion Script for Mamba checkpoints (mamba_ssm -> transformers) #29631

Closed

joelburget mentioned this pull request Mar 15, 2024

Switch to HuggingFace's Mamba implementation joelburget/mamba-sae#2

Closed

ghosthamlet mentioned this pull request Mar 17, 2024

Will deep StateSpace models add to this library? #28101

Closed

xenova mentioned this pull request Mar 22, 2024

Generating onnx file for the inference of Mamba? state-spaces/mamba#200

Open

HansPolo113 mentioned this pull request Jul 5, 2024

Support for Transformers library microsoft/Samba#12

Open
