Add ModernBERT to Transformers (#35158)
* initial cut of modernbert for transformers

* small bug fixes

* fixes

* Update import

* Use compiled mlp->mlp_norm to match research implementation

* Propagate changes in modular to modeling

* Replace duplicate attn_out_dropout in favor of attention_dropout

cc @warner-benjamin let me know if the two should remain separate!

* Update BOS to CLS and EOS to SEP

Please confirm @warner-benjamin

* Set default classifier bias to False, matching research repo

* Update tie_word_embeddings description

* Fix _init_weights for ForMaskedLM

* Match base_model_prefix

* Add compiled_head to match research repo outputs

* Fix imports for ModernBertForMaskedLM

* Just use "gelu" default outright for classifier

* Fix config name typo: initalizer -> initializer

* Remove some unused parameters in docstring. Still lots to edit there!

* Compile the embeddings forward

Not having this resulted in very slight differences - so small it wasn't even noticed for the base model, only for the large model.

But for the large model, the tiny difference at the embedding layer propagated through the rest of the model, leading to notable differences of ~0.0084 on average per value, up to 0.2343 in the worst case.
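
For illustration, roughly what compiling the embeddings forward amounts to; the module layout and sizes below are assumed for the sketch, not the committed code:

```python
import torch
from torch import nn

class Embeddings(nn.Module):
    """Minimal embedding block: token embedding -> layer norm -> dropout."""

    def __init__(self, vocab_size: int = 50368, hidden_size: int = 768):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.drop = nn.Dropout(0.0)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.drop(self.norm(self.tok_embeddings(input_ids)))

embeddings = Embeddings()
# Compile only the embedding forward so its numerics match the compiled research implementation.
compiled_forward = torch.compile(embeddings.forward, dynamic=True)
hidden_states = compiled_forward(torch.randint(0, 50368, (1, 16)))
```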

* Add drafts for ForSequenceClassification/ForTokenClassification

* Add initial SDPA support (not exactly equivalent to FA2 yet!)

During testing, FA2 and SDPA still differ by about 0.0098 per value in the token embeddings. It still predicts the correct mask fills, but I'd like to get it fully 1-1 if possible.

* Only use attention dropout if training

* Add initial eager attention support (also not equivalent to FA2 yet!)

Frustratingly, I also can't get eager to be equivalent to FA2 (or sdpa), but it does get really close, i.e. avg ~0.010 difference per value.

Especially if I use fp32 for both FA2 & eager: avg ~0.0029 difference per value.

The fill-mask results are good with eager.

* Add initial tests, output_attentions, output_hidden_states, prune_heads

Tests are based on BERT; not all tests pass yet: 23 failed, 79 passed, 100 skipped.

* Remove kwargs from ModernBertForMaskedLM

Disable sparse_prediction by default to match the normal HF behavior; it can be enabled via the config.

* Remove/adjust/skip improper tests; warn if padding but no attn mask

* Run formatting etc.

* Run python utils/custom_init_isort.py

* FlexAttention with unpadded sequences (matches FA2 within bf16 numerics)

* Reformat init_weights based on review

* self -> module in attention forwards

* Remove if config.tie_word_embeddings

* Reformat output projection on a different line

* Remove pruning

* Remove assert

* Call contiguous() to simplify paths

* Remove prune_qkv_linear_layer

* Format code

* Keep as kwargs, only use if needed

* Remove unused codepaths & related config options

* Remove 3d attn_mask test; fix token classification tuple output

* Reorder: attention_mask above position_ids, fixes gradient checkpointing

* Fix usage if no FA2 or torch v2.5+

* Make torch.compile/triton optional

Should we rename 'compile'? It's a bit vague

* Separate pooling options into separate functions (cls, mean) - cls as default

* Simplify _pad_modernbert_output, remove unused labels path

* Update tied weights to remove decoder.weight, simplify decoder loading

* Adaptively set config.compile based on hf_device_map/device/resize, etc.

* Update ModernBertConfig docstring

* Satisfy some consistency checks, add unfinished docs

* Only set compile to False if there's more than 1 device

* Add docstrings for public ModernBert classes

* Dont replace docstring returns - ends up being duplicate

* Fix mistake in toctree

* Reformat toctree

* Patched FlexAttention, SDPA, Eager with Local Attention

* Implement FA2 -> SDPA -> Eager attn_impl defaulting, crucial

both to match the original performance, and to get the highest inference speed without requiring users to manually pick FA2
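
A rough sketch of that fallback order (hypothetical helper; the real logic lives in `_autoset_attn_implementation` and checks library/hardware availability):

```python
# Hypothetical sketch of the FA2 -> SDPA -> eager defaulting described above;
# the function name and flags are illustrative, not the committed implementation.
def pick_attn_implementation(flash_attn_2_available: bool, sdpa_available: bool) -> str:
    if flash_attn_2_available:
        return "flash_attention_2"
    if sdpa_available:
        return "sdpa"
    return "eager"

assert pick_attn_implementation(False, True) == "sdpa"
```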

* Patch test edge case with Idefics3 not working with 'attn_implementation="sdpa"'

* Repad all_hidden_states as well

* rename config.compile to reference_compile

* disable flex_attention since it crashes

* Update modernbert.md

* Using dtype min to mask in eager

* Fully remove flex attention for now

It's only compatible with the nightly torch 2.6, so we'll leave it be for now. It's also slower than eager/sdpa.

Also, update compile -> reference_compile in one more case

* Call contiguous to allow for .view()

* Copyright 2020 -> 2024

Co-authored-by: Arthur <[email protected]>

* Update/simplify __init__ structure

Co-authored-by: Arthur <[email protected]>

* Remove "... if dropout_prob > 0 else identity"

As dropout with p=0.0 should be as efficient as an identity op.
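
A quick check of that assumption; with p=0.0 every element is kept and the scale factor is 1/(1-0.0)=1, so the output equals the input even in training mode:

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.0)
dropout.train()  # dropout is only active in training mode anyway
x = torch.randn(2, 3)
assert torch.equal(dropout(x), x)  # behaves exactly like nn.Identity
```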

* re-use existing pad/unpad functions instead of creating new ones

* remove flexattention method

* Compute attention_mask and local_attention_mask once in modeling

* Simplify sequence classification prediction heads, only CLS now

Users can make custom heads if they feel like it

Also removes the unnecessary pool parameter

* Simplify module.training in eager attn

* Also export ModernBertPreTrainedModel

* Update the documentation with links to finetuning scripts

* Explain local_attention_mask parameter in docstring

* Simplify _autoset_attn_implementation, rely on super()

* Keep "in" to initialize Prediction head

Double-checked with Benjamin that it's correct and matches what we used for pretraining.

* add back mean pooling

* Use the pooling head in TokenClassification

* update copyright

* Reset config._attn_implementation_internal on failure

* Allow optional attention_mask in ForMaskedLM head

* fix failing run_slow tests

* Add links to the paper

* Remove unpad_no_grad, always pad/unpad without gradients

* local_attention_mask -> sliding_window_mask

* Revert "Use the pooling head in TokenClassification"

This reverts commit 99c38ba.

There was no real motivation, no info on whether having this bigger head does anything useful.

* Simplify pooling, 2 options via if-else
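
For reference, an illustrative sketch of the two pooling branches (cls vs. mean); names and shapes are assumed, this is not the committed implementation:

```python
import torch

def pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor, strategy: str = "cls") -> torch.Tensor:
    """Pool a [batch, seq_len, hidden] tensor to [batch, hidden] with either branch."""
    if strategy == "cls":
        return last_hidden_state[:, 0]
    # Mean over non-padding tokens only.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```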

---------

Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Said Taghadouini <[email protected]>
Co-authored-by: Benjamin Clavié <[email protected]>
Co-authored-by: Antoine Chaffin <[email protected]>
Co-authored-by: Arthur <[email protected]>
7 people authored Dec 19, 2024
1 parent 56ff1e9 commit 667ed56
Showing 19 changed files with 3,568 additions and 2 deletions.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -498,6 +498,8 @@
title: mLUKE
- local: model_doc/mobilebert
title: MobileBERT
- local: model_doc/modernbert
title: ModernBert
- local: model_doc/mpnet
title: MPNet
- local: model_doc/mpt
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -232,6 +232,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MobileNetV2](model_doc/mobilenet_v2) ||||
| [MobileViT](model_doc/mobilevit) ||||
| [MobileViTV2](model_doc/mobilevitv2) ||||
| [ModernBERT](model_doc/modernbert) ||||
| [Moshi](model_doc/moshi) ||||
| [MPNet](model_doc/mpnet) ||||
| [MPT](model_doc/mpt) ||||
91 changes: 91 additions & 0 deletions docs/source/en/model_doc/modernbert.md
@@ -0,0 +1,91 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# ModernBert

<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=modernbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-modernbert-blueviolet">
</a>
<a href="https://arxiv.org/abs/2412.13663">
<img alt="Paper page" src="https://img.shields.io/badge/Paper%20page-2412.13663-green">
</a>
</div>

## Overview

The ModernBert model was proposed in [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663) by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.

It is a refresh of the traditional encoder architecture, as used in previous models such as [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta).

It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
- [Rotary Positional Embeddings](https://huggingface.co/blog/designing-positional-encoding) to support sequences of up to 8192 tokens.
- [Unpadding](https://arxiv.org/abs/2208.08124) to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
- [GeGLU](https://arxiv.org/abs/2002.05202) layers replacing the original MLP layers, shown to improve performance.
- [Alternating Attention](https://arxiv.org/abs/2004.05150v2), where most attention layers employ a sliding window of 128 tokens, with global attention used only every third layer.
- [Flash Attention](https://github.com/Dao-AILab/flash-attention) to speed up processing.
- A hardware-aware design following the recent [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/abs/2401.14489), ensuring maximum efficiency on inference GPUs.
- Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
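
A minimal fill-mask sketch, assuming the `answerdotai/ModernBERT-base` checkpoint released alongside the paper:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",  # assumed checkpoint name
    torch_dtype=torch.bfloat16,
)
print(pipe("The capital of France is [MASK]."))
```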

The abstract from the paper is the following:

*Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.*

The original code can be found [here](https://github.com/answerdotai/modernbert).

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ModernBert.

<PipelineTag pipeline="sentence-similarity"/>

- A script on how to [finetune for text similarity or information retrieval with Sentence Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_st.py). 🌎
- A script on how to [finetune for information retrieval with PyLate](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_pylate.py). 🌎

<PipelineTag pipeline="fill-mask"/>

- [Masked language modeling task guide](../tasks/masked_language_modeling)
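
For masked language modeling without the pipeline, a short sketch using the Auto classes (same assumed checkpoint as above):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```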


## ModernBertConfig

[[autodoc]] ModernBertConfig

<frameworkcontent>
<pt>

## ModernBertModel

[[autodoc]] ModernBertModel
- forward

## ModernBertForMaskedLM

[[autodoc]] ModernBertForMaskedLM
- forward

## ModernBertForSequenceClassification

[[autodoc]] ModernBertForSequenceClassification
- forward

## ModernBertForTokenClassification

[[autodoc]] ModernBertForTokenClassification
- forward

</pt>
</frameworkcontent>
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -74,6 +74,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [MBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [ModernBert](https://huggingface.co/docs/transformers/model_doc/modernbert#transformers.ModernBert)
* [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
@@ -265,6 +266,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [ModernBert](https://huggingface.co/docs/transformers/model_doc/modernbert#transformers.ModernBert)
* [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
18 changes: 18 additions & 0 deletions src/transformers/__init__.py
@@ -606,6 +606,7 @@
"models.mobilenet_v2": ["MobileNetV2Config"],
"models.mobilevit": ["MobileViTConfig"],
"models.mobilevitv2": ["MobileViTV2Config"],
"models.modernbert": ["ModernBertConfig"],
"models.moshi": [
"MoshiConfig",
"MoshiDepthConfig",
@@ -2869,6 +2870,15 @@
"MobileViTV2PreTrainedModel",
]
)
_import_structure["models.modernbert"].extend(
[
"ModernBertForMaskedLM",
"ModernBertForSequenceClassification",
"ModernBertForTokenClassification",
"ModernBertModel",
"ModernBertPreTrainedModel",
]
)
_import_structure["models.moshi"].extend(
[
"MoshiForCausalLM",
@@ -5565,6 +5575,7 @@
from .models.mobilevitv2 import (
MobileViTV2Config,
)
from .models.modernbert import ModernBertConfig
from .models.moshi import (
MoshiConfig,
MoshiDepthConfig,
@@ -7556,6 +7567,13 @@
MobileViTV2Model,
MobileViTV2PreTrainedModel,
)
from .models.modernbert import (
ModernBertForMaskedLM,
ModernBertForSequenceClassification,
ModernBertForTokenClassification,
ModernBertModel,
ModernBertPreTrainedModel,
)
from .models.moshi import (
MoshiForCausalLM,
MoshiForConditionalGeneration,
17 changes: 17 additions & 0 deletions src/transformers/loss/loss_utils.py
@@ -47,6 +47,22 @@ def ForCausalLMLoss(
return loss


def ForMaskedLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100, **kwargs
):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()

    # Flatten the tokens
    logits = logits.view(-1, vocab_size)
    labels = labels.view(-1)

    # Enable model parallelism
    labels = labels.to(logits.device)
    loss = fixed_cross_entropy(logits, labels, num_items_in_batch, ignore_index, **kwargs)
    return loss


def ForSequenceClassificationLoss(labels, pooled_logits, config, **kwargs):
    num_labels = config.num_labels
    if config.problem_type is None:
@@ -101,6 +117,7 @@ def ForTokenClassification(logits, labels, config, **kwargs):

LOSS_MAPPING = {
    "ForCausalLM": ForCausalLMLoss,
    "ForMaskedLM": ForMaskedLMLoss,
    "ForQuestionAnswering": ForQuestionAnsweringLoss,
    "ForSequenceClassification": ForSequenceClassificationLoss,
    "ForTokenClassification": ForTokenClassification,
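For illustration only (not part of the diff), a toy call of the new `ForMaskedLMLoss` with assumed shapes:

```python
import torch
from transformers.loss.loss_utils import ForMaskedLMLoss

# Toy shapes: batch of 2 sequences of length 4 over a 10-token vocab; -100 marks ignored positions.
logits = torch.randn(2, 4, 10)
labels = torch.full((2, 4), -100)
labels[0, 1] = 3  # a single masked position whose true token id is 3
loss = ForMaskedLMLoss(logits=logits, labels=labels, vocab_size=10)
print(loss)
```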
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -167,6 +167,7 @@
mobilenet_v2,
mobilevit,
mobilevitv2,
modernbert,
moshi,
mpnet,
mpt,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -187,6 +187,7 @@
("mobilenet_v2", "MobileNetV2Config"),
("mobilevit", "MobileViTConfig"),
("mobilevitv2", "MobileViTV2Config"),
("modernbert", "ModernBertConfig"),
("moshi", "MoshiConfig"),
("mpnet", "MPNetConfig"),
("mpt", "MptConfig"),
@@ -510,6 +511,7 @@
("mobilenet_v2", "MobileNetV2"),
("mobilevit", "MobileViT"),
("mobilevitv2", "MobileViTV2"),
("modernbert", "ModernBERT"),
("moshi", "Moshi"),
("mpnet", "MPNet"),
("mpt", "MPT"),
4 changes: 4 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -176,6 +176,7 @@
("mobilenet_v2", "MobileNetV2Model"),
("mobilevit", "MobileViTModel"),
("mobilevitv2", "MobileViTV2Model"),
("modernbert", "ModernBertModel"),
("moshi", "MoshiModel"),
("mpnet", "MPNetModel"),
("mpt", "MptModel"),
@@ -838,6 +839,7 @@
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForMaskedLM"),
("mobilebert", "MobileBertForMaskedLM"),
("modernbert", "ModernBertForMaskedLM"),
("mpnet", "MPNetForMaskedLM"),
("mra", "MraForMaskedLM"),
("mvp", "MvpForConditionalGeneration"),
@@ -992,6 +994,7 @@
("mistral", "MistralForSequenceClassification"),
("mixtral", "MixtralForSequenceClassification"),
("mobilebert", "MobileBertForSequenceClassification"),
("modernbert", "ModernBertForSequenceClassification"),
("mpnet", "MPNetForSequenceClassification"),
("mpt", "MptForSequenceClassification"),
("mra", "MraForSequenceClassification"),
@@ -1178,6 +1181,7 @@
("mistral", "MistralForTokenClassification"),
("mixtral", "MixtralForTokenClassification"),
("mobilebert", "MobileBertForTokenClassification"),
("modernbert", "ModernBertForTokenClassification"),
("mpnet", "MPNetForTokenClassification"),
("mpt", "MptForTokenClassification"),
("mra", "MraForTokenClassification"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -313,6 +313,7 @@
("mllama", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moshi", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
27 changes: 27 additions & 0 deletions src/transformers/models/modernbert/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_modernbert import *
    from .modeling_modernbert import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)