Add ModernBERT to Transformers #35158
Conversation
cc @Cyrilvallez ! 🤗
Beyond the obvious (sdpa, eager, flex attention, and documentation), I haven't seen anything outrageous or very unexpected in my first scroll-through.
I recognize that this implementation goes a bit beyond our "usual" with unpadding/padding when possible, but I personally don't mind. Beyond this change (and the other obvious upgrades like RoPE), I quite like how this still mirrors the original BERT rather closely.
I'll have to actually start running this to get a better feel, but so far so good.
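For readers unfamiliar with the unpadding/padding trick mentioned above, here is a minimal sketch of the idea: drop padding tokens before running the expensive layers, then scatter the results back into a padded tensor. The helper names and shapes are illustrative, not the PR's actual functions.

import torch

def unpad_input(hidden_states: torch.Tensor, attention_mask: torch.Tensor):
    # hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.
    # Keep only non-padding positions and remember their flat indices for re-padding later.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    unpadded = hidden_states.reshape(-1, hidden_states.shape[-1])[indices]
    return unpadded, indices

def pad_output(unpadded: torch.Tensor, indices: torch.Tensor, batch: int, seq_len: int):
    # Scatter the unpadded rows back to their original (batch, seq_len) positions.
    padded = unpadded.new_zeros(batch * seq_len, unpadded.shape[-1])
    padded[indices] = unpadded
    return padded.view(batch, seq_len, -1)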
Also, the SequenceClassification and TokenClassification classes don't exist yet.
@ArthurZucker @Cyrilvallez
from transformers import PreTrainedTokenizerFast

class ModernBertTokenizerFast(PreTrainedTokenizerFast):
    model_input_names = ["input_ids", "attention_mask"]
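For reference, model_input_names controls which tensors the fast tokenizer returns. A small usage sketch (the checkpoint name below is illustrative and may differ from the released one):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
encoded = tokenizer("ModernBERT is a drop-in BERT replacement.", return_tensors="pt")
print(encoded.keys())  # expected: input_ids and attention_mask only, no token_type_ids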
With all of these changes in place, I was able to confirm that the output of one of the trained models using the original research implementation nearly matches the output of this implementation.

Do we allow something like this to get an exact 1-1 match?

@torch.compile(dynamic=True)
def compiled_mlp(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return self.mlp(self.mlp_norm(hidden_states))

Here's an indication of the difference between with and without:
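For context, a minimal, self-contained sketch of how such a compiled-vs-eager comparison can be measured; the layer sizes and modules below are stand-ins, not the model's actual MLP:

import torch
import torch.nn as nn

torch.manual_seed(0)
mlp_norm = nn.LayerNorm(768)
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

def eager_mlp(hidden_states: torch.Tensor) -> torch.Tensor:
    return mlp(mlp_norm(hidden_states))

compiled_mlp = torch.compile(eager_mlp, dynamic=True)

x = torch.randn(2, 128, 768)
with torch.no_grad():
    # Compilation can fuse ops and change accumulation order, so tiny numerical
    # differences versus eager execution are expected.
    diff = (eager_mlp(x) - compiled_mlp(x)).abs().max()
print(f"max abs difference: {diff.item():.3e}")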
cc @warner-benjamin, let me know if the two should remain separate!
Please confirm @warner-benjamin
The model with FA2 and the RoPE kernel is not torch.compile compatible; we can't compile the whole model while using these.
FA2 is compatible now, but the FA RoPE kernel isn't yet. I have an in-progress fix I need to get merged into the FA repo.
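As a hedged workaround sketch (not part of this PR): while the FA RoPE kernel is incompatible, a user wanting to compile the whole model could select an attention backend that torch.compile already handles, such as SDPA. The checkpoint name is illustrative.

import torch
from transformers import AutoModelForMaskedLM

# Prefer "sdpa" over flash_attention_2 + the FA RoPE kernel when full-model compilation is needed.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base", attn_implementation="sdpa"
)
model = torch.compile(model)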
Users can make custom heads if they feel like it. Also removes the unnecessary pool parameter.
Double-checked with Benjamin that it's correct/what we used for pretraining.
We want mean pooling as an option for classification because, with local attention, the CLS token does not see all the other tokens in every attention layer the way it does in BERT, so mean pooling could outperform CLS pooling on sequences longer than 128 tokens. Also, I added the pooling head to TokenClassification because otherwise we would be throwing away one pretrained linear layer.
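For illustration, a minimal sketch of mask-aware mean pooling versus CLS pooling; this is not the PR's actual head implementation:

import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    # Average only over real tokens so padding does not dilute the representation.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# CLS pooling, by contrast, just takes the first token:
# cls = last_hidden_state[:, 0]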
Super cool! Two things left:
1. Remove the gradient thing: padding and unpadding are not model-weight dependent and should never have gradients.
2. Remove the two functions, and just call the head ClsPooling or MeanPooling depending on the one that was released / most common, cf. our offline discussion @tomaarsen
This reverts commit 99c38ba. There was no real motivation, no info on whether having this bigger head does anything useful.
Great addition! Thanks all for your hard work! 🤗
This PR will add ModernBERT to Transformers.