
Commit 0904268
Merge branch 'main' into main
yao-matrix authored Dec 20, 2024
2 parents b08d1fc + f42084e commit 0904268
Showing 27 changed files with 3,601 additions and 152 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/self-comment-ci.yml
@@ -15,7 +15,7 @@ jobs:
    runs-on: ubuntu-22.04
    name: Get PR number
    # For security: only allow team members to run
-   if: contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez"]'), github.actor)
+   if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
    outputs:
      PR_NUMBER: ${{ steps.set_pr_number.outputs.PR_NUMBER }}
    steps:
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -498,6 +498,8 @@
      title: mLUKE
    - local: model_doc/mobilebert
      title: MobileBERT
    - local: model_doc/modernbert
      title: ModernBert
    - local: model_doc/mpnet
      title: MPNet
    - local: model_doc/mpt
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -232,6 +232,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ |
| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ |
| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
| [ModernBERT](model_doc/modernbert) | ✅ | ❌ | ❌ |
| [Moshi](model_doc/moshi) | ✅ | ❌ | ❌ |
| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ |
| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ |
21 changes: 8 additions & 13 deletions docs/source/en/model_doc/colpali.md
@@ -18,29 +18,24 @@ rendered properly in your Markdown viewer.

## Overview

- The ColPali model was proposed in [ColPali: Efficient Document Retrieval with Vision Language Models](https://doi.org/10.48550/arXiv.2407.01449) by **Manuel Faysse***, **Hugues Sibille***, **Tony Wu***, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution).
+ The *ColPali* model was proposed in [ColPali: Efficient Document Retrieval with Vision Language Models](https://doi.org/10.48550/arXiv.2407.01449) by **Manuel Faysse***, **Hugues Sibille***, **Tony Wu***, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.

- With our new model *ColPali*, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
+ In our proposed *ColPali* approach, we leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. We train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.

- Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document. ColPali is also highly interpretable: similarity maps can be obtained between patches and query tokens. These maps highlight ColPali’s strong OCR capabilities and chart understanding.

- **Paper abstract:**

- > Documents are visually rich structures that convey information through text, but also figures, page layouts, tables, or even fonts. Since modern retrieval systems mainly rely on the textual information they extract from document pages to index documents -often through lengthy and brittle processes-, they struggle to exploit key visual cues efficiently. This limits their capabilities in many practical document retrieval applications such as Retrieval Augmented Generation (RAG). To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark *ViDoRe*, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept; doing document retrieval by directly embedding the images of the document pages. We release *ColPali*, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages. Combined with a late interaction matching mechanism, *ColPali* largely outperforms modern document retrieval pipelines while being drastically simpler, faster and end-to-end trainable.
- >
- > We release models, data, code and benchmarks under open licenses at [https://huggingface.co/vidore](https://huggingface.co/vidore).
+ Using *ColPali* removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, etc.) of a document.

## Resources

- The *ColPali* arXiv paper can be found [here](https://doi.org/10.48550/arXiv.2407.01449). 📄
- The official blog post detailing ColPali can be found [here](https://huggingface.co/blog/manu/colpali). 📝
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found [here](https://github.com/illuin-tech/colpali). 🌎
- - Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity maps generation can be found [here](https://github.com/tonywu71/colpali-cookbooks). 📚
+ - Cookbooks for learning to use the transformers-native version of *ColPali*, fine-tuning, and similarity maps generation can be found [here](https://github.com/tonywu71/colpali-cookbooks). 📚

This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) and [@yonigozlan](https://huggingface.co/yonigozlan).

## Usage

- This example demonstrates how to use ColPali to embed both queries and images, calculate their similarity scores, and identify the most relevant matches. For a specific query, you can retrieve the top-k most similar images by selecting the ones with the highest similarity scores.
+ This example demonstrates how to use *ColPali* to embed both queries and images, calculate their similarity scores, and identify the most relevant matches. For a specific query, you can retrieve the top-k most similar images by selecting the ones with the highest similarity scores.

```python
import torch
@@ -74,8 +69,8 @@ batch_queries = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
-    image_embeddings = model(**batch_images)
-    query_embeddings = model(**batch_queries)
+    image_embeddings = model(**batch_images).embeddings
+    query_embeddings = model(**batch_queries).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
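
# Sketch continuing the example (not part of the diff); assumes `scores` has
# shape [num_queries, num_images]. Retrieve the best-matching image per query:
top_scores, top_indices = scores.topk(k=1, dim=1)
print(top_indices)  # index of the most relevant image for each query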
95 changes: 95 additions & 0 deletions docs/source/en/model_doc/modernbert.md
@@ -0,0 +1,95 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# ModernBert

<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=modernbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-modernbert-blueviolet">
</a>
<a href="https://arxiv.org/abs/2412.13663">
<img alt="Paper page" src="https://img.shields.io/badge/Paper%20page-2412.13663-green">
</a>
</div>

## Overview

The ModernBert model was proposed in [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663) by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.

It is a refresh of the traditional encoder architecture, as used in previous models such as [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta).

It builds on BERT and implements many modern architectural improvements developed since its original release (a short usage sketch follows this list), such as:
- [Rotary Positional Embeddings](https://huggingface.co/blog/designing-positional-encoding) to support sequences of up to 8192 tokens.
- [Unpadding](https://arxiv.org/abs/2208.08124) to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
- [GeGLU](https://arxiv.org/abs/2002.05202) activations replacing the original MLP layers, shown to improve performance.
- [Alternating Attention](https://arxiv.org/abs/2004.05150v2) where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
- [Flash Attention](https://github.com/Dao-AILab/flash-attention) to speed up processing.
- A hardware-aware design, following [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/abs/2401.14489), ensuring maximum efficiency on common inference GPUs.
- Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
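
A minimal usage sketch of the masked-LM head. The checkpoint name `answerdotai/ModernBERT-base` is an assumption here; any ModernBert checkpoint with a masked-LM head works the same way:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# "answerdotai/ModernBERT-base" is an assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the masked position
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```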

The abstract from the paper is the following:

*Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.*

The original code can be found [here](https://github.com/answerdotai/modernbert).

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ModernBert.

<PipelineTag pipeline="text-classification"/>

- A notebook on how to [finetune for General Language Understanding Evaluation (GLUE) with Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb), also available as a Google Colab [notebook](https://colab.research.google.com/github/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb). 🌎

<PipelineTag pipeline="sentence-similarity"/>

- A script on how to [finetune for text similarity or information retrieval with Sentence Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_st.py); a minimal sketch follows below. 🌎
- A script on how to [finetune for information retrieval with PyLate](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_pylate.py). 🌎
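
As a rough sketch of the Sentence Transformers route (assumptions: the `answerdotai/ModernBERT-base` checkpoint name, and that `SentenceTransformer` wraps a plain encoder checkpoint with a mean-pooling head):

```python
from sentence_transformers import SentenceTransformer

# Assumed checkpoint; SentenceTransformer adds a pooling head to the raw encoder.
model = SentenceTransformer("answerdotai/ModernBERT-base")
embeddings = model.encode(["ModernBERT handles 8192-token inputs.", "Encoders remain efficient."])
print(embeddings.shape)  # (2, hidden_dim)
```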

<PipelineTag pipeline="fill-mask"/>

- [Masked language modeling task guide](../tasks/masked_language_modeling)


## ModernBertConfig

[[autodoc]] ModernBertConfig

<frameworkcontent>
<pt>

## ModernBertModel

[[autodoc]] ModernBertModel
- forward

## ModernBertForMaskedLM

[[autodoc]] ModernBertForMaskedLM
- forward

## ModernBertForSequenceClassification

[[autodoc]] ModernBertForSequenceClassification
- forward

## ModernBertForTokenClassification

[[autodoc]] ModernBertForTokenClassification
- forward

</pt>
</frameworkcontent>
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -74,6 +74,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [MBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [ModernBert](https://huggingface.co/docs/transformers/model_doc/modernbert#transformers.ModernBert)
* [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
@@ -265,6 +266,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [ModernBert](https://huggingface.co/docs/transformers/model_doc/modernbert#transformers.ModernBert)
* [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
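
With both backends registered, ModernBert can be loaded with an explicit attention implementation; a small sketch (the checkpoint name is assumed):

```python
import torch
from transformers import AutoModelForMaskedLM

# Use "sdpa" (built into PyTorch 2.x) or "flash_attention_2" (requires the
# flash-attn package). The checkpoint name is assumed for illustration.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```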
18 changes: 18 additions & 0 deletions src/transformers/__init__.py
@@ -606,6 +606,7 @@
"models.mobilenet_v2": ["MobileNetV2Config"],
"models.mobilevit": ["MobileViTConfig"],
"models.mobilevitv2": ["MobileViTV2Config"],
"models.modernbert": ["ModernBertConfig"],
"models.moshi": [
"MoshiConfig",
"MoshiDepthConfig",
@@ -2869,6 +2870,15 @@
"MobileViTV2PreTrainedModel",
]
)
_import_structure["models.modernbert"].extend(
[
"ModernBertForMaskedLM",
"ModernBertForSequenceClassification",
"ModernBertForTokenClassification",
"ModernBertModel",
"ModernBertPreTrainedModel",
]
)
_import_structure["models.moshi"].extend(
[
"MoshiForCausalLM",
@@ -5565,6 +5575,7 @@
    from .models.mobilevitv2 import (
        MobileViTV2Config,
    )
    from .models.modernbert import ModernBertConfig
    from .models.moshi import (
        MoshiConfig,
        MoshiDepthConfig,
@@ -7556,6 +7567,13 @@
        MobileViTV2Model,
        MobileViTV2PreTrainedModel,
    )
    from .models.modernbert import (
        ModernBertForMaskedLM,
        ModernBertForSequenceClassification,
        ModernBertForTokenClassification,
        ModernBertModel,
        ModernBertPreTrainedModel,
    )
    from .models.moshi import (
        MoshiForCausalLM,
        MoshiForConditionalGeneration,
17 changes: 17 additions & 0 deletions src/transformers/loss/loss_utils.py
@@ -47,6 +47,22 @@ def ForCausalLMLoss(
    return loss


def ForMaskedLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100, **kwargs
):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()

    # Flatten the tokens
    logits = logits.view(-1, vocab_size)
    labels = labels.view(-1)

    # Enable model parallelism
    labels = labels.to(logits.device)
    loss = fixed_cross_entropy(logits, labels, num_items_in_batch, ignore_index, **kwargs)
    return loss


def ForSequenceClassificationLoss(labels, pooled_logits, config, **kwargs):
    num_labels = config.num_labels
    if config.problem_type is None:
@@ -101,6 +117,7 @@ def ForTokenClassification(logits, labels, config, **kwargs):

LOSS_MAPPING = {
    "ForCausalLM": ForCausalLMLoss,
    "ForMaskedLM": ForMaskedLMLoss,
    "ForQuestionAnswering": ForQuestionAnsweringLoss,
    "ForSequenceClassification": ForSequenceClassificationLoss,
    "ForTokenClassification": ForTokenClassification,
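
A rough sketch of how the new mapping entry is exercised (shapes and vocab size are illustrative, not from the diff):

```python
import torch
from transformers.loss.loss_utils import LOSS_MAPPING

# Illustrative shapes: batch of 2, sequence length 8, vocab of 1000.
logits = torch.randn(2, 8, 1000)
labels = torch.full((2, 8), -100)  # -100 marks positions to ignore
labels[0, 3] = 42                  # a single masked token carries a label

loss_fn = LOSS_MAPPING["ForMaskedLM"]  # resolves to ForMaskedLMLoss
loss = loss_fn(logits=logits, labels=labels, vocab_size=1000)
print(loss)
```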
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -167,6 +167,7 @@
    mobilenet_v2,
    mobilevit,
    mobilevitv2,
    modernbert,
    moshi,
    mpnet,
    mpt,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -187,6 +187,7 @@
("mobilenet_v2", "MobileNetV2Config"),
("mobilevit", "MobileViTConfig"),
("mobilevitv2", "MobileViTV2Config"),
("modernbert", "ModernBertConfig"),
("moshi", "MoshiConfig"),
("mpnet", "MPNetConfig"),
("mpt", "MptConfig"),
@@ -510,6 +511,7 @@
("mobilenet_v2", "MobileNetV2"),
("mobilevit", "MobileViT"),
("mobilevitv2", "MobileViTV2"),
("modernbert", "ModernBERT"),
("moshi", "Moshi"),
("mpnet", "MPNet"),
("mpt", "MPT"),
4 changes: 4 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -176,6 +176,7 @@
("mobilenet_v2", "MobileNetV2Model"),
("mobilevit", "MobileViTModel"),
("mobilevitv2", "MobileViTV2Model"),
("modernbert", "ModernBertModel"),
("moshi", "MoshiModel"),
("mpnet", "MPNetModel"),
("mpt", "MptModel"),
@@ -838,6 +839,7 @@
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForMaskedLM"),
("mobilebert", "MobileBertForMaskedLM"),
("modernbert", "ModernBertForMaskedLM"),
("mpnet", "MPNetForMaskedLM"),
("mra", "MraForMaskedLM"),
("mvp", "MvpForConditionalGeneration"),
@@ -992,6 +994,7 @@
("mistral", "MistralForSequenceClassification"),
("mixtral", "MixtralForSequenceClassification"),
("mobilebert", "MobileBertForSequenceClassification"),
("modernbert", "ModernBertForSequenceClassification"),
("mpnet", "MPNetForSequenceClassification"),
("mpt", "MptForSequenceClassification"),
("mra", "MraForSequenceClassification"),
@@ -1178,6 +1181,7 @@
("mistral", "MistralForTokenClassification"),
("mixtral", "MixtralForTokenClassification"),
("mobilebert", "MobileBertForTokenClassification"),
("modernbert", "ModernBertForTokenClassification"),
("mpnet", "MPNetForTokenClassification"),
("mpt", "MptForTokenClassification"),
("mra", "MraForTokenClassification"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -313,6 +313,7 @@
("mllama", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moshi", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
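
Because ModernBert registers only a fast tokenizer, loading through `AutoTokenizer` returns a `PreTrainedTokenizerFast`; a small sketch (the checkpoint name is assumed):

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
print(type(tokenizer).__name__)  # PreTrainedTokenizerFast
```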
