From dabe8556686a5727f7b707099967c8ce8ff16e96 Mon Sep 17 00:00:00 2001
From: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Date: Thu, 22 Feb 2024 11:48:01 +0100
Subject: [PATCH 001/549] [Mistral, Mixtral] Improve docs (#29084)
* Improve docs
* Improve chat template
---
docs/source/en/model_doc/mistral.md | 135 ++++++++++++++++++---------
docs/source/en/model_doc/mixtral.md | 137 ++++++++++++++++++----------
2 files changed, 184 insertions(+), 88 deletions(-)
diff --git a/docs/source/en/model_doc/mistral.md b/docs/source/en/model_doc/mistral.md
index 31b5deaf9dd63b..0ab214206165f1 100644
--- a/docs/source/en/model_doc/mistral.md
+++ b/docs/source/en/model_doc/mistral.md
@@ -18,71 +18,80 @@ rendered properly in your Markdown viewer.
## Overview
-Mistral-7B-v0.1 is Mistral AI's first Large Language Model (LLM).
+Mistral was introduced in [this blogpost](https://mistral.ai/news/announcing-mistral-7b/) by Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
-### Model Details
+The introduction of the blog post says:
-Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:
-* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
-* GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
-* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.
+*Mistral AI team is proud to release Mistral 7B, the most powerful language model for its size to date.*
-We also provide an instruction fine-tuned model: `Mistral-7B-Instruct-v0.1` which can be used for chat-based inference.
+Mistral-7B is the first large language model (LLM) released by [mistral.ai](https://mistral.ai/).
-For more details please read our [release blog post](https://mistral.ai/news/announcing-mistral-7b/)
+### Architectural details
+
+Mistral-7B is a decoder-only Transformer with the following architectural choices:
+
+- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
+- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
+- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.
+
+For more details refer to the [release blog post](https://mistral.ai/news/announcing-mistral-7b/).
### License
-Both `Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` are released under the Apache 2.0 license.
+`Mistral-7B` is released under the Apache 2.0 license.
## Usage tips
-`Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` can be found on the [Huggingface Hub](https://huggingface.co/mistralai)
+The Mistral team has released 3 checkpoints:
-These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hub:
+- a base model, [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), which has been pre-trained to predict the next token on internet-scale data.
+- an instruction tuned model, [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).
+- an improved instruction tuned model, [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), which improves upon v0.1.
+
+The base model can be used as follows:
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
->>> device = "cuda" # the device to load the model onto
->>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
>>> prompt = "My favourite condiment is"
->>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
+>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
->>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
-"The expected output"
+"My favourite condiment is to ..."
```
-Raw weights for `Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` can be downloaded from:
+The instruction tuned model can be used as follows:
-| Model Name | Checkpoint |
-|----------------------------|-----------------------------------------------------------------------------------------|
-| `Mistral-7B-v0.1` | [Raw Checkpoint](https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-v0.1.tar) |
-| `Mistral-7B-Instruct-v0.1` | [Raw Checkpoint](https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-instruct-v0.1.tar) |
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
-To use these raw checkpoints with HuggingFace you can use the `convert_mistral_weights_to_hf.py` script to convert them to the HuggingFace format:
+>>> messages = [
+... {"role": "user", "content": "What is your favourite condiment?"},
+... {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+... {"role": "user", "content": "Do you have mayonnaise recipes?"}
+... ]
-```bash
-python src/transformers/models/mistral/convert_mistral_weights_to_hf.py \
- --input_dir /path/to/downloaded/mistral/weights --model_size 7B --output_dir /output/path
-```
+>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
-You can then load the converted model from the `output/path`:
+>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
+>>> tokenizer.batch_decode(generated_ids)[0]
+"Mayonnaise can be made as follows: (...)"
+```
-```python
-from transformers import MistralForCausalLM, LlamaTokenizer
+As can be seen, the instruction-tuned model requires a [chat template](../chat_templating) to be applied to make sure the inputs are prepared in the right format.
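+
+For illustration, here is a minimal sketch of how to inspect the prompt produced by `apply_chat_template` by passing `tokenize=False` (the string shown is only indicative and depends on the checkpoint's chat template):
+
+```python
+>>> from transformers import AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
+
+>>> messages = [
+... {"role": "user", "content": "What is your favourite condiment?"},
+... {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice."},
+... {"role": "user", "content": "Do you have mayonnaise recipes?"}
+... ]
+
+>>> # tokenize=False returns the formatted prompt string instead of token ids
+>>> tokenizer.apply_chat_template(messages, tokenize=False)
+"<s>[INST] What is your favourite condiment? [/INST] ... (the conversation wrapped in the template's [INST] markers)"
+```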
-tokenizer = LlamaTokenizer.from_pretrained("/output/path")
-model = MistralForCausalLM.from_pretrained("/output/path")
-```
+## Speeding up Mistral by using Flash Attention
-## Combining Mistral and Flash Attention 2
+The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
@@ -90,26 +99,25 @@ First, make sure to install the latest version of Flash Attention 2 to include t
pip install -U flash-attn --no-build-isolation
```
-Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of [`flash-attn`](https://github.com/Dao-AILab/flash-attention) repository. Make also sure to load your model in half-precision (e.g. `torch.float16`)
+Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). Also make sure to load your model in half-precision (e.g. `torch.float16`).
-To load and run a model using Flash Attention 2, refer to the snippet below:
+To load and run a model using Flash Attention 2, refer to the snippet below:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
->>> device = "cuda" # the device to load the model onto
->>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
>>> prompt = "My favourite condiment is"
->>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
+>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
->>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
-"The expected output"
+"My favourite condiment is to (...)"
```
### Expected speedups
@@ -127,9 +135,54 @@ To enable sliding window attention, just make sure to have a `flash-attn` versio
The Flash Attention-2 model also uses a more memory-efficient cache slicing mechanism. Following the official implementation of the Mistral model, which uses a rolling cache, we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"`, and use the absolute position of the current token to compute the positional embedding.
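+
+As an illustration, here is a minimal sketch of batched generation with the left padding described above (it assumes a Flash Attention 2 compatible GPU; the prompts and decoded outputs are only examples):
+
+```python
+>>> import torch
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
+>>> tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated padding token
+
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
+
+>>> prompts = ["My favourite condiment is", "The capital of France is"]
+>>> model_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
+
+>>> generated_ids = model.generate(**model_inputs, max_new_tokens=20, do_sample=True)
+>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+['My favourite condiment is ...', 'The capital of France is ...']
+```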
-## The Mistral Team
+## Shrinking down Mistral using quantization
+
+As the Mistral model has 7 billion parameters, it requires about 14GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), only about 3.5GB of RAM is required.
+
+Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage bitsandbytes quantization (but refer to [this page](../quantization.md) for other quantization methods):
+
+```python
+>>> import torch
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+>>> # specify how to quantize the model
+>>> quantization_config = BitsAndBytesConfig(
+... load_in_4bit=True,
+... bnb_4bit_quant_type="nf4",
+... bnb_4bit_compute_dtype=torch.float16,
+... )
-Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=True, device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
+
+>>> prompt = "My favourite condiment is"
+
+>>> messages = [
+... {"role": "user", "content": "What is your favourite condiment?"},
+... {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+... {"role": "user", "content": "Do you have mayonnaise recipes?"}
+... ]
+
+>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
+
+>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
+>>> tokenizer.batch_decode(generated_ids)[0]
+"The expected output"
+```
+
+This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ).
+The original code can be found [here](https://github.com/mistralai/mistral-src).
+
+## Resources
+
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mistral. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+
+
+
+- A demo notebook to perform supervised fine-tuning (SFT) of Mistral-7B can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb). 🌎
+- A [blog post](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl) on how to fine-tune LLMs in 2024 using Hugging Face tooling. 🌎
+- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRA on a single GPU as well as multi-GPU fine-tuning.
+- [Causal language modeling task guide](../tasks/language_modeling)
## MistralConfig
@@ -158,4 +211,4 @@ Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Sin
## FlaxMistralForCausalLM
[[autodoc]] FlaxMistralForCausalLM
- - __call__
+ - __call__
\ No newline at end of file
diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md
index d1a9ee0a1a07e2..942b040c3f2fd5 100644
--- a/docs/source/en/model_doc/mixtral.md
+++ b/docs/source/en/model_doc/mixtral.md
@@ -18,38 +18,27 @@ rendered properly in your Markdown viewer.
## Overview
-Mixtral-8x7B is Mistral AI's second Large Language Model (LLM).
+Mixtral-8x7B was introduced in the [Mixtral of Experts blogpost](https://mistral.ai/news/mixtral-of-experts/) by Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
-The Mixtral model was proposed by the [Mistral AI](https://mistral.ai/) team.
-
-It was introduced in the [Mixtral of Experts blogpost](https://mistral.ai/news/mixtral-of-experts/) with the following introduction:
+The introduction of the blog post says:
*Today, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture of experts models (SMoE) with open weights. Licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.*
-Tips:
-
-
-- The model needs to be converted using the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mixtral/convert_mixtral_weights_to_hf.py).
-- If the model is quantized to 4bits, a single A100 is enough to fit the entire 45B model.
-
-This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ) .
-The original code can be found [here](https://github.com/mistralai/mistral-src).
-
-
-### Model Details
+Mixtral-8x7B is the second large language model (LLM) released by [mistral.ai](https://mistral.ai/), after [Mistral-7B](mistral).
-Mixtral-45B is a decoder-based LM with the following architectural choices:
+### Architectural details
-* Mixtral is a Mixture of Expert (MOE) model with 8 experts per MLP, with a total of 45B paramateres but the compute required is the same as a 14B model. This is because even though each experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.
+Mixtral-8x7B is a decoder-only Transformer with the following architectural choices:
-The following implementation details are shared with Mistral AI's first model [mistral](mistral):
-* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
-* GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
-* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.
+- Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45 billion parameters. To learn more about mixture-of-experts, refer to the [blog post](https://huggingface.co/blog/moe).
+- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts has to be loaded in RAM (a 70B-like RAM requirement), each token from the hidden states is dispatched to only two experts (top-2 routing), so the compute required at each forward pass is just 2 x sequence_length (see the sketch below).
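+
+To build intuition for this top-2 routing, here is a minimal, self-contained sketch of a toy mixture-of-experts layer (an illustration only, not the actual Mixtral implementation; all sizes are made up):
+
+```python
+import torch
+import torch.nn as nn
+
+hidden_size, num_experts, top_k = 16, 8, 2
+
+# 8 toy "experts" (single linear layers) and a router that scores every expert for each token
+experts = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)])
+router = nn.Linear(hidden_size, num_experts)
+
+tokens = torch.randn(10, hidden_size)  # hidden states for a sequence of 10 tokens
+
+routing_weights = router(tokens).softmax(dim=-1)
+weights, selected_experts = routing_weights.topk(top_k, dim=-1)  # top-2 routing per token
+
+output = torch.zeros_like(tokens)
+for token_idx in range(tokens.shape[0]):
+    for weight, expert_idx in zip(weights[token_idx], selected_experts[token_idx]):
+        # each token only runs through 2 of the 8 experts, so per-token compute stays small
+        # even though all 8 experts have to be kept in memory
+        output[token_idx] += weight * experts[int(expert_idx)](tokens[token_idx])
+```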
-They also provide an instruction fine-tuned model: `mistralai/Mixtral-8x7B-v0.1` which can be used for chat-based inference.
+The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral):
+- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
+- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
+- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.
-For more details please read our [release blog post](https://mistral.ai/news/mixtral-of-experts/)
+For more details refer to the [release blog post](https://mistral.ai/news/mixtral-of-experts/).
### License
@@ -57,44 +46,54 @@ For more details please read our [release blog post](https://mistral.ai/news/mix
## Usage tips
-`Mixtral-8x7B` can be found on the [Huggingface Hub](https://huggingface.co/mistralai)
+The Mistral team has released 2 checkpoints:
+- a base model, [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), which has been pre-trained to predict the next token on internet-scale data.
+- an instruction tuned model, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).
-These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hub:
+The base model can be used as follows:
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
->>> device = "cuda" # the device to load the model onto
->>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
>>> prompt = "My favourite condiment is"
->>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
+>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
->>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
-"The expected output"
+"My favourite condiment is to ..."
```
-To use the raw checkpoints with HuggingFace you can use the `convert_mixtral_weights_to_hf.py` script to convert them to the HuggingFace format:
+The instruction tuned model can be used as follows:
-```bash
-python src/transformers/models/mixtral/convert_mixtral_weights_to_hf.py \
- --input_dir /path/to/downloaded/mistral/weights --output_dir /output/path
-```
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
-You can then load the converted model from the `output/path`:
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
-```python
-from transformers import MixtralForCausalLM, LlamaTokenizer
+>>> messages = [
+... {"role": "user", "content": "What is your favourite condiment?"},
+... {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+... {"role": "user", "content": "Do you have mayonnaise recipes?"}
+... ]
+
+>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
-tokenizer = LlamaTokenizer.from_pretrained("/output/path")
-model = MixtralForCausalLM.from_pretrained("/output/path")
+>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
+>>> tokenizer.batch_decode(generated_ids)[0]
+"Mayonnaise can be made as follows: (...)"
```
-## Combining Mixtral and Flash Attention 2
+As can be seen, the instruction-tuned model requires a [chat template](../chat_templating) to be applied to make sure the inputs are prepared in the right format.
+
+## Speeding up Mixtral by using Flash Attention
+
+The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
@@ -102,21 +101,20 @@ First, make sure to install the latest version of Flash Attention 2 to include t
pip install -U flash-attn --no-build-isolation
```
-Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of [`flash-attn`](https://github.com/Dao-AILab/flash-attention) repository. Make also sure to load your model in half-precision (e.g. `torch.float16`)
+Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). Also make sure to load your model in half-precision (e.g. `torch.float16`).
-To load and run a model using Flash Attention 2, refer to the snippet below:
+To load and run a model using Flash Attention 2, refer to the snippet below:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
->>> device = "cuda" # the device to load the model onto
->>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
>>> prompt = "My favourite condiment is"
->>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
+>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
->>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
@@ -139,9 +137,54 @@ To enable sliding window attention, just make sure to have a `flash-attn` versio
The Flash Attention-2 model also uses a more memory-efficient cache slicing mechanism. Following the official implementation of the Mistral model, which uses a rolling cache, we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"`, and use the absolute position of the current token to compute the positional embedding.
-## The Mistral Team
+## Shrinking down Mixtral using quantization
+
+As the Mixtral model has 45 billion parameters, it requires about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required.
+
+Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage bitsandbytes quantization (but refer to [this page](../quantization.md) for other quantization methods):
+
+```python
+>>> import torch
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+>>> # specify how to quantize the model
+>>> quantization_config = BitsAndBytesConfig(
+... load_in_4bit=True,
+... bnb_4bit_quant_type="nf4",
+... bnb_4bit_compute_dtype=torch.float16,
+... )
+
+>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", quantization_config=True, device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
+
+>>> prompt = "My favourite condiment is"
+
+>>> messages = [
+... {"role": "user", "content": "What is your favourite condiment?"},
+... {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+... {"role": "user", "content": "Do you have mayonnaise recipes?"}
+... ]
+
+>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
+
+>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
+>>> tokenizer.batch_decode(generated_ids)[0]
+"The expected output"
+```
+
+This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ).
+The original code can be found [here](https://github.com/mistralai/mistral-src).
+
+## Resources
+
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mixtral. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+
+
-Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
+- A demo notebook to perform supervised fine-tuning (SFT) of Mixtral-8x7B can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb). 🌎
+- A [blog post](https://medium.com/@prakharsaxena11111/finetuning-mixtral-7bx8-6071b0ebf114) on fine-tuning Mixtral-8x7B using PEFT. 🌎
+- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRA on a single GPU as well as multi-GPU fine-tuning.
+- [Causal language modeling task guide](../tasks/language_modeling)
## MixtralConfig
From 2cc8cf6ce7ae0416561acbb639df4bbc5f409b6f Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Thu, 22 Feb 2024 16:40:06 +0100
Subject: [PATCH 002/549] Fix `torch.compile` with `fullgraph=True` when
`attention_mask` input is used (#29211)
* fix torch.export.export for llama
* do not change doc title
* make fix copies
---
docs/source/en/perf_infer_gpu_one.md | 2 +-
docs/source/en/perf_train_gpu_one.md | 20 +------------------
src/transformers/modeling_attn_mask_utils.py | 18 ++++++++++++-----
.../models/gemma/modeling_gemma.py | 16 +++++++++++----
.../models/llama/modeling_llama.py | 16 +++++++++++----
5 files changed, 39 insertions(+), 33 deletions(-)
diff --git a/docs/source/en/perf_infer_gpu_one.md b/docs/source/en/perf_infer_gpu_one.md
index 69512acd6a6c3f..b03460a7a0d15c 100644
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@@ -184,7 +184,7 @@ For now, Transformers supports SDPA inference and training for the following arc
-FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first.
+FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first. The memory-efficient attention backend is able to handle `fp32` models.
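+
+As an illustration, a minimal sketch of casting at load time (the checkpoint name is only an example):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM
+
+# load directly in half precision so the FlashAttention backend can be used
+model = AutoModelForCausalLM.from_pretrained(
+    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
+)
+```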
diff --git a/docs/source/en/perf_train_gpu_one.md b/docs/source/en/perf_train_gpu_one.md
index 1d885ba03646c7..df27f178616b91 100644
--- a/docs/source/en/perf_train_gpu_one.md
+++ b/docs/source/en/perf_train_gpu_one.md
@@ -529,24 +529,6 @@ And for Pytorch DeepSpeed has built one as well: [DeepSpeed-MoE: Advancing Mixtu
## Using PyTorch native attention and Flash Attention
-PyTorch 2.0 released a native [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA),
-that allows using fused GPU kernels such as [memory-efficient attention](https://arxiv.org/abs/2112.05682) and [flash attention](https://arxiv.org/abs/2205.14135).
-
-After installing the [`optimum`](https://github.com/huggingface/optimum) package, the relevant internal modules can be
-replaced to use PyTorch's native attention with:
-
-```python
-model = model.to_bettertransformer()
-```
-
-Once converted, train the model as usual.
-
-
-
-The PyTorch-native `scaled_dot_product_attention` operator can only dispatch to Flash Attention if no `attention_mask` is provided.
-
-By default, in training mode, the BetterTransformer integration **drops the mask support and can only be used for training that does not require a padding mask for batched training**. This is the case, for example, during masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask.
-
-
+PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available. Please refer to [PyTorch scaled dot product attention](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) for a list of supported models and more details.
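+
+As an illustration, a minimal sketch of explicitly requesting the SDPA implementation when loading a model (the checkpoint name is only an example):
+
+```python
+from transformers import AutoModelForCausalLM
+
+# request PyTorch's scaled_dot_product_attention explicitly instead of relying on the default dispatch
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", attn_implementation="sdpa")
+```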
Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA.
diff --git a/src/transformers/modeling_attn_mask_utils.py b/src/transformers/modeling_attn_mask_utils.py
index 67555239c758ae..1a2c0db7bb140c 100755
--- a/src/transformers/modeling_attn_mask_utils.py
+++ b/src/transformers/modeling_attn_mask_utils.py
@@ -349,8 +349,12 @@ def _prepare_4d_causal_attention_mask_for_sdpa(
# torch.jit.trace, symbolic_trace and torchdynamo with fullgraph=True are unable to capture the controlflow `is_causal=attention_mask is None and q_len > 1`
# used as an SDPA argument. We keep compatibility with these tracing tools by always using SDPA's `attn_mask` argument in case we are tracing.
- # TODO: Fix this as well when using torchdynamo with fullgraph=True.
- is_tracing = torch.jit.is_tracing() or isinstance(inputs_embeds, torch.fx.Proxy)
+ # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
+ is_tracing = (
+ torch.jit.is_tracing()
+ or isinstance(inputs_embeds, torch.fx.Proxy)
+ or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
+ )
if attention_mask is not None:
# 4d mask is passed through
@@ -448,10 +452,14 @@ def _prepare_4d_attention_mask_for_sdpa(mask: torch.Tensor, dtype: torch.dtype,
batch_size, key_value_length = mask.shape
tgt_len = tgt_len if tgt_len is not None else key_value_length
- # torch.jit.trace and torchdynamo with fullgraph=True are unable to capture the controlflow `is_causal=attention_mask is None and q_len > 1`
+ # torch.jit.trace, symbolic_trace and torchdynamo with fullgraph=True are unable to capture the controlflow `is_causal=attention_mask is None and q_len > 1`
# used as an SDPA argument. We keep compatibility with these tracing tools by always using SDPA's `attn_mask` argument in case we are tracing.
- # TODO: Fix this as well when using torchdynamo with fullgraph=True.
- is_tracing = torch.jit.is_tracing()
+ # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
+ is_tracing = (
+ torch.jit.is_tracing()
+ or isinstance(mask, torch.fx.Proxy)
+ or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
+ )
if torch.all(mask == 1):
if is_tracing:
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 165ef5a0545182..d5cfed296a903e 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -969,10 +969,18 @@ def _update_causal_mask(self, attention_mask, input_tensor):
padding_mask, torch.finfo(dtype).min
)
- if self.config._attn_implementation == "sdpa":
- is_tracing = torch.jit.is_tracing() or isinstance(input_tensor, torch.fx.Proxy)
- if not is_tracing and attention_mask is not None and torch.any(attention_mask != 1):
- causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1)[..., None]).to(
+ if self.config._attn_implementation == "sdpa" and attention_mask is not None:
+ # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
+ is_tracing = (
+ torch.jit.is_tracing()
+ or isinstance(input_tensor, torch.fx.Proxy)
+ or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
+ )
+ if not is_tracing and torch.any(attention_mask != 1):
+ # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+ # Details: https://github.com/pytorch/pytorch/issues/110213
+ causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)).to(
dtype
)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 8e494adefc2d73..1d41bf13710e62 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -1076,10 +1076,18 @@ def _update_causal_mask(self, attention_mask, input_tensor):
padding_mask, torch.finfo(dtype).min
)
- if self.config._attn_implementation == "sdpa":
- is_tracing = torch.jit.is_tracing() or isinstance(input_tensor, torch.fx.Proxy)
- if not is_tracing and attention_mask is not None and torch.any(attention_mask != 1):
- causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1)[..., None]).to(
+ if self.config._attn_implementation == "sdpa" and attention_mask is not None:
+ # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
+ is_tracing = (
+ torch.jit.is_tracing()
+ or isinstance(input_tensor, torch.fx.Proxy)
+ or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
+ )
+ if not is_tracing and torch.any(attention_mask != 1):
+ # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+ # Details: https://github.com/pytorch/pytorch/issues/110213
+ causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)).to(
dtype
)
From 45244940725ec1b3e4c390b74dbafe65b298acca Mon Sep 17 00:00:00 2001
From: cchen-dialpad <47165889+cchen-dialpad@users.noreply.github.com>
Date: Fri, 23 Feb 2024 00:19:51 -0800
Subject: [PATCH 003/549] fix(mlflow): check mlflow version to use the
synchronous flag (#29195)
* fix(mlflow): check mlflow version to use the flag
* fix indent
* add log_params async and fix quality
---
.../integrations/integration_utils.py | 20 +++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/src/transformers/integrations/integration_utils.py b/src/transformers/integrations/integration_utils.py
index 3af00c98eb66b2..9367256c870058 100644
--- a/src/transformers/integrations/integration_utils.py
+++ b/src/transformers/integrations/integration_utils.py
@@ -29,6 +29,7 @@
from typing import TYPE_CHECKING, Any, Dict, Literal, Optional, Union
import numpy as np
+import packaging.version
from .. import __version__ as version
from ..utils import flatten_dict, is_datasets_available, is_pandas_available, is_torch_available, logging
@@ -985,6 +986,12 @@ def setup(self, args, state, model):
self._experiment_name = os.getenv("MLFLOW_EXPERIMENT_NAME", None)
self._flatten_params = os.getenv("MLFLOW_FLATTEN_PARAMS", "FALSE").upper() in ENV_VARS_TRUE_VALUES
self._run_id = os.getenv("MLFLOW_RUN_ID", None)
+ self._async_log = False
+ # "synchronous" flag is only available with mlflow version >= 2.8.0
+ # https://github.com/mlflow/mlflow/pull/9705
+ # https://github.com/mlflow/mlflow/releases/tag/v2.8.0
+ if packaging.version.parse(importlib.metadata.version("mlflow")) >= packaging.version.parse("2.8.0"):
+ self._async_log = True
logger.debug(
f"MLflow experiment_name={self._experiment_name}, run_name={args.run_name}, nested={self._nested_run},"
f" tags={self._nested_run}, tracking_uri={self._tracking_uri}"
@@ -1023,7 +1030,12 @@ def setup(self, args, state, model):
# MLflow cannot log more than 100 values in one go, so we have to split it
combined_dict_items = list(combined_dict.items())
for i in range(0, len(combined_dict_items), self._MAX_PARAMS_TAGS_PER_BATCH):
- self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
+ if self._async_log:
+ self._ml_flow.log_params(
+ dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]), synchronous=False
+ )
+ else:
+ self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
mlflow_tags = os.getenv("MLFLOW_TAGS", None)
if mlflow_tags:
mlflow_tags = json.loads(mlflow_tags)
@@ -1047,7 +1059,11 @@ def on_log(self, args, state, control, logs, model=None, **kwargs):
f'Trainer is attempting to log a value of "{v}" of type {type(v)} for key "{k}" as a metric. '
"MLflow's log_metric() only accepts float and int types so we dropped this attribute."
)
- self._ml_flow.log_metrics(metrics=metrics, step=state.global_step, synchronous=False)
+
+ if self._async_log:
+ self._ml_flow.log_metrics(metrics=metrics, step=state.global_step, synchronous=False)
+ else:
+ self._ml_flow.log_metrics(metrics=metrics, step=state.global_step)
def on_train_end(self, args, state, control, **kwargs):
if self._initialized and state.is_world_process_zero:
From 75ed76eceaf9b20c7ec37395e4f5d491135186f9 Mon Sep 17 00:00:00 2001
From: Amin
Date: Fri, 23 Feb 2024 11:26:21 +0300
Subject: [PATCH 004/549] Fix missing translation in README_ru (#29054)
* Fix missing translation in README_ru
* Update README_ru.md
Co-authored-by: Maria Khalusova
---------
Co-authored-by: Maria Khalusova
---
README_ru.md | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/README_ru.md b/README_ru.md
index 3e6f3d54f27e22..1c0f4d41c75592 100644
--- a/README_ru.md
+++ b/README_ru.md
@@ -520,7 +520,8 @@ conda install conda-forge::transformers
1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.
-1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+
+1. Хотите внести новую модель? Мы добавили **подробное руководство и шаблоны**, чтобы помочь вам в процессе добавления новой модели. Вы можете найти их в папке [`templates`](./templates) репозитория. Обязательно ознакомьтесь с [руководством по внесению изменений](./CONTRIBUTING.md) и свяжитесь с ответственным разработчиком или откройте задачу, чтобы собрать отзывы перед началом работы над вашим пулл-реквестом.
Чтобы проверить, есть ли у каждой модели реализация на Flax, PyTorch или TensorFlow, или связанный с ней токенизатор, поддерживаемый библиотекой 🤗 Tokenizers, обратитесь к [этой таблице](https://huggingface.co/docs/transformers/index#supported-frameworks).
From 3f60d11a8750992287cd0d1f3dbc9df6ffc34288 Mon Sep 17 00:00:00 2001
From: Alessandro Palla
Date: Fri, 23 Feb 2024 10:40:44 +0100
Subject: [PATCH 005/549] Improve _update_causal_mask performance (#29210)
* Fix issue 29206
* Fix style
---
src/transformers/models/gemma/modeling_gemma.py | 11 ++++-------
src/transformers/models/llama/modeling_llama.py | 11 ++++-------
2 files changed, 8 insertions(+), 14 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index d5cfed296a903e..4cb12ff4700598 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -959,15 +959,14 @@ def _update_causal_mask(self, attention_mask, input_tensor):
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
# We use the current dtype to avoid any overflows
- causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * torch.finfo(dtype).min
+ min_dtype = torch.finfo(dtype).min
+ causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
causal_mask = causal_mask.to(dtype=dtype, device=device)
if attention_mask is not None and attention_mask.dim() == 2:
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
- causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(
- padding_mask, torch.finfo(dtype).min
- )
+ causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
if self.config._attn_implementation == "sdpa" and attention_mask is not None:
# TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
@@ -980,9 +979,7 @@ def _update_causal_mask(self, attention_mask, input_tensor):
# Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
- causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)).to(
- dtype
- )
+ causal_mask = causal_mask.mul(~torch.all(causal_mask == min_dtype, dim=-1, keepdim=True)).to(dtype)
return causal_mask
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 1d41bf13710e62..66a50c58089191 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -1066,15 +1066,14 @@ def _update_causal_mask(self, attention_mask, input_tensor):
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
# We use the current dtype to avoid any overflows
- causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * torch.finfo(dtype).min
+ min_dtype = torch.finfo(dtype).min
+ causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
causal_mask = causal_mask.to(dtype=dtype, device=device)
if attention_mask is not None and attention_mask.dim() == 2:
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
- causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(
- padding_mask, torch.finfo(dtype).min
- )
+ causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
if self.config._attn_implementation == "sdpa" and attention_mask is not None:
# TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
@@ -1087,9 +1086,7 @@ def _update_causal_mask(self, attention_mask, input_tensor):
# Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
- causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)).to(
- dtype
- )
+ causal_mask = causal_mask.mul(~torch.all(causal_mask == min_dtype, dim=-1, keepdim=True)).to(dtype)
return causal_mask
From 89c64817ce4172bc8bb58c675c445a63f16d0e38 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Fri, 23 Feb 2024 10:43:31 +0100
Subject: [PATCH 006/549] [`Doc`] update model doc qwen2 (#29238)
* update model doc qwen2
* Update docs/source/en/model_doc/qwen2.md
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
docs/source/en/model_doc/qwen2.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/source/en/model_doc/qwen2.md b/docs/source/en/model_doc/qwen2.md
index 61e45fd9c2c8e2..5f9e5dba22b844 100644
--- a/docs/source/en/model_doc/qwen2.md
+++ b/docs/source/en/model_doc/qwen2.md
@@ -35,8 +35,8 @@ In the following, we demonstrate how to use `Qwen2-7B-Chat-beta` for the inferen
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
->>> model = AutoModelForCausalLM.from_pretrained("Qwen2/Qwen2-7B-Chat-beta", device_map="auto")
->>> tokenizer = AutoTokenizer.from_pretrained("Qwen2/Qwen2-7B-Chat-beta")
+>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B-Chat", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
>>> prompt = "Give me a short introduction to large language model."
From 371b572e5504f72024249858861743834c8924b2 Mon Sep 17 00:00:00 2001
From: Matt
Date: Fri, 23 Feb 2024 12:46:31 +0000
Subject: [PATCH 007/549] Allow remote code repo names to contain "." (#29175)
* stash commit
* stash commit
* It works!
* Remove unnecessary change
* We don't actually need the cache_dir!
* Update docstring
* Add test
* Add test with custom cache dir too
* Update model repo path
---
src/transformers/dynamic_module_utils.py | 22 +++++++++++++++++++---
tests/models/auto/test_modeling_auto.py | 21 +++++++++++++++++++++
2 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/src/transformers/dynamic_module_utils.py b/src/transformers/dynamic_module_utils.py
index 2236b30f778c99..34486bb74632d6 100644
--- a/src/transformers/dynamic_module_utils.py
+++ b/src/transformers/dynamic_module_utils.py
@@ -185,19 +185,35 @@ def check_imports(filename: Union[str, os.PathLike]) -> List[str]:
return get_relative_imports(filename)
-def get_class_in_module(class_name: str, module_path: Union[str, os.PathLike]) -> typing.Type:
+def get_class_in_module(repo_id: str, class_name: str, module_path: Union[str, os.PathLike]) -> typing.Type:
"""
Import a module on the cache directory for modules and extract a class from it.
Args:
+ repo_id (`str`): The repo containing the module. Used for path manipulation.
class_name (`str`): The name of the class to import.
module_path (`str` or `os.PathLike`): The path to the module to import.
+
Returns:
`typing.Type`: The class looked for.
"""
module_path = module_path.replace(os.path.sep, ".")
- module = importlib.import_module(module_path)
+ try:
+ module = importlib.import_module(module_path)
+ except ModuleNotFoundError as e:
+ # This can happen when the repo id contains ".", which Python's import machinery interprets as a directory
+ # separator. We do a bit of monkey patching to detect and fix this case.
+ if not (
+ "." in repo_id
+ and module_path.startswith("transformers_modules")
+ and repo_id.replace("/", ".") in module_path
+ ):
+ raise e # We can't figure this one out, just reraise the original error
+ corrected_path = os.path.join(HF_MODULES_CACHE, module_path.replace(".", "/")) + ".py"
+ corrected_path = corrected_path.replace(repo_id.replace(".", "/"), repo_id)
+ module = importlib.machinery.SourceFileLoader(module_path, corrected_path).load_module()
+
return getattr(module, class_name)
@@ -497,7 +513,7 @@ def get_class_from_dynamic_module(
local_files_only=local_files_only,
repo_type=repo_type,
)
- return get_class_in_module(class_name, final_module.replace(".py", ""))
+ return get_class_in_module(repo_id, class_name, final_module.replace(".py", ""))
def custom_object_save(obj: Any, folder: Union[str, os.PathLike], config: Optional[Dict] = None) -> List[str]:
diff --git a/tests/models/auto/test_modeling_auto.py b/tests/models/auto/test_modeling_auto.py
index 7c47f39ea68c8a..ab5fa95796eac5 100644
--- a/tests/models/auto/test_modeling_auto.py
+++ b/tests/models/auto/test_modeling_auto.py
@@ -376,6 +376,27 @@ def test_from_pretrained_dynamic_model_distant_with_ref(self):
for p1, p2 in zip(model.parameters(), reloaded_model.parameters()):
self.assertTrue(torch.equal(p1, p2))
+ def test_from_pretrained_dynamic_model_with_period(self):
+ # We used to have issues where repos with "." in the name would cause issues because the Python
+ # import machinery would treat that as a directory separator, so we test that case
+
+ # If remote code is not set, we will time out when asking whether to load the model.
+ with self.assertRaises(ValueError):
+ model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model_v1.0")
+ # If remote code is disabled, we can't load this config.
+ with self.assertRaises(ValueError):
+ model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model_v1.0", trust_remote_code=False)
+
+ model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model_v1.0", trust_remote_code=True)
+ self.assertEqual(model.__class__.__name__, "NewModel")
+
+ # Test that it works with a custom cache dir too
+ with tempfile.TemporaryDirectory() as tmp_dir:
+ model = AutoModel.from_pretrained(
+ "hf-internal-testing/test_dynamic_model_v1.0", trust_remote_code=True, cache_dir=tmp_dir
+ )
+ self.assertEqual(model.__class__.__name__, "NewModel")
+
def test_new_model_registration(self):
AutoConfig.register("custom", CustomConfig)
From c8d98405a8f7b0e5d07391b671dcc61bb9d7bad5 Mon Sep 17 00:00:00 2001
From: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Date: Fri, 23 Feb 2024 21:37:08 +0800
Subject: [PATCH 008/549] Use torch 2.2 for daily CI (model tests) (#29208)
* Use torch 2.2 for daily CI (model tests)
* update
* update
---------
Co-authored-by: ydshieh
---
.github/workflows/build-docker-images.yml | 12 +------
docker/transformers-all-latest-gpu/Dockerfile | 33 +++++++------------
2 files changed, 12 insertions(+), 33 deletions(-)
diff --git a/.github/workflows/build-docker-images.yml b/.github/workflows/build-docker-images.yml
index be070a95d3a94f..2b198bd4af56c5 100644
--- a/.github/workflows/build-docker-images.yml
+++ b/.github/workflows/build-docker-images.yml
@@ -20,18 +20,8 @@ concurrency:
jobs:
latest-docker:
name: "Latest PyTorch + TensorFlow [dev]"
- runs-on: ubuntu-22.04
+ runs-on: [intel-cpu, 8-cpu, ci]
steps:
- - name: Cleanup disk
- run: |
- sudo ls -l /usr/local/lib/
- sudo ls -l /usr/share/
- sudo du -sh /usr/local/lib/
- sudo du -sh /usr/share/
- sudo rm -rf /usr/local/lib/android
- sudo rm -rf /usr/share/dotnet
- sudo du -sh /usr/local/lib/
- sudo du -sh /usr/share/
-
name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
diff --git a/docker/transformers-all-latest-gpu/Dockerfile b/docker/transformers-all-latest-gpu/Dockerfile
index e96eb9539c8bd2..9afac41d5b040e 100644
--- a/docker/transformers-all-latest-gpu/Dockerfile
+++ b/docker/transformers-all-latest-gpu/Dockerfile
@@ -9,9 +9,9 @@ SHELL ["sh", "-lc"]
# The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
# to be used as arguments for docker build (so far).
-ARG PYTORCH='2.1.1'
+ARG PYTORCH='2.2.0'
# (not always a valid torch version)
-ARG INTEL_TORCH_EXT='2.1.100'
+ARG INTEL_TORCH_EXT='2.2.0'
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu118'
@@ -23,6 +23,14 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip
ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
+# During switch torch 2.2, we need to move (explicit) torch installation below but keep tf installation here.
+# (otherwise we get `The runner has received a shutdown signal.` whose root cause is unknown but likely disk being full)
+RUN python3 -m pip install --no-cache-dir -U tensorflow==2.13 protobuf==3.20.3 tensorflow_text tensorflow_probability
+
+RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]
+
+# RUN python3 -m pip uninstall -y torch torchvision torchaudio && python3 -m pip install --no-cache-dir -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+
# TODO: Handle these in a python utility script
RUN [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
RUN echo torch=$VERSION
@@ -31,10 +39,6 @@ RUN echo torch=$VERSION
# TODO: We might need to specify proper versions that work with a specific torch version (especially for past CI).
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
-RUN python3 -m pip install --no-cache-dir -U tensorflow==2.13 protobuf==3.20.3 tensorflow_text tensorflow_probability
-
-RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]
-
RUN python3 -m pip uninstall -y flax jax
RUN python3 -m pip install --no-cache-dir intel_extension_for_pytorch==$INTEL_TORCH_EXT -f https://developer.intel.com/ipex-whl-stable-cpu
@@ -46,22 +50,7 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/acc
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/peft@main#egg=peft
-# Add bitsandbytes for mixed int8 testing
-RUN python3 -m pip install --no-cache-dir bitsandbytes
-
-# Add auto-gptq for gtpq quantization testing
-RUN python3 -m pip install --no-cache-dir auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
-
-# Add einops for additional model testing
-RUN python3 -m pip install --no-cache-dir einops
-
-# Add aqlm for quantization testing
-RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.1
-
-# Add autoawq for quantization testing
-RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl
-
-# For bettertransformer + gptq
+# For bettertransformer
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
# For video model testing
From 9fe360883e14c3373014e276b2c9db66f77049c1 Mon Sep 17 00:00:00 2001
From: Benjamin Muskalla
Date: Mon, 26 Feb 2024 10:01:45 +0100
Subject: [PATCH 009/549] Cache `is_vision_available` result (#29280)
Cache `is_vision_available`
This check is used quite often during processing in image models and can take up a significant amount of time compared to the other processing steps.
---
src/transformers/utils/import_utils.py | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py
index 57b4e840414be0..8cf6c1a14f372f 100644
--- a/src/transformers/utils/import_utils.py
+++ b/src/transformers/utils/import_utils.py
@@ -741,6 +741,7 @@ def is_tokenizers_available():
return _tokenizers_available
+@lru_cache
def is_vision_available():
_pil_available = importlib.util.find_spec("PIL") is not None
if _pil_available:
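A minimal standalone sketch (using a hypothetical `is_pil_available` helper, not the library's own function) of what `@lru_cache` buys here: the import-machinery lookup runs only on the first call, and every later call returns the cached result.

```python
import importlib.util
from functools import lru_cache


@lru_cache
def is_pil_available() -> bool:
    # The expensive part is the find_spec lookup; with lru_cache it happens only once.
    return importlib.util.find_spec("PIL") is not None


print(is_pil_available())  # first call performs the lookup
print(is_pil_available())  # subsequent calls hit the cache
```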
From 93f8617afdadf34a3815921510b3a83925ef5db2 Mon Sep 17 00:00:00 2001
From: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Date: Mon, 26 Feb 2024 17:41:01 +0800
Subject: [PATCH 010/549] Use `DS_DISABLE_NINJA=1` (#29290)
Co-authored-by: ydshieh
---
.github/workflows/self-scheduled.yml | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/.github/workflows/self-scheduled.yml b/.github/workflows/self-scheduled.yml
index d44e9a29ecf0da..c3c77925bbe734 100644
--- a/.github/workflows/self-scheduled.yml
+++ b/.github/workflows/self-scheduled.yml
@@ -265,7 +265,7 @@ jobs:
working-directory: /workspace
run: |
python3 -m pip uninstall -y deepspeed
- DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+ DS_DISABLE_NINJA=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
- name: NVIDIA-SMI
run: |
From 2a7746c4d16eebc58a315cdd15720c69c65eac6f Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Mon, 26 Feb 2024 11:05:49 +0100
Subject: [PATCH 011/549] Add `non_device_test` pytest mark to filter out
non-device tests (#29213)
* add conftest
* fix
* remove deselected
---
conftest.py | 46 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/conftest.py b/conftest.py
index 0b5daf574f0bc9..74220895aaec71 100644
--- a/conftest.py
+++ b/conftest.py
@@ -21,10 +21,49 @@
from os.path import abspath, dirname, join
import _pytest
+import pytest
from transformers.testing_utils import HfDoctestModule, HfDocTestParser
+NOT_DEVICE_TESTS = {
+ "test_tokenization",
+ "test_processor",
+ "test_processing",
+ "test_feature_extraction",
+ "test_image_processing",
+ "test_image_processor",
+ "test_retrieval",
+ "test_config",
+ "test_from_pretrained_no_checkpoint",
+ "test_keep_in_fp32_modules",
+ "test_gradient_checkpointing_backward_compatibility",
+ "test_gradient_checkpointing_enable_disable",
+ "test_save_load_fast_init_from_base",
+ "test_fast_init_context_manager",
+ "test_fast_init_tied_embeddings",
+ "test_save_load_fast_init_to_base",
+ "test_torch_save_load",
+ "test_initialization",
+ "test_forward_signature",
+ "test_model_common_attributes",
+ "test_model_main_input_name",
+ "test_correct_missing_keys",
+ "test_tie_model_weights",
+ "test_can_use_safetensors",
+ "test_load_save_without_tied_weights",
+ "test_tied_weights_keys",
+ "test_model_weights_reload_no_missing_tied_weights",
+ "test_pt_tf_model_equivalence",
+ "test_mismatched_shapes_have_properly_initialized_weights",
+ "test_matched_shapes_have_loaded_weights_when_some_mismatched_shapes_exist",
+ "test_model_is_small",
+ "test_tf_from_pt_safetensors",
+ "test_flax_from_pt_safetensors",
+ "ModelTest::test_pipeline_", # None of the pipeline tests from PipelineTesterMixin (of which XxxModelTest inherits from) are running on device
+ "ModelTester::test_pipeline_",
+}
+
# allow having multiple repository checkouts and not needing to remember to rerun
# `pip install -e '.[dev]'` when switching between checkouts and running tests.
git_repo_path = abspath(join(dirname(__file__), "src"))
@@ -46,6 +85,13 @@ def pytest_configure(config):
config.addinivalue_line("markers", "is_staging_test: mark test to run only in the staging environment")
config.addinivalue_line("markers", "accelerate_tests: mark test that require accelerate")
config.addinivalue_line("markers", "tool_tests: mark the tool tests that are run on their specific schedule")
+ config.addinivalue_line("markers", "not_device_test: mark the tests always running on cpu")
+
+
+def pytest_collection_modifyitems(items):
+ for item in items:
+ if any(test_name in item.nodeid for test_name in NOT_DEVICE_TESTS):
+ item.add_marker(pytest.mark.not_device_test)
def pytest_addoption(parser):
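A usage sketch for the marker registered above, assuming the repository's usual `tests/` layout: a device CI job can drop the CPU-only tests by deselecting everything tagged `not_device_test`.

```python
import pytest

# Equivalent to running `pytest -m "not not_device_test" tests/` from the repo root.
exit_code = pytest.main(["-m", "not not_device_test", "tests/"])
print(exit_code)
```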
From 7c4995f93d8d24aae05e1e43279c96dce736e5c8 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Mon, 26 Feb 2024 13:35:37 +0300
Subject: [PATCH 012/549] Add feature extraction mapping for automatic metadata
update (#28944)
* add feature extraction mapping
* added prefix
* ruff check
* minor fix
* Update modeling_auto.py
* fix typo
* remove prefix to make variable public/importable
* Update src/transformers/models/auto/modeling_auto.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* fixes
* addressed comments
* nit
* fix-copies
* remove from tests
* this should fix
* Update tests/models/convnextv2/test_modeling_convnextv2.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* nits
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---
src/transformers/__init__.py | 2 +
src/transformers/models/auto/__init__.py | 2 +
src/transformers/models/auto/modeling_auto.py | 54 ++++++++++++++++++-
src/transformers/trainer.py | 5 +-
src/transformers/utils/dummy_pt_objects.py | 3 ++
src/transformers/utils/fx.py | 2 +
tests/test_modeling_common.py | 5 +-
utils/check_repo.py | 2 +
utils/update_metadata.py | 1 +
9 files changed, 73 insertions(+), 3 deletions(-)
mode change 100644 => 100755 utils/update_metadata.py
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index 88c67226bc7742..f427c4be7b3c76 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -1460,6 +1460,7 @@
"MODEL_FOR_DEPTH_ESTIMATION_MAPPING",
"MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING",
"MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING",
+ "MODEL_FOR_IMAGE_MAPPING",
"MODEL_FOR_IMAGE_SEGMENTATION_MAPPING",
"MODEL_FOR_IMAGE_TO_IMAGE_MAPPING",
"MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING",
@@ -6203,6 +6204,7 @@
MODEL_FOR_DEPTH_ESTIMATION_MAPPING,
MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING,
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
+ MODEL_FOR_IMAGE_MAPPING,
MODEL_FOR_IMAGE_SEGMENTATION_MAPPING,
MODEL_FOR_IMAGE_TO_IMAGE_MAPPING,
MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING,
diff --git a/src/transformers/models/auto/__init__.py b/src/transformers/models/auto/__init__.py
index 153f7f10def694..3db995a9c74092 100644
--- a/src/transformers/models/auto/__init__.py
+++ b/src/transformers/models/auto/__init__.py
@@ -49,6 +49,7 @@
"MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING",
"MODEL_FOR_DEPTH_ESTIMATION_MAPPING",
"MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING",
+ "MODEL_FOR_IMAGE_MAPPING",
"MODEL_FOR_IMAGE_SEGMENTATION_MAPPING",
"MODEL_FOR_IMAGE_TO_IMAGE_MAPPING",
"MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING",
@@ -233,6 +234,7 @@
MODEL_FOR_DEPTH_ESTIMATION_MAPPING,
MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING,
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
+ MODEL_FOR_IMAGE_MAPPING,
MODEL_FOR_IMAGE_SEGMENTATION_MAPPING,
MODEL_FOR_IMAGE_TO_IMAGE_MAPPING,
MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING,
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 1fc959119d99fb..50534c58e8aaf4 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -29,7 +29,6 @@
logger = logging.get_logger(__name__)
-
MODEL_MAPPING_NAMES = OrderedDict(
[
# Base model mapping
@@ -478,6 +477,58 @@
]
)
+MODEL_FOR_IMAGE_MAPPING_NAMES = OrderedDict(
+ [
+ # Model for Image mapping
+ ("beit", "BeitModel"),
+ ("bit", "BitModel"),
+ ("conditional_detr", "ConditionalDetrModel"),
+ ("convnext", "ConvNextModel"),
+ ("convnextv2", "ConvNextV2Model"),
+ ("data2vec-vision", "Data2VecVisionModel"),
+ ("deformable_detr", "DeformableDetrModel"),
+ ("deit", "DeiTModel"),
+ ("deta", "DetaModel"),
+ ("detr", "DetrModel"),
+ ("dinat", "DinatModel"),
+ ("dinov2", "Dinov2Model"),
+ ("dpt", "DPTModel"),
+ ("efficientformer", "EfficientFormerModel"),
+ ("efficientnet", "EfficientNetModel"),
+ ("focalnet", "FocalNetModel"),
+ ("glpn", "GLPNModel"),
+ ("imagegpt", "ImageGPTModel"),
+ ("levit", "LevitModel"),
+ ("mobilenet_v1", "MobileNetV1Model"),
+ ("mobilenet_v2", "MobileNetV2Model"),
+ ("mobilevit", "MobileViTModel"),
+ ("mobilevitv2", "MobileViTV2Model"),
+ ("nat", "NatModel"),
+ ("poolformer", "PoolFormerModel"),
+ ("pvt", "PvtModel"),
+ ("regnet", "RegNetModel"),
+ ("resnet", "ResNetModel"),
+ ("segformer", "SegformerModel"),
+ ("siglip_vision_model", "SiglipVisionModel"),
+ ("swiftformer", "SwiftFormerModel"),
+ ("swin", "SwinModel"),
+ ("swin2sr", "Swin2SRModel"),
+ ("swinv2", "Swinv2Model"),
+ ("table-transformer", "TableTransformerModel"),
+ ("timesformer", "TimesformerModel"),
+ ("timm_backbone", "TimmBackbone"),
+ ("van", "VanModel"),
+ ("videomae", "VideoMAEModel"),
+ ("vit", "ViTModel"),
+ ("vit_hybrid", "ViTHybridModel"),
+ ("vit_mae", "ViTMAEModel"),
+ ("vit_msn", "ViTMSNModel"),
+ ("vitdet", "VitDetModel"),
+ ("vivit", "VivitModel"),
+ ("yolos", "YolosModel"),
+ ]
+)
+
MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING_NAMES = OrderedDict(
[
("deit", "DeiTForMaskedImageModeling"),
@@ -1243,6 +1294,7 @@
CONFIG_MAPPING_NAMES, MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES
)
MODEL_FOR_MASKED_LM_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_MASKED_LM_MAPPING_NAMES)
+MODEL_FOR_IMAGE_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_MAPPING_NAMES)
MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING_NAMES
)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index a2436dadc1a812..1b70db000ccfeb 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -63,7 +63,10 @@
from .integrations.tpu import tpu_spmd_dataloader
from .modelcard import TrainingSummary
from .modeling_utils import PreTrainedModel, load_sharded_checkpoint, unwrap_model
-from .models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, MODEL_MAPPING_NAMES
+from .models.auto.modeling_auto import (
+ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
+ MODEL_MAPPING_NAMES,
+)
from .optimization import Adafactor, get_scheduler
from .pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
from .tokenization_utils_base import PreTrainedTokenizerBase
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index de22b2d36fe127..dd2e50c67d0e3f 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -598,6 +598,9 @@ def __init__(self, *args, **kwargs):
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING = None
+MODEL_FOR_IMAGE_MAPPING = None
+
+
MODEL_FOR_IMAGE_SEGMENTATION_MAPPING = None
diff --git a/src/transformers/utils/fx.py b/src/transformers/utils/fx.py
index 9f5c36a18a356b..be726b8541691d 100755
--- a/src/transformers/utils/fx.py
+++ b/src/transformers/utils/fx.py
@@ -39,6 +39,7 @@
MODEL_FOR_CTC_MAPPING_NAMES,
MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES,
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_FOR_IMAGE_MAPPING_NAMES,
MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING_NAMES,
MODEL_FOR_MASKED_LM_MAPPING_NAMES,
MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES,
@@ -95,6 +96,7 @@ def _generate_supported_model_class_names(
"audio-classification": MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES,
"semantic-segmentation": MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES,
"backbone": MODEL_FOR_BACKBONE_MAPPING_NAMES,
+ "image-feature-extraction": MODEL_FOR_IMAGE_MAPPING_NAMES,
}
if supported_tasks is None:
diff --git a/tests/test_modeling_common.py b/tests/test_modeling_common.py
index 32f6abcbe3aad1..a2a16a1400069c 100755
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -700,7 +700,10 @@ def check_training_gradient_checkpointing(self, gradient_checkpointing_kwargs=No
for model_class in self.all_model_classes:
if (
model_class.__name__
- in [*get_values(MODEL_MAPPING_NAMES), *get_values(MODEL_FOR_BACKBONE_MAPPING_NAMES)]
+ in [
+ *get_values(MODEL_MAPPING_NAMES),
+ *get_values(MODEL_FOR_BACKBONE_MAPPING_NAMES),
+ ]
or not model_class.supports_gradient_checkpointing
):
continue
diff --git a/utils/check_repo.py b/utils/check_repo.py
index aa448f32e62d8f..ca25d7d9e32bf1 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -732,6 +732,8 @@ def check_all_auto_object_names_being_defined():
# module, if it's a private model defined in this file.
if name.endswith("MODEL_MAPPING_NAMES") and is_a_private_model(class_name):
continue
+ if name.endswith("MODEL_FOR_IMAGE_MAPPING_NAMES") and is_a_private_model(class_name):
+ continue
failures.append(
f"`{class_name}` appears in the mapping `{name}` but it is not defined in the library."
)
diff --git a/utils/update_metadata.py b/utils/update_metadata.py
old mode 100644
new mode 100755
index 2104d53b6e6f27..0762c4c2aa73fd
--- a/utils/update_metadata.py
+++ b/utils/update_metadata.py
@@ -62,6 +62,7 @@
PIPELINE_TAGS_AND_AUTO_MODELS = [
("pretraining", "MODEL_FOR_PRETRAINING_MAPPING_NAMES", "AutoModelForPreTraining"),
("feature-extraction", "MODEL_MAPPING_NAMES", "AutoModel"),
+ ("image-feature-extraction", "MODEL_FOR_IMAGE_MAPPING_NAMES", "AutoModel"),
("audio-classification", "MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES", "AutoModelForAudioClassification"),
("text-generation", "MODEL_FOR_CAUSAL_LM_MAPPING_NAMES", "AutoModelForCausalLM"),
("automatic-speech-recognition", "MODEL_FOR_CTC_MAPPING_NAMES", "AutoModelForCTC"),
From 24d59c79698d5d6c0364f9445acca29a4bd3153b Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Mon, 26 Feb 2024 14:06:43 +0100
Subject: [PATCH 013/549] Use `torch.bool` instead of `torch.int64` for
non-persistent causal mask buffer (#29241)
use torch.bool instead of torch.int64
---
src/transformers/models/gemma/modeling_gemma.py | 7 +++++--
src/transformers/models/llama/modeling_llama.py | 11 ++++++++---
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 4cb12ff4700598..4e6e7cd8ab6d35 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -810,8 +810,11 @@ def __init__(self, config: GemmaConfig):
self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
- # register a causal mask to separate causal and padding mask creation. Merging happends in the attention class
- causal_mask = torch.full((config.max_position_embeddings, config.max_position_embeddings), fill_value=1)
+ # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
+ # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_position_embeddings`.
+ causal_mask = torch.full(
+ (config.max_position_embeddings, config.max_position_embeddings), fill_value=True, dtype=torch.bool
+ )
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
# Initialize weights and apply final processing
self.post_init()
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 66a50c58089191..8b55b4f7a3f78c 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -811,7 +811,9 @@ def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] =
)
if max_cache_len > self.model.causal_mask.shape[-1] or self.device != self.model.causal_mask.device:
- causal_mask = torch.full((max_cache_len, max_cache_len), fill_value=1, device=self.device)
+ causal_mask = torch.full(
+ (max_cache_len, max_cache_len), fill_value=True, device=self.device, dtype=torch.bool
+ )
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
for layer in self.model.layers:
@@ -919,8 +921,11 @@ def __init__(self, config: LlamaConfig):
self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
- # register a causal mask to separate causal and padding mask creation. Merging happends in the attention class
- causal_mask = torch.full((config.max_position_embeddings, config.max_position_embeddings), fill_value=1)
+ # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
+ # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_position_embeddings`.
+ causal_mask = torch.full(
+ (config.max_position_embeddings, config.max_position_embeddings), fill_value=True, dtype=torch.bool
+ )
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
# Initialize weights and apply final processing
self.post_init()
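A small illustration, outside the modeling code, of what the dtype change buys for this non-persistent buffer (4096 is just an example value for `max_position_embeddings`):

```python
import torch

n = 4096  # illustrative max_position_embeddings
mask_int64 = torch.triu(torch.full((n, n), fill_value=1, dtype=torch.int64), diagonal=1)
mask_bool = torch.triu(torch.full((n, n), fill_value=True, dtype=torch.bool), diagonal=1)

# 8 bytes vs 1 byte per element: ~128 MiB vs ~16 MiB at this buffer size.
print(mask_int64.element_size(), mask_bool.element_size())
print(mask_int64.numel() * mask_int64.element_size() / 2**20)
print(mask_bool.numel() * mask_bool.element_size() / 2**20)
```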
From ece1b62b93cde70233f235f6a4c84e37bfc8eba0 Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Mon, 26 Feb 2024 13:36:12 +0000
Subject: [PATCH 014/549] Generate: v4.38 removals and related updates (#29171)
---
src/transformers/generation/__init__.py | 6 ++++
.../generation/candidate_generator.py | 3 +-
src/transformers/generation/utils.py | 36 ++++++-------------
src/transformers/models/opt/modeling_opt.py | 4 +--
src/transformers/utils/__init__.py | 1 -
src/transformers/utils/import_utils.py | 8 -----
6 files changed, 20 insertions(+), 38 deletions(-)
diff --git a/src/transformers/generation/__init__.py b/src/transformers/generation/__init__.py
index d1e81cffca67ed..e45f546cdc2780 100644
--- a/src/transformers/generation/__init__.py
+++ b/src/transformers/generation/__init__.py
@@ -40,6 +40,11 @@
"BeamSearchScorer",
"ConstrainedBeamSearchScorer",
]
+ _import_structure["candidate_generator"] = [
+ "AssistedCandidateGenerator",
+ "CandidateGenerator",
+ "PromptLookupCandidateGenerator",
+ ]
_import_structure["logits_process"] = [
"AlternatingCodebooksLogitsProcessor",
"ClassifierFreeGuidanceLogitsProcessor",
@@ -178,6 +183,7 @@
else:
from .beam_constraints import Constraint, ConstraintListState, DisjunctiveConstraint, PhrasalConstraint
from .beam_search import BeamHypotheses, BeamScorer, BeamSearchScorer, ConstrainedBeamSearchScorer
+ from .candidate_generator import AssistedCandidateGenerator, CandidateGenerator, PromptLookupCandidateGenerator
from .logits_process import (
AlternatingCodebooksLogitsProcessor,
ClassifierFreeGuidanceLogitsProcessor,
diff --git a/src/transformers/generation/candidate_generator.py b/src/transformers/generation/candidate_generator.py
index 616afa193176ea..4b8fa144f04b6b 100644
--- a/src/transformers/generation/candidate_generator.py
+++ b/src/transformers/generation/candidate_generator.py
@@ -99,7 +99,8 @@ def __init__(
# Make sure all data at the same device as assistant model
device = assistant_model.device
input_ids = input_ids.to(device)
- inputs_tensor = inputs_tensor.to(device)
+ if inputs_tensor is not None:
+ inputs_tensor = inputs_tensor.to(device)
# Prepare the assistant and the starting number of candidate tokens
self.assistant_model = assistant_model
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index d337e559344099..c7e03123a9eaf3 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -4319,7 +4319,6 @@ def constrained_beam_search(
def assisted_decoding(
self,
input_ids: torch.LongTensor,
- assistant_model: Optional["PreTrainedModel"] = None,
candidate_generator: Optional["CandidateGenerator"] = None,
do_sample: bool = False,
logits_processor: Optional[LogitsProcessorList] = None,
@@ -4355,12 +4354,7 @@ def assisted_decoding(
The sequence used as a prompt for the generation.
candidate_generator (`CandidateGenerator`, *optional*):
A derived instance of [`CandidateGenerator`] that defines how candidate sequences are generated. For
- more information, the documentation of [`CandidateGenerator`] should be read. Only one of `assistant_model` or `candidate_generator` should be passed as input to this function.
- assistant_model (`PreTrainedModel`, *optional*):
- An assistant model that can be used to accelerate generation. The assistant model must have the exact
- same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistent model
- is much faster than running generation with the model you're calling generate from. As such, the
- assistant model should be much smaller.
+ more information, the documentation of [`CandidateGenerator`] should be read.
do_sample (`bool`, *optional*, defaults to `False`):
Whether or not to use sampling ; use greedy decoding otherwise.
logits_processor (`LogitsProcessorList`, *optional*):
@@ -4417,6 +4411,7 @@ def assisted_decoding(
... StoppingCriteriaList,
... MaxLengthCriteria,
... )
+ >>> from transformers.generation import AssistedCandidateGenerator
>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
@@ -4432,33 +4427,22 @@ def assisted_decoding(
... ]
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
+ >>> candidate_generator = AssistedCandidateGenerator(
+ ... input_ids=input_ids,
+ ... assistant_model=assistant_model,
+ ... generation_config=model.generation_config,
+ ... logits_processor=logits_processor,
+ ... model_kwargs={},
+ ... )
>>> outputs = model.assisted_decoding(
... input_ids,
- ... assistant_model=assistant_model,
+ ... candidate_generator=candidate_generator,
... logits_processor=logits_processor,
... stopping_criteria=stopping_criteria,
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["It might be possible to get a better understanding of the nature of the problem, but it's not"]
```"""
- # handling deprecated arguments
- if (assistant_model is None) == (candidate_generator is None):
- raise ValueError("One (and only one) of `assistant_model` and `candidate_generator` should be defined.")
-
- if assistant_model is not None:
- candidate_generator = AssistedCandidateGenerator(
- input_ids=input_ids,
- assistant_model=assistant_model,
- logits_processor=logits_processor,
- model_kwargs=model_kwargs,
- eos_token_id=eos_token_id,
- )
- warnings.warn(
- "Passing `assistant_model` to `assisted_decoding` is deprecated and will be removed in v4.38. "
- "Pass the `candidate_generator` argument instead.",
- FutureWarning,
- )
-
# init values
logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
logits_warper = logits_warper if logits_warper is not None else LogitsProcessorList()
diff --git a/src/transformers/models/opt/modeling_opt.py b/src/transformers/models/opt/modeling_opt.py
index d6f0924f427bb3..7c66f5c255e584 100644
--- a/src/transformers/models/opt/modeling_opt.py
+++ b/src/transformers/models/opt/modeling_opt.py
@@ -129,8 +129,8 @@ def _handle_deprecated_argument(config_arg_name, config, fn_arg_name, kwargs):
val = None
if fn_arg_name in kwargs:
logging.warning(
- "Passing in {} to {self.__class__.__name__} is deprecated and won't be supported from v4.38."
- " Please set it in the config instead"
+ "Passing in {fn_arg_name} to {self.__class__.__name__} is deprecated and won't be supported from "
+ "v4.39. Please set it in the config instead"
)
val = kwargs.pop(fn_arg_name)
else:
diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py
index 3a3c65a3b7d670..154077924beadf 100644
--- a/src/transformers/utils/__init__.py
+++ b/src/transformers/utils/__init__.py
@@ -120,7 +120,6 @@
is_essentia_available,
is_faiss_available,
is_flash_attn_2_available,
- is_flash_attn_available,
is_flash_attn_greater_or_equal_2_10,
is_flax_available,
is_fsdp_available,
diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py
index 8cf6c1a14f372f..095af536621f27 100644
--- a/src/transformers/utils/import_utils.py
+++ b/src/transformers/utils/import_utils.py
@@ -665,14 +665,6 @@ def is_flash_attn_greater_or_equal_2_10():
return version.parse(importlib.metadata.version("flash_attn")) >= version.parse("2.1.0")
-def is_flash_attn_available():
- logger.warning(
- "Using `is_flash_attn_available` is deprecated and will be removed in v4.38. "
- "Please use `is_flash_attn_2_available` instead."
- )
- return is_flash_attn_2_available()
-
-
def is_torchdistx_available():
return _torchdistx_available
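With the deprecated `assistant_model` path removed from `assisted_decoding`, the supported entry point remains `generate()`, which builds the candidate generator internally. A hedged usage sketch (checkpoint names are only examples; the two models must share a tokenizer):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-large")
assistant = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")  # smaller draft model

inputs = tokenizer("Assisted decoding lets a small model draft tokens that", return_tensors="pt")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```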
From 8f2f0f0f85f9e517c495b2083c218215819bae34 Mon Sep 17 00:00:00 2001
From: Raushan Turganbay
Date: Mon, 26 Feb 2024 21:06:16 +0500
Subject: [PATCH 015/549] Track each row separately for stopping criteria
(#29116)
---
.../generation/stopping_criteria.py | 26 +++++++-----
src/transformers/generation/utils.py | 40 ++++++++-----------
tests/generation/test_stopping_criteria.py | 22 +++++-----
3 files changed, 43 insertions(+), 45 deletions(-)
diff --git a/src/transformers/generation/stopping_criteria.py b/src/transformers/generation/stopping_criteria.py
index ca3e8509644081..8516c6157250d4 100644
--- a/src/transformers/generation/stopping_criteria.py
+++ b/src/transformers/generation/stopping_criteria.py
@@ -29,7 +29,8 @@
Additional stopping criteria specific kwargs.
Return:
- `bool`. `False` indicates we should continue, `True` indicates we should stop.
+ `torch.BoolTensor`. (`torch.BoolTensor` of shape `(batch_size,)`), where `True` indicates we stop generation
+ for a particular row, and `False` indicates we should continue.
"""
@@ -42,7 +43,7 @@ class StoppingCriteria(ABC):
"""
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
- def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
raise NotImplementedError("StoppingCriteria needs to be subclassed")
@@ -63,7 +64,7 @@ def __init__(self, max_length: int, max_position_embeddings: Optional[int] = Non
self.max_position_embeddings = max_position_embeddings
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
- def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
cur_len = input_ids.shape[-1]
is_done = cur_len >= self.max_length
if self.max_position_embeddings is not None and not is_done and cur_len >= self.max_position_embeddings:
@@ -72,7 +73,7 @@ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwa
f"maximum length ({self.max_position_embeddings}). Depending on the model, you may observe "
"exceptions, performance degradation, or nothing at all."
)
- return is_done
+ return torch.full((input_ids.shape[0],), is_done, device=input_ids.device)
class MaxNewTokensCriteria(StoppingCriteria):
@@ -100,8 +101,9 @@ def __init__(self, start_length: int, max_new_tokens: int):
self.max_length = start_length + max_new_tokens
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
- def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
- return input_ids.shape[-1] >= self.max_length
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
+ is_done = input_ids.shape[-1] >= self.max_length
+ return torch.full((input_ids.shape[0],), is_done, device=input_ids.device)
class MaxTimeCriteria(StoppingCriteria):
@@ -122,14 +124,18 @@ def __init__(self, max_time: float, initial_timestamp: Optional[float] = None):
self.initial_timestamp = time.time() if initial_timestamp is None else initial_timestamp
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
- def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
- return time.time() - self.initial_timestamp > self.max_time
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
+ is_done = time.time() - self.initial_timestamp > self.max_time
+ return torch.full((input_ids.shape[0],), is_done, device=input_ids.device)
class StoppingCriteriaList(list):
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
- def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
- return any(criteria(input_ids, scores, **kwargs) for criteria in self)
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
+ is_done = torch.full((input_ids.shape[0],), False, device=input_ids.device)
+ for criteria in self:
+ is_done = is_done | criteria(input_ids, scores, **kwargs)
+ return is_done
@property
def max_length(self) -> Optional[int]:
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index c7e03123a9eaf3..ff5421ad4832a5 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -2194,12 +2194,10 @@ def contrastive_search(
next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
)
- # stop when each sentence is finished
- if unfinished_sequences.max() == 0:
- this_peer_finished = True
+ # stop when each sentence is finished
+ unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
- # stop if we exceed the maximum length
- if stopping_criteria(input_ids, scores):
+ if unfinished_sequences.max() == 0:
this_peer_finished = True
if this_peer_finished and not synced_gpus:
@@ -2478,12 +2476,10 @@ def greedy_search(
next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
)
- # stop when each sentence is finished
- if unfinished_sequences.max() == 0:
- this_peer_finished = True
+ unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
- # stop if we exceed the maximum length
- if stopping_criteria(input_ids, scores):
+ # stop when each sentence is finished
+ if unfinished_sequences.max() == 0:
this_peer_finished = True
if this_peer_finished and not synced_gpus:
@@ -2772,12 +2768,10 @@ def sample(
next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
)
- # stop when each sentence is finished
- if unfinished_sequences.max() == 0:
- this_peer_finished = True
+ unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
- # stop if we exceed the maximum length
- if stopping_criteria(input_ids, scores):
+ # stop when each sentence is finished
+ if unfinished_sequences.max() == 0:
this_peer_finished = True
if this_peer_finished and not synced_gpus:
@@ -3169,7 +3163,7 @@ def beam_search(
# increase cur_len
cur_len = cur_len + 1
- if beam_scorer.is_done or stopping_criteria(input_ids, scores):
+ if beam_scorer.is_done or all(stopping_criteria(input_ids, scores)):
if not synced_gpus:
break
else:
@@ -3516,7 +3510,7 @@ def beam_sample(
# increase cur_len
cur_len = cur_len + 1
- if beam_scorer.is_done or stopping_criteria(input_ids, scores):
+ if beam_scorer.is_done or all(stopping_criteria(input_ids, scores)):
if not synced_gpus:
break
else:
@@ -3912,7 +3906,7 @@ def group_beam_search(
# increase cur_len
cur_len = cur_len + 1
- if beam_scorer.is_done or stopping_criteria(input_ids, scores):
+ if beam_scorer.is_done or all(stopping_criteria(input_ids, scores)):
if not synced_gpus:
break
else:
@@ -4267,7 +4261,7 @@ def constrained_beam_search(
# increase cur_len
cur_len = cur_len + 1
- if constrained_beam_scorer.is_done or stopping_criteria(input_ids, scores):
+ if constrained_beam_scorer.is_done or all(stopping_criteria(input_ids, scores)):
if not synced_gpus:
break
else:
@@ -4657,12 +4651,10 @@ def assisted_decoding(
.prod(dim=0)
)
- # stop when each sentence is finished
- if unfinished_sequences.max() == 0:
- this_peer_finished = True
+ unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
- # stop if we exceed the maximum length
- if stopping_criteria(input_ids, scores):
+ # stop when each sentence is finished
+ if unfinished_sequences.max() == 0:
this_peer_finished = True
if this_peer_finished and not synced_gpus:
diff --git a/tests/generation/test_stopping_criteria.py b/tests/generation/test_stopping_criteria.py
index dfc5308359ffb3..7fa118c9e3550d 100644
--- a/tests/generation/test_stopping_criteria.py
+++ b/tests/generation/test_stopping_criteria.py
@@ -54,37 +54,37 @@ def test_list_criteria(self):
]
)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
input_ids, scores = self._get_tensors(9)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
input_ids, scores = self._get_tensors(10)
- self.assertTrue(criteria(input_ids, scores))
+ self.assertTrue(all(criteria(input_ids, scores)))
def test_max_length_criteria(self):
criteria = MaxLengthCriteria(max_length=10)
input_ids, scores = self._get_tensors(5)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
input_ids, scores = self._get_tensors(9)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
input_ids, scores = self._get_tensors(10)
- self.assertTrue(criteria(input_ids, scores))
+ self.assertTrue(all(criteria(input_ids, scores)))
def test_max_new_tokens_criteria(self):
criteria = MaxNewTokensCriteria(start_length=5, max_new_tokens=5)
input_ids, scores = self._get_tensors(5)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
input_ids, scores = self._get_tensors(9)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
input_ids, scores = self._get_tensors(10)
- self.assertTrue(criteria(input_ids, scores))
+ self.assertTrue(all(criteria(input_ids, scores)))
criteria_list = StoppingCriteriaList([criteria])
self.assertEqual(criteria_list.max_length, 10)
@@ -93,10 +93,10 @@ def test_max_time_criteria(self):
input_ids, scores = self._get_tensors(5)
criteria = MaxTimeCriteria(max_time=0.1)
- self.assertFalse(criteria(input_ids, scores))
+ self.assertFalse(all(criteria(input_ids, scores)))
criteria = MaxTimeCriteria(max_time=0.1, initial_timestamp=time.time() - 0.2)
- self.assertTrue(criteria(input_ids, scores))
+ self.assertTrue(all(criteria(input_ids, scores)))
def test_validate_stopping_criteria(self):
validate_stopping_criteria(StoppingCriteriaList([MaxLengthCriteria(10)]), 10)
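A sketch of a custom criterion written against the new per-row contract, returning a `torch.BoolTensor` of shape `(batch_size,)` instead of a single bool (the class name and stop token are illustrative):

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList


class StopOnTokenPerRow(StoppingCriteria):
    """Marks a row as finished once its last generated token equals `stop_token_id`."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
        return input_ids[:, -1] == self.stop_token_id


criteria = StoppingCriteriaList([StopOnTokenPerRow(stop_token_id=0)])
batch = torch.tensor([[5, 7, 0], [5, 7, 9]])
print(criteria(batch, scores=None))  # tensor([ True, False])
```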
From 9f7535bda8dd932fbd252916366fc44221cf7bcc Mon Sep 17 00:00:00 2001
From: Aaron Jimenez
Date: Mon, 26 Feb 2024 08:18:15 -0800
Subject: [PATCH 016/549] [docs] Spanish translation of tasks_explained.md
(#29224)
* Add tasks_explained.md to es/
* Fix little typo in en/ version
* translate speech/audio section
* translate part of computer vision section | fix little typo in en/
* Fix little typo in en/
* Translate computer vision section | change ** ** to * * in both files
* Translate NLP section | fix link to tasks/translation in en/
* Update link in es/task_summary.md
* Fix task_summary title link
---
docs/source/en/tasks_explained.md | 2 +-
docs/source/es/_toctree.yml | 2 +
docs/source/es/task_summary.md | 9 +-
docs/source/es/tasks_explained.md | 295 ++++++++++++++++++++++++++++++
4 files changed, 299 insertions(+), 9 deletions(-)
create mode 100644 docs/source/es/tasks_explained.md
diff --git a/docs/source/en/tasks_explained.md b/docs/source/en/tasks_explained.md
index d453e38e86b9fa..f860377c7c9f0c 100644
--- a/docs/source/en/tasks_explained.md
+++ b/docs/source/en/tasks_explained.md
@@ -286,7 +286,7 @@ BART adapts to translation by adding a separate randomly initialized encoder to
BART has since been followed up by a multilingual version, mBART, intended for translation and pretrained on many different languages.
-Ready to try your hand at translation? Check out our complete [translation guide](tasks/summarization) to learn how to finetune T5 and use it for inference!
+Ready to try your hand at translation? Check out our complete [translation guide](tasks/translation) to learn how to finetune T5 and use it for inference!
diff --git a/docs/source/es/_toctree.yml b/docs/source/es/_toctree.yml
index 0be8191ecfff84..69334ba267e42e 100644
--- a/docs/source/es/_toctree.yml
+++ b/docs/source/es/_toctree.yml
@@ -84,6 +84,8 @@
title: Glosario
- local: task_summary
title: Lo que 🤗 Transformers puede hacer
+ - local: tasks_explained
+ title: Cómo los 🤗 Transformers resuelven tareas
- local: pad_truncation
title: Relleno y truncamiento
- local: bertology
diff --git a/docs/source/es/task_summary.md b/docs/source/es/task_summary.md
index 4aa6852ed35606..3c24f0dad14f2c 100644
--- a/docs/source/es/task_summary.md
+++ b/docs/source/es/task_summary.md
@@ -337,11 +337,4 @@ Las respuestas a preguntas de documentos es una tarea que responde preguntas en
[{'score': 0.8531, 'answer': '17,000', 'start': 4, 'end': 4}]
```
-Con suerte, esta página te ha proporcionado más información de fondo sobre todos los tipos de tareas en cada modalidad y la importancia práctica de cada una. En la próxima [sección](https://huggingface.co/docs/transformers/tasks_explained), aprenderás **cómo** 🤗 Transformers trabaja para resolver estas tareas.
-
-
\ No newline at end of file
+Con suerte, esta página te ha proporcionado más información de fondo sobre todos los tipos de tareas en cada modalidad y la importancia práctica de cada una. En la próxima [sección](tasks_explained), aprenderás **cómo** 🤗 Transformers trabaja para resolver estas tareas.
diff --git a/docs/source/es/tasks_explained.md b/docs/source/es/tasks_explained.md
new file mode 100644
index 00000000000000..9b13f521417890
--- /dev/null
+++ b/docs/source/es/tasks_explained.md
@@ -0,0 +1,295 @@
+
+
+# ¿Cómo los 🤗 Transformers resuelven tareas?
+
+En [Lo que 🤗 Transformers puede hacer](task_summary), aprendiste sobre el procesamiento de lenguaje natural (NLP), tareas de voz y audio, visión por computadora y algunas aplicaciones importantes de ellas. Esta página se centrará en cómo los modelos resuelven estas tareas y explicará lo que está sucediendo debajo de la superficie. Hay muchas maneras de resolver una tarea dada, y diferentes modelos pueden implementar ciertas técnicas o incluso abordar la tarea desde un ángulo nuevo, pero para los modelos Transformer, la idea general es la misma. Debido a su arquitectura flexible, la mayoría de los modelos son una variante de una estructura de codificador, descodificador o codificador-descodificador. Además de los modelos Transformer, nuestra biblioteca también tiene varias redes neuronales convolucionales (CNNs) modernas, que todavía se utilizan hoy en día para tareas de visión por computadora. También explicaremos cómo funciona una CNN moderna.
+
+Para explicar cómo se resuelven las tareas, caminaremos a través de lo que sucede dentro del modelo para generar predicciones útiles.
+
+- [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2) para clasificación de audio y reconocimiento automático de habla (ASR)
+- [Transformador de Visión (ViT)](https://huggingface.co/docs/transformers/model_doc/vit) y [ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext) para clasificación de imágenes
+- [DETR](https://huggingface.co/docs/transformers/model_doc/detr) para detección de objetos
+- [Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former) para segmentación de imagen
+- [GLPN](https://huggingface.co/docs/transformers/model_doc/glpn) para estimación de profundidad
+- [BERT](https://huggingface.co/docs/transformers/model_doc/bert) para tareas de NLP como clasificación de texto, clasificación de tokens y preguntas y respuestas que utilizan un codificador
+- [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2) para tareas de NLP como generación de texto que utilizan un descodificador
+- [BART](https://huggingface.co/docs/transformers/model_doc/bart) para tareas de NLP como resumen y traducción que utilizan un codificador-descodificador
+
+
+
+Antes de continuar, es bueno tener un conocimiento básico de la arquitectura original del Transformer. Saber cómo funcionan los codificadores, decodificadores y la atención te ayudará a entender cómo funcionan los diferentes modelos de Transformer. Si estás empezando o necesitas repasar, ¡echa un vistazo a nuestro [curso](https://huggingface.co/course/chapter1/4?fw=pt) para obtener más información!
+
+
+
+## Habla y audio
+
+[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2) es un modelo auto-supervisado preentrenado en datos de habla no etiquetados y ajustado en datos etiquetados para clasificación de audio y reconocimiento automático de voz.
+
+
+
+
+
+Este modelo tiene cuatro componentes principales:
+
+1. Un *codificador de características* toma la forma de onda de audio cruda, la normaliza a media cero y varianza unitaria, y la convierte en una secuencia de vectores de características, cada uno de 20 ms de duración.
+
+2. Las formas de onda son continuas por naturaleza, por lo que no se pueden dividir en unidades separadas como una secuencia de texto se puede dividir en palabras. Por eso, los vectores de características se pasan a un *módulo de cuantificación*, que tiene como objetivo aprender unidades de habla discretas. La unidad de habla se elige de una colección de palabras de código, conocidas como *codebook* (puedes pensar en esto como el vocabulario). Del codebook, se elige el vector o unidad de habla que mejor representa la entrada de audio continua y se envía a través del modelo.
+
+3. Alrededor de la mitad de los vectores de características se enmascaran aleatoriamente, y el vector de características enmascarado se alimenta a una *red de contexto*, que es un codificador Transformer que también agrega incrustaciones posicionales relativas.
+
+4. El objetivo del preentrenamiento de la red de contexto es una *tarea contrastiva*. El modelo tiene que predecir la verdadera representación de habla cuantizada de la predicción enmascarada a partir de un conjunto de falsas, lo que anima al modelo a encontrar el vector de contexto y la unidad de habla cuantizada más similares (la etiqueta objetivo).
+
+¡Ahora que wav2vec2 está preentrenado, puedes ajustarlo con tus datos para clasificación de audio o reconocimiento automático de voz!
+
+### Clasificación de audio
+
+Para usar el modelo preentrenado para la clasificación de audio, añade una capa de clasificación de secuencia encima del modelo base de Wav2Vec2. La capa de clasificación es una capa lineal que acepta los estados ocultos del codificador. Los estados ocultos representan las características aprendidas de cada fotograma de audio, que pueden tener longitudes variables. Para crear un vector de longitud fija, primero se agrupan los estados ocultos y luego se transforman en logits sobre las etiquetas de clase. La pérdida de entropía cruzada se calcula entre los logits y el objetivo para encontrar la clase más probable.
+
+¿Listo para probar la clasificación de audio? ¡Consulta nuestra guía completa de [clasificación de audio](https://huggingface.co/docs/transformers/tasks/audio_classification) para aprender cómo ajustar Wav2Vec2 y usarlo para inferencia!
+
+### Reconocimiento automático de voz
+
+Para usar el modelo preentrenado para el reconocimiento automático de voz, añade una capa de modelado del lenguaje encima del modelo base de Wav2Vec2 para [CTC (clasificación temporal conexionista)](glossary#connectionist-temporal-classification-ctc). La capa de modelado del lenguaje es una capa lineal que acepta los estados ocultos del codificador y los transforma en logits. Cada logit representa una clase de token (el número de tokens proviene del vocabulario de la tarea). La pérdida de CTC se calcula entre los logits y los objetivos para encontrar la secuencia de tokens más probable, que luego se decodifica en una transcripción.
+
+¿Listo para probar el reconocimiento automático de voz? ¡Consulta nuestra guía completa de [reconocimiento automático de voz](tasks/asr) para aprender cómo ajustar Wav2Vec2 y usarlo para inferencia!
+
+## Visión por computadora
+
+Hay dos formas de abordar las tareas de visión por computadora:
+
+1. Dividir una imagen en una secuencia de parches y procesarlos en paralelo con un Transformer.
+2. Utilizar una CNN moderna, como [ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext), que se basa en capas convolucionales pero adopta diseños de redes modernas.
+
+
+
+Un tercer enfoque combina Transformers con convoluciones (por ejemplo, [Convolutional Vision Transformer](https://huggingface.co/docs/transformers/model_doc/cvt) o [LeViT](https://huggingface.co/docs/transformers/model_doc/levit)). No discutiremos estos porque simplemente combinan los dos enfoques que examinamos aquí.
+
+
+
+ViT y ConvNeXT se utilizan comúnmente para la clasificación de imágenes, pero para otras tareas de visión como la detección de objetos, la segmentación y la estimación de profundidad, veremos DETR, Mask2Former y GLPN, respectivamente; estos modelos son más adecuados para esas tareas.
+
+### Clasificación de imágenes
+
+ViT y ConvNeXT pueden usarse ambos para la clasificación de imágenes; la diferencia principal es que ViT utiliza un mecanismo de atención mientras que ConvNeXT utiliza convoluciones.
+
+#### Transformer
+
+[ViT](https://huggingface.co/docs/transformers/model_doc/vit) reemplaza completamente las convoluciones con una arquitectura de Transformer pura. Si estás familiarizado con el Transformer original, entonces ya estás en el camino para entender ViT.
+
+
+
+
+
+El cambio principal que introdujo ViT fue en cómo se alimentan las imágenes a un Transformer:
+
+1. Una imagen se divide en parches cuadrados no superpuestos, cada uno de los cuales se convierte en un vector o *incrustación de parche*(patch embedding). Las incrustaciones de parche se generan a partir de una capa convolucional 2D que crea las dimensiones de entrada adecuadas (que para un Transformer base son 768 valores para cada incrustación de parche). Si tuvieras una imagen de 224x224 píxeles, podrías dividirla en 196 parches de imagen de 16x16. Al igual que el texto se tokeniza en palabras, una imagen se "tokeniza" en una secuencia de parches.
+
+2. Se agrega una *incrustación aprendida* - un token especial `[CLS]` - al principio de las incrustaciones del parche, al igual que en BERT. El estado oculto final del token `[CLS]` se utiliza como la entrada para la cabecera de clasificación adjunta; otras salidas se ignoran. Este token ayuda al modelo a aprender cómo codificar una representación de la imagen.
+
+3. Lo último que se agrega a las incrustaciones de parche e incrustaciones aprendidas son las *incrustaciones de posición* porque el modelo no sabe cómo están ordenados los parches de imagen. Las incrustaciones de posición también son aprendibles y tienen el mismo tamaño que las incrustaciones de parche. Finalmente, todas las incrustaciones se pasan al codificador Transformer.
+
+4. La salida, específicamente solo la salida con el token `[CLS]`, se pasa a una cabecera de perceptrón multicapa (MLP). El objetivo del preentrenamiento de ViT es simplemente la clasificación. Al igual que otras cabeceras de clasificación, la cabecera de MLP convierte la salida en logits sobre las etiquetas de clase y calcula la pérdida de entropía cruzada para encontrar la clase más probable.
+
+¿Listo para probar la clasificación de imágenes? ¡Consulta nuestra guía completa de [clasificación de imágenes](tasks/image_classification) para aprender cómo ajustar ViT y usarlo para inferencia!
+
+#### CNN
+
+
+
+Esta sección explica brevemente las convoluciones, pero sería útil tener un entendimiento previo de cómo cambian la forma y el tamaño de una imagen. Si no estás familiarizado con las convoluciones, ¡echa un vistazo al [capítulo de Redes Neuronales Convolucionales](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb) del libro fastai!
+
+
+
+[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext) es una arquitectura de CNN que adopta diseños de redes nuevas y modernas para mejorar el rendimiento. Sin embargo, las convoluciones siguen siendo el núcleo del modelo. Desde una perspectiva de alto nivel, una [convolución](glossary#convolution) es una operación donde una matriz más pequeña (*kernel*) se multiplica por una pequeña ventana de píxeles de la imagen. Esta calcula algunas características de ella, como una textura particular o la curvatura de una línea. Luego, se desliza hacia la siguiente ventana de píxeles; la distancia que recorre la convolución se conoce como el *stride*.
+
+
+
+
+
+Una convolución básica sin relleno ni paso, tomada de Una guía para la aritmética de convoluciones para el aprendizaje profundo.
+
+Puedes alimentar esta salida a otra capa convolucional, y con cada capa sucesiva, la red aprende cosas más complejas y abstractas como perros calientes o cohetes. Entre capas convolucionales, es común añadir una capa de agrupación para reducir la dimensionalidad y hacer que el modelo sea más robusto a las variaciones de la posición de una característica.
+
+
+
+
+
+ConvNeXT moderniza una CNN de cinco maneras:
+
+1. Cambia el número de bloques en cada etapa y "fragmenta" una imagen con un paso y tamaño de kernel más grandes. La ventana deslizante no superpuesta hace que esta estrategia de fragmentación sea similar a cómo ViT divide una imagen en parches.
+
+2. Una capa de *cuello de botella* reduce el número de canales y luego lo restaura porque es más rápido hacer una convolución de 1x1, y se puede aumentar la profundidad. Un cuello de botella invertido hace lo contrario al expandir el número de canales y luego reducirlos, lo cual es más eficiente en memoria.
+
+3. Reemplaza la típica capa convolucional de 3x3 en la capa de cuello de botella con una convolución *depthwise*, que aplica una convolución a cada canal de entrada por separado y luego los apila de nuevo al final. Esto ensancha el ancho de la red para mejorar el rendimiento.
+
+4. ViT tiene un campo receptivo global, lo que significa que puede ver más de una imagen a la vez gracias a su mecanismo de atención. ConvNeXT intenta replicar este efecto aumentando el tamaño del kernel a 7x7.
+
+5. ConvNeXT también hace varios cambios en el diseño de capas que imitan a los modelos Transformer. Hay menos capas de activación y normalización, la función de activación se cambia a GELU en lugar de ReLU, y utiliza LayerNorm en lugar de BatchNorm.
+
+La salida de los bloques convolucionales se pasa a una cabecera de clasificación que convierte las salidas en logits y calcula la pérdida de entropía cruzada para encontrar la etiqueta más probable.
+
+### Object detection
+
+[DETR](https://huggingface.co/docs/transformers/model_doc/detr), *DEtection TRansformer*, es un modelo de detección de objetos de un extremo a otro que combina una CNN con un codificador-decodificador Transformer.
+
+
+
+
+
+1. Una CNN preentrenada *backbone* toma una imagen, representada por sus valores de píxeles, y crea un mapa de características de baja resolución de la misma. A continuación, se aplica una convolución 1x1 al mapa de características para reducir la dimensionalidad y se crea un nuevo mapa de características con una representación de imagen de alto nivel. Dado que el Transformer es un modelo secuencial, el mapa de características se aplana en una secuencia de vectores de características que se combinan con incrustaciones posicionales.
+
+2. Los vectores de características se pasan al codificador, que aprende las representaciones de imagen usando sus capas de atención. A continuación, los estados ocultos del codificador se combinan con *consultas de objeto* en el decodificador. Las consultas de objeto son incrustaciones aprendidas que se enfocan en las diferentes regiones de una imagen, y se actualizan a medida que avanzan a través de cada capa de atención. Los estados ocultos del decodificador se pasan a una red feedforward que predice las coordenadas del cuadro delimitador y la etiqueta de clase para cada consulta de objeto, o `no objeto` si no hay ninguno.
+
+ DETR descodifica cada consulta de objeto en paralelo para producir *N* predicciones finales, donde *N* es el número de consultas. A diferencia de un modelo autoregresivo típico que predice un elemento a la vez, la detección de objetos es una tarea de predicción de conjuntos (`cuadro delimitador`, `etiqueta de clase`) que hace *N* predicciones en un solo paso.
+
+3. DETR utiliza una **pérdida de coincidencia bipartita** durante el entrenamiento para comparar un número fijo de predicciones con un conjunto fijo de etiquetas de verdad básica. Si hay menos etiquetas de verdad básica en el conjunto de *N* etiquetas, entonces se rellenan con una clase `no objeto`. Esta función de pérdida fomenta que DETR encuentre una asignación uno a uno entre las predicciones y las etiquetas de verdad básica. Si los cuadros delimitadores o las etiquetas de clase no son correctos, se incurre en una pérdida. Del mismo modo, si DETR predice un objeto que no existe, se penaliza. Esto fomenta que DETR encuentre otros objetos en una imagen en lugar de centrarse en un objeto realmente prominente.
+
+Se añade una cabecera de detección de objetos encima de DETR para encontrar la etiqueta de clase y las coordenadas del cuadro delimitador. Hay dos componentes en la cabecera de detección de objetos: una capa lineal para transformar los estados ocultos del decodificador en logits sobre las etiquetas de clase, y una MLP para predecir el cuadro delimitador.
+
+¿Listo para probar la detección de objetos? ¡Consulta nuestra guía completa de [detección de objetos](https://huggingface.co/docs/transformers/tasks/object_detection) para aprender cómo ajustar DETR y usarlo para inferencia!
+
+### Segmentación de imágenes
+
+[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former) es una arquitectura universal para resolver todos los tipos de tareas de segmentación de imágenes. Los modelos de segmentación tradicionales suelen estar adaptados a una tarea particular de segmentación de imágenes, como la segmentación de instancias, semántica o panóptica. Mask2Former enmarca cada una de esas tareas como un problema de *clasificación de máscaras*. La clasificación de máscaras agrupa píxeles en *N* segmentos, y predice *N* máscaras y su etiqueta de clase correspondiente para una imagen dada. Explicaremos cómo funciona Mask2Former en esta sección, y luego podrás probar el ajuste fino de SegFormer al final.
+
+
+
+
+
+Hay tres componentes principales en Mask2Former:
+
+1. Un [backbone Swin](https://huggingface.co/docs/transformers/model_doc/swin) acepta una imagen y crea un mapa de características de imagen de baja resolución a partir de 3 convoluciones consecutivas de 3x3.
+
+2. El mapa de características se pasa a un *decodificador de píxeles* que aumenta gradualmente las características de baja resolución en incrustaciones de alta resolución por píxel. De hecho, el decodificador de píxeles genera características multiescala (contiene características de baja y alta resolución) con resoluciones de 1/32, 1/16 y 1/8 de la imagen original.
+
+3. Cada uno de estos mapas de características de diferentes escalas se alimenta sucesivamente a una capa decodificadora Transformer a la vez para capturar objetos pequeños de las características de alta resolución. La clave de Mask2Former es el mecanismo de *atención enmascarada* en el decodificador. A diferencia de la atención cruzada que puede atender a toda la imagen, la atención enmascarada solo se centra en cierta área de la imagen. Esto es más rápido y conduce a un mejor rendimiento porque las características locales de una imagen son suficientes para que el modelo aprenda.
+
+4. Al igual que [DETR](tasks_explained#object-detection), Mask2Former también utiliza consultas de objetos aprendidas y las combina con las características de la imagen del decodificador de píxeles para hacer una predicción de conjunto (`etiqueta de clase`, `predicción de máscara`). Los estados ocultos del decodificador se pasan a una capa lineal y se transforman en logits sobre las etiquetas de clase. Se calcula la pérdida de entropía cruzada entre los logits y la etiqueta de clase para encontrar la más probable.
+
+ Las predicciones de máscara se generan combinando las incrustaciones de píxeles con los estados ocultos finales del decodificador. La pérdida de entropía cruzada sigmoidea y la pérdida DICE se calculan entre los logits y la máscara de verdad básica para encontrar la máscara más probable.
+
+¿Listo para probar la segmentación de imágenes? ¡Consulta nuestra guía completa de [segmentación de imágenes](https://huggingface.co/docs/transformers/tasks/semantic_segmentation) para aprender cómo ajustar SegFormer y usarlo para inferencia!
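+
+Como referencia rápida, un esquema mínimo con el pipeline de `image-segmentation` (asumiendo el checkpoint público `facebook/mask2former-swin-large-coco-panoptic`):
+
+```python
+from transformers import pipeline
+
+# Esquema mínimo: se asume el checkpoint facebook/mask2former-swin-large-coco-panoptic
+segmentador = pipeline("image-segmentation", model="facebook/mask2former-swin-large-coco-panoptic")
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
+
+# Cada resultado incluye la etiqueta de clase, la confianza y una máscara binaria (imagen PIL)
+for segmento in segmentador(url):
+    print(segmento["label"], segmento["score"])
+```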
+
+### Estimación de profundidad
+
+[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn), *Global-Local Path Network*, es un Transformer para la estimación de profundidad que combina un codificador [SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer) con un decodificador ligero.
+
+
+
+
+
+1. Al igual que ViT, una imagen se divide en una secuencia de parches, excepto que estos parches de imagen son más pequeños. Esto es mejor para tareas de predicción densa como la segmentación o la estimación de profundidad. Los parches de imagen se transforman en incrustaciones de parches (ver la sección de [clasificación de imágenes](#clasificación-de-imágenes) para más detalles sobre cómo se crean las incrustaciones de parches), que se alimentan al codificador.
+
+2. El codificador acepta las incrustaciones de parches y las pasa a través de varios bloques codificadores. Cada bloque consiste en capas de atención y Mix-FFN. El propósito de este último es proporcionar información posicional. Al final de cada bloque codificador hay una capa de *fusión de parches* para crear representaciones jerárquicas. Las características de cada grupo de parches vecinos se concatenan, y se aplica una capa lineal a las características concatenadas para reducir el número de parches a una resolución de 1/4. Esto se convierte en la entrada al siguiente bloque codificador, donde se repite todo este proceso hasta que tengas características de imagen con resoluciones de 1/8, 1/16 y 1/32.
+
+3. Un decodificador ligero toma el último mapa de características (escala 1/32) del codificador y lo aumenta a una escala de 1/16. A partir de aquí, la característica se pasa a un módulo de *Fusión Selectiva de Características (SFF)*, que selecciona y combina características locales y globales de un mapa de atención para cada característica y luego la aumenta a 1/8. Este proceso se repite hasta que las características decodificadas sean del mismo tamaño que la imagen original. La salida se pasa a través de dos capas de convolución y luego se aplica una activación sigmoide para predecir la profundidad de cada píxel.
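+
+Para probar el flujo completo, un esquema mínimo con el pipeline de `depth-estimation` (asumiendo un checkpoint público de GLPN como `vinvino02/glpn-nyu`):
+
+```python
+from transformers import pipeline
+
+# Esquema mínimo: se asume el checkpoint vinvino02/glpn-nyu
+estimador = pipeline("depth-estimation", model="vinvino02/glpn-nyu")
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
+
+salida = estimador(url)
+print(salida["depth"])                   # imagen PIL con la profundidad estimada por píxel
+print(salida["predicted_depth"].shape)   # tensor con los valores de profundidad
+```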
+
+## Procesamiento del lenguaje natural
+
+El Transformer fue diseñado inicialmente para la traducción automática, y desde entonces, prácticamente se ha convertido en la arquitectura predeterminada para resolver todas las tareas de procesamiento del lenguaje natural (NLP, por sus siglas en inglés). Algunas tareas se prestan a la estructura del codificador del Transformer, mientras que otras son más adecuadas para el decodificador. Todavía hay otras tareas que hacen uso de la estructura codificador-decodificador del Transformer.
+
+### Clasificación de texto
+
+[BERT](https://huggingface.co/docs/transformers/model_doc/bert) es un modelo que solo tiene codificador y es el primer modelo en implementar efectivamente la bidireccionalidad profunda para aprender representaciones más ricas del texto al atender a las palabras en ambos lados.
+
+1. BERT utiliza la tokenización [WordPiece](https://huggingface.co/docs/transformers/tokenizer_summary#wordpiece) para generar una incrustación de tokens del texto. Para diferenciar entre una sola oración y un par de oraciones, se agrega un token especial `[SEP]` entre ellas. También se agrega un token especial `[CLS]` al principio de cada secuencia de texto. La salida final con el token `[CLS]` se utiliza como entrada a la cabecera de clasificación para tareas de clasificación. BERT también agrega una incrustación de segmento para indicar si un token pertenece a la primera o a la segunda oración de un par.
+
+2. BERT se preentrena con dos objetivos: el modelado de lenguaje enmascarado y la predicción de la próxima oración. En el modelado de lenguaje enmascarado, un cierto porcentaje de los tokens de entrada se enmascara aleatoriamente y el modelo debe predecirlos. Esto resuelve el problema de la bidireccionalidad, donde el modelo podría hacer trampa viendo todas las palabras y "predecir" así la siguiente palabra. Los estados ocultos finales de los tokens enmascarados se pasan a una red feedforward con una softmax sobre el vocabulario para predecir la palabra enmascarada.
+
+ El segundo objetivo de preentrenamiento es la predicción de la próxima oración. El modelo debe predecir si la oración B sigue a la oración A. La mitad de las veces la oración B es la oración siguiente, y la otra mitad es una oración aleatoria. La predicción (si es o no la próxima oración) se pasa a una red feedforward con una softmax sobre las dos clases (`EsSiguiente` y `NoSiguiente`).
+
+3. Las incrustaciones de entrada se pasan a través de múltiples capas codificadoras para producir algunos estados ocultos finales.
+
+Para usar el modelo preentrenado para clasificación de texto, se añade una cabecera de clasificación de secuencia encima del modelo base de BERT. La cabecera de clasificación de secuencia es una capa lineal que acepta los estados ocultos finales y realiza una transformación lineal para convertirlos en logits. Se calcula la pérdida de entropía cruzada entre los logits y el objetivo para encontrar la etiqueta más probable.
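+
+A modo de esquema mínimo (asumiendo el checkpoint ya ajustado `distilbert/distilbert-base-uncased-finetuned-sst-2-english`), así se obtiene la etiqueta más probable a partir de los logits:
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+# Esquema mínimo: se asume un checkpoint ya ajustado para clasificación de sentimientos
+checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
+
+inputs = tokenizer("I love using Transformers!", return_tensors="pt")
+with torch.no_grad():
+    logits = model(**inputs).logits  # la cabecera lineal convierte los estados ocultos en logits
+print(model.config.id2label[logits.argmax(dim=-1).item()])
+```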
+
+¿Listo para probar la clasificación de texto? ¡Consulta nuestra guía completa de [clasificación de texto](https://huggingface.co/docs/transformers/tasks/sequence_classification) para aprender cómo ajustar DistilBERT y usarlo para inferencia!
+
+### Clasificación de tokens
+
+Para usar BERT en tareas de clasificación de tokens como el reconocimiento de entidades nombradas (NER), añade una cabecera de clasificación de tokens encima del modelo base de BERT. La cabecera de clasificación de tokens es una capa lineal que acepta los estados ocultos finales y realiza una transformación lineal para convertirlos en logits. Se calcula la pérdida de entropía cruzada entre los logits y cada token para encontrar la etiqueta más probable.
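+
+Un esquema mínimo con el pipeline de `token-classification` (asumiendo el checkpoint público `dslim/bert-base-NER`):
+
+```python
+from transformers import pipeline
+
+# Esquema mínimo: se asume el checkpoint dslim/bert-base-NER
+ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")
+
+# Cada entidad incluye la etiqueta, la confianza y la posición en el texto
+print(ner("My name is Sarah and I live in London"))
+```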
+
+¿Listo para probar la clasificación de tokens? ¡Consulta nuestra guía completa de [clasificación de tokens](https://huggingface.co/docs/transformers/tasks/token_classification) para aprender cómo ajustar DistilBERT y usarlo para inferencia!
+
+### Respuesta a preguntas
+
+Para usar BERT en la respuesta a preguntas, añade una cabecera de clasificación de span encima del modelo base de BERT. Esta capa lineal acepta los estados ocultos finales y realiza una transformación lineal para calcular los logits de inicio y fin del `span` correspondiente a la respuesta. Se calcula la pérdida de entropía cruzada entre los logits y la posición de la etiqueta para encontrar el span más probable de texto correspondiente a la respuesta.
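+
+Como esquema mínimo (asumiendo el checkpoint `distilbert/distilbert-base-uncased-distilled-squad`, ajustado en SQuAD), el pipeline de `question-answering` devuelve el span más probable:
+
+```python
+from transformers import pipeline
+
+# Esquema mínimo: se asume un checkpoint ajustado en SQuAD
+qa = pipeline("question-answering", model="distilbert/distilbert-base-uncased-distilled-squad")
+
+respuesta = qa(
+    question="Where does Sarah live?",
+    context="My name is Sarah and I live in London.",
+)
+print(respuesta["answer"], respuesta["score"])
+```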
+
+¿Listo para probar la respuesta a preguntas? ¡Consulta nuestra guía completa de [respuesta a preguntas](tasks/question_answering) para aprender cómo ajustar DistilBERT y usarlo para inferencia!
+
+
+
+💡 ¡Observa lo fácil que es usar BERT para diferentes tareas una vez que ha sido preentrenado! ¡Solo necesitas añadir una cabecera específica al modelo preentrenado para transformar los estados ocultos en la salida deseada!
+
+
+
+### Generación de texto
+
+[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2) es un modelo que solo tiene decodificador y se preentrena en una gran cantidad de texto. Puede generar texto convincente (¡aunque no siempre verdadero!) dado un estímulo y completar otras tareas de procesamiento del lenguaje natural como responder preguntas, a pesar de no haber sido entrenado explícitamente para ello.
+
+
+
+
+
+1. GPT-2 utiliza [codificación de pares de bytes (BPE)](https://huggingface.co/docs/transformers/tokenizer_summary#bytepair-encoding-bpe) para tokenizar palabras y generar una incrustación de token. Se añaden incrustaciones posicionales a las incrustaciones de token para indicar la posición de cada token en la secuencia. Las incrustaciones de entrada se pasan a través de varios bloques decodificadores para producir algún estado oculto final. Dentro de cada bloque decodificador, GPT-2 utiliza una capa de *autoatención enmascarada*, lo que significa que GPT-2 no puede atender a los tokens futuros. Solo puede atender a los tokens a la izquierda. Esto es diferente al token [`mask`] de BERT porque, en la autoatención enmascarada, se utiliza una máscara de atención para establecer la puntuación en `0` para los tokens futuros.
+
+2. La salida del decodificador se pasa a una cabecera de modelado de lenguaje, que realiza una transformación lineal para convertir los estados ocultos en logits. La etiqueta es el siguiente token en la secuencia, que se crea desplazando los logits a la derecha en uno. Se calcula la pérdida de entropía cruzada entre los logits desplazados y las etiquetas para obtener el siguiente token más probable.
+
+El objetivo del preentrenamiento de GPT-2 se basa completamente en el [modelado de lenguaje causal](glossary#causal-language-modeling), prediciendo la siguiente palabra en una secuencia. Esto hace que GPT-2 sea especialmente bueno en tareas que implican la generación de texto.
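+
+Un esquema mínimo de generación con GPT-2 (asumiendo el checkpoint `openai-community/gpt2`):
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+# Esquema mínimo: se asume el checkpoint openai-community/gpt2
+tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+
+inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
+# En cada paso, el modelo predice el siguiente token más probable
+outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```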
+
+¿Listo para probar la generación de texto? ¡Consulta nuestra guía completa de [modelado de lenguaje causal](tasks/language_modeling#modelado-de-lenguaje-causal) para aprender cómo ajustar DistilGPT-2 y usarlo para inferencia!
+
+
+
+Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)!
+
+
+
+### Generación de resúmenes
+
+Los modelos codificador-decodificador como [BART](https://huggingface.co/docs/transformers/model_doc/bart) y [T5](https://huggingface.co/docs/transformers/model_doc/t5) están diseñados para el patrón de secuencia a secuencia de una tarea de resumen. Explicaremos cómo funciona BART en esta sección, y luego podrás probar el ajuste fino de T5 al final.
+
+
+
+
+
+1. La arquitectura del codificador de BART es muy similar a la de BERT y acepta una incrustación de token y posicional del texto. BART se preentrena corrompiendo la entrada y luego reconstruyéndola con el decodificador. A diferencia de otros codificadores con estrategias específicas de corrupción, BART puede aplicar cualquier tipo de corrupción. Sin embargo, la estrategia de corrupción de *relleno de texto* funciona mejor. En el relleno de texto, varios fragmentos de texto se reemplazan con un **único** token [`mask`]. Esto es importante porque el modelo tiene que predecir los tokens enmascarados, y le enseña al modelo a predecir la cantidad de tokens faltantes. Las incrustaciones de entrada y los fragmentos enmascarados se pasan a través del codificador para producir algunos estados ocultos finales, pero a diferencia de BERT, BART no añade una red feedforward final al final para predecir una palabra.
+
+2. La salida del codificador se pasa al decodificador, que debe predecir los tokens enmascarados y cualquier token no corrompido de la salida del codificador. Esto proporciona un contexto adicional para ayudar al decodificador a restaurar el texto original. La salida del decodificador se pasa a una cabeza de modelado de lenguaje, que realiza una transformación lineal para convertir los estados ocultos en logits. Se calcula la pérdida de entropía cruzada entre los logits y la etiqueta, que es simplemente el token desplazado hacia la derecha.
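+
+A modo de esquema (asumiendo el checkpoint `facebook/bart-large-cnn`), así se genera un resumen con el pipeline de `summarization`:
+
+```python
+from transformers import pipeline
+
+# Esquema mínimo: se asume el checkpoint facebook/bart-large-cnn
+resumidor = pipeline("summarization", model="facebook/bart-large-cnn")
+
+texto = (
+    "The Eiffel Tower is 324 metres tall, about the same height as an 81-storey building, "
+    "and was the tallest man-made structure in the world for 41 years."
+)
+print(resumidor(texto, max_length=30, min_length=5)[0]["summary_text"])
+```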
+
+¿Listo para probar la generación de resúmenes? ¡Consulta nuestra guía completa de [Generación de resúmenes](tasks/summarization) para aprender cómo ajustar T5 y usarlo para inferencia!
+
+
+
+Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)!
+
+
+
+### Traducción
+
+La traducción es otro ejemplo de una tarea de secuencia a secuencia, lo que significa que puedes usar un modelo codificador-decodificador como [BART](https://huggingface.co/docs/transformers/model_doc/bart) o [T5](https://huggingface.co/docs/transformers/model_doc/t5) para hacerlo. Explicaremos cómo funciona BART en esta sección, y luego podrás probar el ajuste fino de T5 al final.
+
+BART se adapta a la traducción añadiendo un codificador separado inicializado aleatoriamente para mapear un idioma fuente a una entrada que pueda ser decodificada en el idioma objetivo. Las incrustaciones de este nuevo codificador se pasan al codificador preentrenado en lugar de las incrustaciones de palabras originales. El codificador de origen se entrena actualizando el codificador de origen, las incrustaciones posicionales y las incrustaciones de entrada con la pérdida de entropía cruzada de la salida del modelo. Los parámetros del modelo están congelados en este primer paso, y todos los parámetros del modelo se entrenan juntos en el segundo paso.
+
+Desde entonces, BART ha sido seguido por una versión multilingüe, mBART, destinada a la traducción y preentrenada en muchos idiomas diferentes.
+
+¿Listo para probar la traducción? ¡Consulta nuestra guía completa de [traducción](https://huggingface.co/docs/transformers/tasks/translation) para aprender cómo ajustar T5 y usarlo para inferencia!
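+
+Como esquema mínimo (asumiendo el checkpoint `google-t5/t5-base` y su prefijo de tarea), así se traduce una oración con T5:
+
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+# Esquema mínimo: se asume el checkpoint google-t5/t5-base
+tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
+model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")
+
+# T5 usa un prefijo de tarea para indicar el par de idiomas
+inputs = tokenizer("translate English to French: My name is Wolfgang and I live in Berlin", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=40)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```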
+
+
+
+Para obtener más información sobre la generación de texto, ¡consulta la guía de [estrategias de generación de texto](https://huggingface.co/docs/transformers/generation_strategies)!
+
+
\ No newline at end of file
From b43340455dc59c67cdb25f08a23cfd650b4da7e7 Mon Sep 17 00:00:00 2001
From: Michael
Date: Tue, 27 Feb 2024 00:27:47 +0800
Subject: [PATCH 017/549] [i18n-zh] Translated torchscript.md into Chinese
(#29234)
Signed-off-by: windsonsea
---
docs/source/zh/_toctree.yml | 2 +
docs/source/zh/torchscript.md | 197 ++++++++++++++++++++++++++++++++++
2 files changed, 199 insertions(+)
create mode 100644 docs/source/zh/torchscript.md
diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml
index dd3eb7c3afc121..44db5f815a34af 100644
--- a/docs/source/zh/_toctree.yml
+++ b/docs/source/zh/_toctree.yml
@@ -41,6 +41,8 @@
title: 导出为 ONNX
- local: tflite
title: 导出为 TFLite
+ - local: torchscript
+ title: 导出为 TorchScript
title: 开发者指南
- sections:
- local: performance
diff --git a/docs/source/zh/torchscript.md b/docs/source/zh/torchscript.md
new file mode 100644
index 00000000000000..d3106c5241808f
--- /dev/null
+++ b/docs/source/zh/torchscript.md
@@ -0,0 +1,197 @@
+
+
+# 导出为 TorchScript
+
+
+
+这是开始使用 TorchScript 进行实验的起点,我们仍在探索其在可变输入大小模型中的能力。
+这是我们关注的焦点,我们将在即将发布的版本中深入分析,提供更多的代码示例、更灵活的实现,
+以及比较 Python 代码与编译后的 TorchScript 的性能基准。
+
+
+
+根据 [TorchScript 文档](https://pytorch.org/docs/stable/jit.html):
+
+> TorchScript 是从 PyTorch 代码创建可序列化和可优化的模型的一种方式。
+
+有两个 PyTorch 模块:[JIT 和 TRACE](https://pytorch.org/docs/stable/jit.html)。
+这两个模块允许开发人员将其模型导出到其他程序中重用,比如面向效率的 C++ 程序。
+
+我们提供了一个接口,允许您将 🤗 Transformers 模型导出为 TorchScript,
+以便在与基于 PyTorch 的 Python 程序不同的环境中重用。
+本文解释如何使用 TorchScript 导出并使用我们的模型。
+
+导出模型需要两个步骤:
+
+- 使用 `torchscript` 参数实例化模型
+- 使用虚拟输入进行前向传递
+
+这些必要条件意味着开发人员应该注意以下详细信息。
+
+## TorchScript 参数和绑定权重
+
+`torchscript` 参数是必需的,因为大多数 🤗 Transformers 语言模型的 `Embedding` 层和
+`Decoding` 层之间有绑定权重。TorchScript 不允许导出具有绑定权重的模型,因此必须事先解绑和克隆权重。
+
+使用 `torchscript` 参数实例化的模型将其 `Embedding` 层和 `Decoding` 层分开,
+这意味着它们不应该在后续进行训练。训练将导致这两层不同步,产生意外结果。
+
+对于没有语言模型头部的模型,情况不同,因为这些模型没有绑定权重。
+这些模型可以安全地导出而无需 `torchscript` 参数。
+
+## 虚拟输入和标准长度
+
+虚拟输入用于模型的前向传递。当输入的值传播到各层时,PyTorch 会跟踪在每个张量上执行的不同操作。
+然后使用记录的操作来创建模型的 *trace*。
+
+跟踪是相对于输入的维度创建的。因此,它受到虚拟输入的维度限制,对于任何其他序列长度或批量大小都不起作用。
+当尝试使用不同大小时,会引发以下错误:
+
+```text
+`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`
+```
+
+我们建议使用至少与推断期间将馈送到模型的最大输入一样大的虚拟输入大小进行跟踪。
+填充可以帮助填补缺失的值。然而,由于模型是使用更大的输入大小进行跟踪的,矩阵的维度也会很大,导致更多的计算。
+
+请仔细考虑在每个输入上执行的操作总数,并在导出可变序列长度的模型时密切关注性能。
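+
+下面是一个最小示意(假设使用前文的 `google-bert/bert-base-uncased` 分词器,并假设推断时的最大长度为 512),演示如何把虚拟输入填充到固定长度:
+
+```python
+from transformers import AutoTokenizer
+
+# 最小示意:把虚拟输入填充到推断时可能出现的最大长度(此处假设为 512)
+tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+dummy = tokenizer(
+    "Hello world!",
+    padding="max_length",  # 填充到固定长度
+    max_length=512,
+    return_tensors="pt",
+)
+print(dummy["input_ids"].shape)  # torch.Size([1, 512])
+```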
+
+## 在 Python 中使用 TorchScript
+
+本节演示了如何保存和加载模型以及如何使用 trace 进行推断。
+
+### 保存模型
+
+要使用 TorchScript 导出 `BertModel`,请从 `BertConfig` 类实例化 `BertModel`,
+然后将其保存到名为 `traced_bert.pt` 的磁盘文件中:
+
+```python
+from transformers import BertModel, BertTokenizer, BertConfig
+import torch
+
+enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
+
+# 对输入文本分词
+text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+tokenized_text = enc.tokenize(text)
+
+# 屏蔽一个输入 token
+masked_index = 8
+tokenized_text[masked_index] = "[MASK]"
+indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
+segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
+
+# 创建虚拟输入
+tokens_tensor = torch.tensor([indexed_tokens])
+segments_tensors = torch.tensor([segments_ids])
+dummy_input = [tokens_tensor, segments_tensors]
+
+# 使用 torchscript 参数初始化模型
+# 即使此模型没有 LM Head,也将参数设置为 True。
+config = BertConfig(
+ vocab_size_or_config_json_file=32000,
+ hidden_size=768,
+ num_hidden_layers=12,
+ num_attention_heads=12,
+ intermediate_size=3072,
+ torchscript=True,
+)
+
+# 实例化模型
+model = BertModel(config)
+
+# 模型需要处于评估模式
+model.eval()
+
+# 如果您使用 *from_pretrained* 实例化模型,还可以轻松设置 TorchScript 参数
+model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True)
+
+# 创建 trace
+traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
+torch.jit.save(traced_model, "traced_bert.pt")
+```
+
+### 加载模型
+
+现在,您可以从磁盘加载先前保存的 `BertModel`、`traced_bert.pt`,并在先前初始化的 `dummy_input` 上使用:
+
+```python
+loaded_model = torch.jit.load("traced_bert.pt")
+loaded_model.eval()
+
+all_encoder_layers, pooled_output = loaded_model(*dummy_input)
+```
+
+### 使用 trace 模型进行推断
+
+通过 trace 模型的 `__call__` dunder 方法进行推断:
+
+```python
+traced_model(tokens_tensor, segments_tensors)
+```
+
+## 使用 Neuron SDK 将 Hugging Face TorchScript 模型部署到 AWS
+
+AWS 引入了用于云端低成本、高性能机器学习推理的
+[Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) 实例系列。
+Inf1 实例由 AWS Inferentia 芯片提供支持,这是一款专为深度学习推理工作负载而构建的定制硬件加速器。
+[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) 是
+Inferentia 的 SDK,支持对 transformers 模型进行跟踪和优化,以便在 Inf1 上部署。Neuron SDK 提供:
+
+1. 简单易用的 API,只需更改一行代码即可为云端推理跟踪和优化 TorchScript 模型。
+2. 针对[改进的性能成本](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/)的即插即用性能优化。
+3. 支持使用 [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html)
+ 或 [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html)
+ 构建的 Hugging Face transformers 模型。
+
+### 影响
+
+基于 [BERT(来自 Transformers 的双向编码器表示)](https://huggingface.co/docs/transformers/main/model_doc/bert)架构的
+transformers 模型,或其变体,如 [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert)
+和 [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta),在 Inf1 上运行非生成式任务时表现最佳,
+例如抽取式问答、序列分类和标记分类。然而,文本生成任务仍可以适应在 Inf1 上运行,
+如这篇 [AWS Neuron MarianMT 教程](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html)所述。
+有关可以直接在 Inferentia 上转换的模型的更多信息,请参阅 Neuron 文档的[模型架构适配](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia)章节。
+
+### 依赖关系
+
+要将模型转换为 AWS Neuron 模型,需要一个
+[Neuron SDK 环境](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide),
+它已经预先配置在 [AWS 深度学习 AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html)上。
+
+### 将模型转换为 AWS Neuron
+
+使用与 [Python 中使用 TorchScript](torchscript#using-torchscript-in-python) 相同的代码来跟踪
+`BertModel` 以将模型转换为 AWS NEURON。导入 `torch.neuron` 框架扩展以通过 Python API 访问 Neuron SDK 的组件:
+
+```python
+from transformers import BertModel, BertTokenizer, BertConfig
+import torch
+import torch.neuron
+```
+
+您只需要修改下面这一行:
+
+```diff
+- torch.jit.trace(model, [tokens_tensor, segments_tensors])
++ torch.neuron.trace(model, [tokens_tensor, segments_tensors])
+```
+
+这样就能使 Neuron SDK 跟踪模型并对其进行优化,以在 Inf1 实例上运行。
+
+要了解有关 AWS Neuron SDK 功能、工具、示例教程和最新更新的更多信息,
+请参阅 [AWS NeuronSDK 文档](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html)。
From 734eb25476741d61773f622c1b1ed810e39927df Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ming=20Xu=20=28=E5=BE=90=E6=98=8E=29?=
Date: Tue, 27 Feb 2024 00:42:24 +0800
Subject: [PATCH 018/549] =?UTF-8?q?=F0=9F=8C=90=20[i18n-ZH]=20Translate=20?=
=?UTF-8?q?chat=5Ftemplating.md=20into=20Chinese=20(#28790)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* [Pix2struct] Simplify generation (#22527)
* Add model to doc tests
* Remove generate and replace by prepare_inputs_for_generation
* More fixes
* Remove print statements
* Update integration tests
* Fix generate
* Remove model from auto mapping
* Use auto processor
* Fix integration tests
* Fix test
* Add inference code snippet
* Remove is_encoder_decoder
* Update docs
* Remove notebook link
* Release: v4.28.0
* Revert (for now) the change on `Deta` in #22437 (#22750)
fix
Co-authored-by: ydshieh
* Patch release: v4.28.1
* update zh chat template.
* Update docs/source/zh/chat_templating.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/zh/_toctree.yml
Co-authored-by: Michael
* Update docs/source/zh/chat_templating.md
Co-authored-by: Michael
* Update docs/source/zh/chat_templating.md
Co-authored-by: Michael
* Update docs/source/zh/chat_templating.md
Co-authored-by: Michael
* Update docs/source/zh/chat_templating.md
Co-authored-by: Michael
* Update docs/source/zh/chat_templating.md
Co-authored-by: Michael
* Update docs/source/zh/chat_templating.md
Co-authored-by: Michael
---------
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Sylvain Gugger
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Michael
---
docs/source/en/model_doc/pix2struct.md | 2 +-
docs/source/zh/_toctree.yml | 2 +
docs/source/zh/chat_templating.md | 437 +++++++++++++++++++++++++
3 files changed, 440 insertions(+), 1 deletion(-)
create mode 100644 docs/source/zh/chat_templating.md
diff --git a/docs/source/en/model_doc/pix2struct.md b/docs/source/en/model_doc/pix2struct.md
index 8dc179f5f863c8..0c9baa18e02fc8 100644
--- a/docs/source/en/model_doc/pix2struct.md
+++ b/docs/source/en/model_doc/pix2struct.md
@@ -74,4 +74,4 @@ The original code can be found [here](https://github.com/google-research/pix2str
## Pix2StructForConditionalGeneration
[[autodoc]] Pix2StructForConditionalGeneration
- - forward
+ - forward
\ No newline at end of file
diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml
index 44db5f815a34af..a92074fde47571 100644
--- a/docs/source/zh/_toctree.yml
+++ b/docs/source/zh/_toctree.yml
@@ -37,6 +37,8 @@
title: 使用特定于模型的 API
- local: custom_models
title: 共享自定义模型
+ - local: chat_templating
+ title: 聊天模型的模板
- local: serialization
title: 导出为 ONNX
- local: tflite
diff --git a/docs/source/zh/chat_templating.md b/docs/source/zh/chat_templating.md
new file mode 100644
index 00000000000000..72764bc71c5fda
--- /dev/null
+++ b/docs/source/zh/chat_templating.md
@@ -0,0 +1,437 @@
+
+
+# 聊天模型的模板
+
+## 介绍
+
+LLM 的一个常见应用场景是聊天。在聊天上下文中,模型的输入不再是单个连续的文本字符串(标准语言模型的情形),
+而是由一条或多条消息组成的对话;每条消息都有一个“用户”或“助手”等 **角色**,还包括消息文本。
+
+与分词(tokenization)类似,不同的模型对聊天的输入格式要求也不同。这就是我们添加**聊天模板**功能的原因。
+聊天模板是`Tokenizer`的一部分,用来把对话内容转换为模型期望的输入`prompt`。
+
+
+让我们通过一个快速的示例来具体说明,使用`BlenderBot`模型。
+BlenderBot有一个非常简单的默认模板,主要是在对话轮之间添加空格:
+
+```python
+>>> from transformers import AutoTokenizer
+>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
+
+>>> chat = [
+... {"role": "user", "content": "Hello, how are you?"},
+... {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
+... {"role": "user", "content": "I'd like to show off how chat templating works!"},
+... ]
+
+>>> tokenizer.apply_chat_template(chat, tokenize=False)
+" Hello, how are you? I'm doing great. How can I help you today? I'd like to show off how chat templating works!"
+```
+
+注意,整个聊天对话内容被压缩成了一整个字符串。如果我们使用默认设置的`tokenize=True`,那么该字符串也会被分词处理。
+不过,为了看到更复杂的模板实际运行,让我们使用`mistralai/Mistral-7B-Instruct-v0.1`模型。
+
+```python
+>>> from transformers import AutoTokenizer
+>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
+
+>>> chat = [
+... {"role": "user", "content": "Hello, how are you?"},
+... {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
+... {"role": "user", "content": "I'd like to show off how chat templating works!"},
+... ]
+
+>>> tokenizer.apply_chat_template(chat, tokenize=False)
+"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
+```
+
+可以看到,这一次 tokenizer 已经添加了 `[INST]` 和 `[/INST]` 来表示用户消息的开始和结束。
+Mistral-instruct 在训练时使用了这些 token,而 BlenderBot 没有。
+
+## 我如何使用聊天模板?
+
+正如您在上面的示例中所看到的,聊天模板非常容易使用。只需构建一系列带有`role`和`content`键的消息,
+然后将其传递给[`~PreTrainedTokenizer.apply_chat_template`]方法。
+另外,在将聊天模板用作模型预测的输入时,还建议使用`add_generation_prompt=True`来添加[generation prompt](#什么是generation-prompts)。
+
+这是一个准备`model.generate()`的示例,使用`Zephyr`模型:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+checkpoint = "HuggingFaceH4/zephyr-7b-beta"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint) # You may want to use bfloat16 and/or move to GPU here
+
+messages = [
+ {
+ "role": "system",
+ "content": "You are a friendly chatbot who always responds in the style of a pirate",
+ },
+ {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+ ]
+tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+print(tokenizer.decode(tokenized_chat[0]))
+```
+这将生成Zephyr期望的输入格式的字符串。它看起来像这样:
+```text
+<|system|>
+You are a friendly chatbot who always responds in the style of a pirate
+<|user|>
+How many helicopters can a human eat in one sitting?
+<|assistant|>
+```
+
+现在我们已经按照`Zephyr`的要求传入prompt了,我们可以使用模型来生成对用户问题的回复:
+
+```python
+outputs = model.generate(tokenized_chat, max_new_tokens=128)
+print(tokenizer.decode(outputs[0]))
+```
+
+输出结果是:
+
+```text
+<|system|>
+You are a friendly chatbot who always responds in the style of a pirate
+<|user|>
+How many helicopters can a human eat in one sitting?
+<|assistant|>
+Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.
+```
+啊,原来这么容易!
+
+## 有自动化的聊天`pipeline`吗?
+
+有的,[`ConversationalPipeline`]。这个`pipeline`的设计是为了方便使用聊天模型。让我们再试一次 Zephyr 的例子,但这次使用`pipeline`:
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("conversational", "HuggingFaceH4/zephyr-7b-beta")
+messages = [
+ {
+ "role": "system",
+ "content": "You are a friendly chatbot who always responds in the style of a pirate",
+ },
+ {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+]
+print(pipe(messages))
+```
+
+```text
+Conversation id: 76d886a0-74bd-454e-9804-0467041a63dc
+system: You are a friendly chatbot who always responds in the style of a pirate
+user: How many helicopters can a human eat in one sitting?
+assistant: Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.
+```
+
+[`ConversationalPipeline`]将为您处理所有分词细节并调用`apply_chat_template`。一旦模型设置了聊天模板,您只需要初始化 pipeline 并传入消息列表即可!
+
+## 什么是"generation prompts"?
+
+您可能已经注意到`apply_chat_template`方法有一个`add_generation_prompt`参数。
+这个参数告诉模板添加模型开始答复的标记。例如,考虑以下对话:
+
+```python
+messages = [
+ {"role": "user", "content": "Hi there!"},
+ {"role": "assistant", "content": "Nice to meet you!"},
+ {"role": "user", "content": "Can I ask a question?"}
+]
+```
+
+这是`add_generation_prompt=False`的结果,使用ChatML模板:
+```python
+tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
+"""<|im_start|>user
+Hi there!<|im_end|>
+<|im_start|>assistant
+Nice to meet you!<|im_end|>
+<|im_start|>user
+Can I ask a question?<|im_end|>
+"""
+```
+
+下面这是`add_generation_prompt=True`的结果:
+
+```python
+tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+"""<|im_start|>user
+Hi there!<|im_end|>
+<|im_start|>assistant
+Nice to meet you!<|im_end|>
+<|im_start|>user
+Can I ask a question?<|im_end|>
+<|im_start|>assistant
+"""
+```
+
+这一次我们添加了模型开始答复的标记。这可以确保模型生成文本时只会给出答复,而不会做出意外的行为,比如继续用户的消息。
+记住,聊天模型只是语言模型,它们被训练来继续文本,而聊天对它们来说只是一种特殊的文本!
+你需要用适当的控制标记来引导它们,让它们知道自己应该做什么。
+
+并非所有模型都需要生成提示。一些模型,如BlenderBot和LLaMA,在模型回复之前没有任何特殊标记。
+在这些情况下,`add_generation_prompt`参数将不起作用。`add_generation_prompt`的具体效果取决于所使用的模板。
+
+## 我可以在训练中使用聊天模板吗?
+
+可以!我们建议您将聊天模板应用为数据集的预处理步骤。之后,您可以像进行任何其他语言模型训练任务一样继续。
+在训练时,通常应该设置`add_generation_prompt=False`,因为添加的助手标记在训练过程中并不会有帮助。
+让我们看一个例子:
+
+```python
+from transformers import AutoTokenizer
+from datasets import Dataset
+
+tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
+
+chat1 = [
+ {"role": "user", "content": "Which is bigger, the moon or the sun?"},
+ {"role": "assistant", "content": "The sun."}
+]
+chat2 = [
+ {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
+ {"role": "assistant", "content": "A bacterium."}
+]
+
+dataset = Dataset.from_dict({"chat": [chat1, chat2]})
+dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
+print(dataset['formatted_chat'][0])
+```
+结果是:
+```text
+<|user|>
+Which is bigger, the moon or the sun?
+<|assistant|>
+The sun.
+```
+
+这样,后面你可以使用`formatted_chat`列,跟标准语言建模任务中一样训练即可。
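+
+下面是一个最小示意(延续上面的例子,假设后续按标准的因果语言建模流程继续训练),演示如何把 `formatted_chat` 列分词:
+
+```python
+# 最小示意:把 formatted_chat 列分词,得到标准语言建模训练所需的输入
+tokenized_dataset = dataset.map(
+    lambda x: tokenizer(x["formatted_chat"], truncation=True, max_length=512)
+)
+print(tokenized_dataset[0]["input_ids"][:10])
+```
+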
+## 高级:聊天模板是如何工作的?
+
+模型的聊天模板存储在`tokenizer.chat_template`属性上。如果没有设置,则将使用该模型的默认模板。
+让我们来看看`BlenderBot`的模板:
+```python
+
+>>> from transformers import AutoTokenizer
+>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
+
+>>> tokenizer.default_chat_template
+"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
+```
+
+这看着有点复杂。让我们添加一些换行和缩进,使其更易读。
+请注意,默认情况下忽略每个块后的第一个换行以及块之前的任何前导空格,
+这是通过 Jinja 的`trim_blocks`和`lstrip_blocks`选项实现的。
+这里,请注意空格的使用。我们强烈建议您仔细检查模板是否打印了多余的空格!
+```
+{% for message in messages %}
+ {% if message['role'] == 'user' %}
+ {{ ' ' }}
+ {% endif %}
+ {{ message['content'] }}
+ {% if not loop.last %}
+ {{ ' ' }}
+ {% endif %}
+{% endfor %}
+{{ eos_token }}
+```
+
+如果你之前不了解[Jinja template](https://jinja.palletsprojects.com/en/3.1.x/templates/)。
+Jinja是一种模板语言,允许你编写简单的代码来生成文本。
+在许多方面,代码和语法类似于Python。在纯Python中,这个模板看起来会像这样:
+```python
+for idx, message in enumerate(messages):
+ if message['role'] == 'user':
+ print(' ')
+ print(message['content'])
+ if not idx == len(messages) - 1: # Check for the last message in the conversation
+ print(' ')
+print(eos_token)
+```
+
+这里使用Jinja模板处理如下三步:
+1. 对于每条消息,如果消息是用户消息,则在其前面添加一个空格,否则不打印任何内容
+2. 添加消息内容
+3. 如果消息不是最后一条,请在其后添加两个空格。在最后一条消息之后,打印`EOS`。
+
+这是一个简单的模板,它不添加任何控制tokens,也不支持`system`消息(常用于指导模型在后续对话中如何表现)。
+但 Jinja 给了你很大的灵活性来做这些事情!让我们看一个 Jinja 模板,
+它可以实现类似于LLaMA的prompt输入(请注意,真正的LLaMA模板包括`system`消息,请不要在实际代码中使用这个简单模板!)
+```
+{% for message in messages %}
+ {% if message['role'] == 'user' %}
+ {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }}
+ {% elif message['role'] == 'system' %}
+ {{ '<<SYS>>\\n' + message['content'] + '\\n<</SYS>>\\n\\n' }}
+ {% elif message['role'] == 'assistant' %}
+ {{ ' ' + message['content'] + ' ' + eos_token }}
+ {% endif %}
+{% endfor %}
+```
+
+这里稍微看一下,就能明白这个模板的作用:它根据每条消息的“角色”添加对应的消息。
+`user`、`assistant`、`system`的消息需要分别处理,因为它们代表不同的角色输入。
+
+## 高级:编辑聊天模板
+
+### 如何创建聊天模板?
+
+很简单,你只需编写一个jinja模板并设置`tokenizer.chat_template`。你也可以从一个现有模板开始,只需要简单编辑便可以!
+例如,我们可以采用上面的LLaMA模板,并在助手消息中添加"[ASST]"和"[/ASST]":
+```
+{% for message in messages %}
+ {% if message['role'] == 'user' %}
+ {{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}
+ {% elif message['role'] == 'system' %}
+ {{ '<<SYS>>\\n' + message['content'].strip() + '\\n<</SYS>>\\n\\n' }}
+ {% elif message['role'] == 'assistant' %}
+ {{ '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }}
+ {% endif %}
+{% endfor %}
+```
+
+现在,只需设置`tokenizer.chat_template`属性。下次使用[`~PreTrainedTokenizer.apply_chat_template`]时,它将使用您的新模板!
+此属性将保存在`tokenizer_config.json`文件中,因此您可以使用[`~utils.PushToHubMixin.push_to_hub`]将新模板上传到 Hub,
+这样每个人都可以使用你模型的模板!
+
+```python
+template = tokenizer.chat_template
+template = template.replace("SYS", "SYSTEM") # Change the system token
+tokenizer.chat_template = template # Set the new template
+tokenizer.push_to_hub("model_name") # Upload your new template to the Hub!
+```
+
+由于[`~PreTrainedTokenizer.apply_chat_template`]方法是由[`ConversationalPipeline`]类调用,
+因此一旦你设置了聊天模板,您的模型将自动与[`ConversationalPipeline`]兼容。
+### “默认”模板是什么?
+
+在引入聊天模板(chat_template)之前,聊天prompt是在各个模型类中通过硬编码处理的。为了向后兼容,我们保留了这种硬编码的处理方式,作为默认模板。
+如果一个模型没有设置聊天模板,但其模型类有默认模板,`ConversationalPipeline`类和`apply_chat_template`等方法将使用该默认模板。
+您可以通过检查`tokenizer.default_chat_template`属性来查找`tokenizer`的默认模板。
+
+这是我们纯粹为了向后兼容而做的事情,以避免破坏任何现有的工作流程。即使默认的聊天模板适用于您的模型,
+我们也强烈建议通过显式设置`chat_template`属性来覆盖默认模板,以便向用户清楚地表明您的模型已经正确配置了聊天模板,
+同时也可以防止未来默认模板被修改或弃用时出现问题。
+### 我应该使用哪个模板?
+
+在为已经训练过的聊天模型设置模板时,您应确保模板与模型在训练期间看到的消息格式完全匹配,否则可能会导致性能下降。
+即使您继续对模型进行训练,也应保持聊天模板不变,这样可能会获得最佳性能。
+这与`tokenization`非常类似,在推断时,你选用跟训练时一样的`tokenization`,通常会获得最佳性能。
+
+如果您从头开始训练模型,或者在微调基础语言模型进行聊天时,您有很大的自由选择适当的模板!
+LLMs足够聪明,可以学会处理许多不同的输入格式。我们为没有特定类别模板的模型提供一个默认模板,该模板遵循
+[ChatML format](https://github.com/openai/openai-python/blob/main/chatml.md)格式要求,对于许多用例来说,
+这是一个很好的、灵活的选择。
+
+默认模板看起来像这样:
+
+```
+{% for message in messages %}
+ {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}
+{% endfor %}
+```
+
+
+如果您喜欢这个模板,下面是一行代码的模板形式,它可以直接复制到您的代码中。这一行代码还包括了[generation prompts](#什么是"generation prompts"?),
+但请注意它不会添加`BOS`或`EOS`token。
+如果您的模型需要这些token,它们不会被`apply_chat_template`自动添加,换句话说,文本的默认处理参数是`add_special_tokens=False`。
+这是为了避免模板和`add_special_tokens`逻辑产生冲突,如果您的模型需要特殊tokens,请确保将它们添加到模板中!
+
+```
+tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
+```
+
+该模板将每条消息包装在`<|im_start|>`和`<|im_end|>`tokens里面,并将角色简单地写为字符串,这样可以灵活地训练角色。输出如下:
+```text
+<|im_start|>system
+You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
+<|im_start|>user
+How are you?<|im_end|>
+<|im_start|>assistant
+I'm doing great!<|im_end|>
+```
+
+`user`,`system`和`assistant`是对话助手模型的标准角色,如果您的模型要与[`ConversationalPipeline`]兼容,我们建议你使用这些角色。
+但您可以不局限于这些角色,模板非常灵活,任何字符串都可以成为角色。
+
+### 如何添加聊天模板?
+
+如果您有任何聊天模型,您应该设置它们的`tokenizer.chat_template`属性,并使用[`~PreTrainedTokenizer.apply_chat_template`]测试,
+然后将更新后的`tokenizer`推送到 Hub。
+即使您不是模型所有者,如果您正在使用一个空的聊天模板或者仍在使用默认的聊天模板,
+请发起一个[pull request](https://huggingface.co/docs/hub/repositories-pull-requests-discussions),以便正确设置该属性!
+
+一旦属性设置完成,就完成了!`tokenizer.apply_chat_template`现在将在该模型中正常工作,
+这意味着它也会在诸如`ConversationalPipeline`这样的地方自动得到支持!
+
+通过确保模型具有这一属性,我们可以确保整个社区都能充分利用开源模型的全部功能。
+格式不匹配已经困扰这个领域并悄悄地损害了性能太久了,是时候结束它们了!
+
+
+## 高级:模板写作技巧
+
+如果你对Jinja不熟悉,我们通常发现编写聊天模板的最简单方法是先编写一个简短的Python脚本,按照你想要的方式格式化消息,然后将该脚本转换为模板。
+
+请记住,模板处理程序将接收对话历史作为名为`messages`的变量。每条`message`都是一个带有两个键`role`和`content`的字典。
+您可以在模板中像在Python中一样访问`messages`,这意味着您可以使用`{% for message in messages %}`进行循环,
+或者例如使用`{{ messages[0] }}`访问单个消息。
+
+您也可以使用以下提示将您的代码转换为Jinja:
+### For循环
+
+在Jinja中,for循环看起来像这样:
+
+```
+{% for message in messages %}
+{{ message['content'] }}
+{% endfor %}
+```
+
+请注意,`{{ expression block }}`中的内容将被打印到输出。您可以在表达式块中使用像`+`这样的运算符来组合字符串。
+### If语句
+
+Jinja中的if语句如下所示:
+
+```
+{% if message['role'] == 'user' %}
+{{ message['content'] }}
+{% endif %}
+```
+注意Jinja使用`{% endfor %}`和`{% endif %}`来表示`for`和`if`的结束。
+
+### 特殊变量
+
+在您的模板中,您将可以访问`messages`列表,但您还可以访问其他几个特殊变量。
+这些包括特殊`token`,如`bos_token`和`eos_token`,以及我们上面讨论过的`add_generation_prompt`变量。
+您还可以使用`loop`变量来访问有关当前循环迭代的信息,例如使用`{% if loop.last %}`来检查当前消息是否是对话中的最后一条消息。
+
+以下示例会在 `add_generation_prompt=True` 时,在对话末尾添加 generation prompt:
+
+
+```
+{% if loop.last and add_generation_prompt %}
+{{ bos_token + 'Assistant:\n' }}
+{% endif %}
+```
+
+### 空格的注意事项
+
+我们已经尽可能尝试让Jinja忽略除`{{ expressions }}`之外的空格。
+然而,请注意Jinja是一个通用的模板引擎,它可能会将同一行文本块之间的空格视为重要,并将其打印到输出中。
+我们**强烈**建议在上传模板之前检查一下,确保模板没有在不应该的地方打印额外的空格!
From c29135046ab2c9c8a67fd56d92d7254ea13c794b Mon Sep 17 00:00:00 2001
From: David Nguyen
Date: Mon, 26 Feb 2024 23:42:46 +0700
Subject: [PATCH 019/549] [i18n-vi] Translate README.md to Vietnamese (#29229)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* Add Tiếng Việt language support
* Add Vietnamese translation link to README.md
* update README_vi.md
---
README.md | 1 +
README_de.md | 1 +
README_es.md | 1 +
README_fr.md | 1 +
README_hd.md | 1 +
README_ja.md | 1 +
README_ko.md | 1 +
README_pt-br.md | 1 +
README_ru.md | 1 +
README_te.md | 1 +
README_vi.md | 579 ++++++++++++++++++++++++++++++++++++++++++++++
README_zh-hans.md | 1 +
README_zh-hant.md | 1 +
13 files changed, 591 insertions(+)
create mode 100644 README_vi.md
diff --git a/README.md b/README.md
index b3426b64dd242c..8b688d8446e64e 100644
--- a/README.md
+++ b/README.md
@@ -57,6 +57,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_de.md b/README_de.md
index f21bebdc781120..71ff7ce4aa337c 100644
--- a/README_de.md
+++ b/README_de.md
@@ -57,6 +57,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_es.md b/README_es.md
index 9130f823b7d3ee..cebe43cb91ec7d 100644
--- a/README_es.md
+++ b/README_es.md
@@ -52,6 +52,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_fr.md b/README_fr.md
index 00a2afbf812262..39bd0f8df05c4d 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -57,6 +57,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_hd.md b/README_hd.md
index 3cbc90197d3e59..fee9a2c44bb1f0 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -77,6 +77,7 @@ checkpoint: जाँच बिंदु
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_ja.md b/README_ja.md
index c7c76591976610..b350abb6eaa6af 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -87,6 +87,7 @@ user: ユーザ
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_ko.md b/README_ko.md
index 8629b5a57c198d..4f714eaafbcf4c 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -52,6 +52,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_pt-br.md b/README_pt-br.md
index 40841bd82b9f8a..684d96366aaf17 100644
--- a/README_pt-br.md
+++ b/README_pt-br.md
@@ -57,6 +57,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_ru.md b/README_ru.md
index 1c0f4d41c75592..e552b5cd4f90f5 100644
--- a/README_ru.md
+++ b/README_ru.md
@@ -57,6 +57,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_te.md b/README_te.md
index 2c0b97dada67ed..8da790e1820460 100644
--- a/README_te.md
+++ b/README_te.md
@@ -59,6 +59,7 @@ limitations under the License.
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_vi.md b/README_vi.md
new file mode 100644
index 00000000000000..9ccd5118b6e4f4
--- /dev/null
+++ b/README_vi.md
@@ -0,0 +1,579 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ English |
+ 简体中文 |
+ 繁體中文 |
+ 한국어 |
+ Español |
+ 日本語 |
+ हिन्दी |
+ Русский |
+ Рortuguês |
+ తెలుగు |
+ Français |
+ Deutsch |
+ Tiếng Việt |
+
+
+
+
+ Công nghệ Học máy tiên tiến cho JAX, PyTorch và TensorFlow
+
+
+
+
+
+
+🤗 Transformers cung cấp hàng ngàn mô hình được huấn luyện trước để thực hiện các nhiệm vụ trên các modalities khác nhau như văn bản, hình ảnh và âm thanh.
+
+Các mô hình này có thể được áp dụng vào:
+
+* 📝 Văn bản, cho các nhiệm vụ như phân loại văn bản, trích xuất thông tin, trả lời câu hỏi, tóm tắt, dịch thuật và sinh văn bản, trong hơn 100 ngôn ngữ.
+* 🖼️ Hình ảnh, cho các nhiệm vụ như phân loại hình ảnh, nhận diện đối tượng và phân đoạn.
+* 🗣️ Âm thanh, cho các nhiệm vụ như nhận dạng giọng nói và phân loại âm thanh.
+
+Các mô hình Transformer cũng có thể thực hiện các nhiệm vụ trên **nhiều modalities kết hợp**, như trả lời câu hỏi về bảng, nhận dạng ký tự quang học, trích xuất thông tin từ tài liệu quét, phân loại video và trả lời câu hỏi hình ảnh.
+
+🤗 Transformers cung cấp các API để tải xuống và sử dụng nhanh chóng các mô hình được huấn luyện trước đó trên văn bản cụ thể, điều chỉnh chúng trên tập dữ liệu của riêng bạn và sau đó chia sẻ chúng với cộng đồng trên [model hub](https://huggingface.co/models) của chúng tôi. Đồng thời, mỗi module python xác định một kiến trúc là hoàn toàn độc lập và có thể được sửa đổi để cho phép thực hiện nhanh các thí nghiệm nghiên cứu.
+
+🤗 Transformers được hỗ trợ bởi ba thư viện học sâu phổ biến nhất — [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) và [TensorFlow](https://www.tensorflow.org/) — với tích hợp mượt mà giữa chúng. Việc huấn luyện mô hình của bạn với một thư viện trước khi tải chúng để sử dụng trong suy luận với thư viện khác là rất dễ dàng.
+
+## Các demo trực tuyến
+
+Bạn có thể kiểm tra hầu hết các mô hình của chúng tôi trực tiếp trên trang của chúng từ [model hub](https://huggingface.co/models). Chúng tôi cũng cung cấp [dịch vụ lưu trữ mô hình riêng tư, phiên bản và API suy luận](https://huggingface.co/pricing) cho các mô hình công khai và riêng tư.
+
+Dưới đây là một số ví dụ:
+
+Trong Xử lý Ngôn ngữ Tự nhiên:
+- [Hoàn thành từ bị che với BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
+- [Nhận dạng thực thể đặt tên với Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
+- [Tạo văn bản tự nhiên với Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
+- [Suy luận Ngôn ngữ Tự nhiên với RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal)
+- [Tóm tắt văn bản với BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct)
+- [Trả lời câu hỏi với DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species)
+- [Dịch văn bản với T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin)
+
+Trong Thị giác Máy tính:
+- [Phân loại hình ảnh với ViT](https://huggingface.co/google/vit-base-patch16-224)
+- [Phát hiện đối tượng với DETR](https://huggingface.co/facebook/detr-resnet-50)
+- [Phân đoạn ngữ nghĩa với SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512)
+- [Phân đoạn toàn diện với Mask2Former](https://huggingface.co/facebook/mask2former-swin-large-coco-panoptic)
+- [Ước lượng độ sâu với Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)
+- [Phân loại video với VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)
+- [Phân đoạn toàn cầu với OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large)
+
+Trong âm thanh:
+- [Nhận dạng giọng nói tự động với Whisper](https://huggingface.co/openai/whisper-large-v3)
+- [Phát hiện từ khóa với Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks)
+- [Phân loại âm thanh với Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593)
+
+Trong các nhiệm vụ đa phương thức:
+- [Trả lời câu hỏi về bảng với TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq)
+- [Trả lời câu hỏi hình ảnh với ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa)
+- [Mô tả hình ảnh với LLaVa](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
+- [Phân loại hình ảnh không cần nhãn với SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384)
+- [Trả lời câu hỏi văn bản tài liệu với LayoutLM](https://huggingface.co/impira/layoutlm-document-qa)
+- [Phân loại video không cần nhãn với X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)
+- [Phát hiện đối tượng không cần nhãn với OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2)
+- [Phân đoạn hình ảnh không cần nhãn với CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)
+- [Tạo mặt nạ tự động với SAM](https://huggingface.co/docs/transformers/model_doc/sam)
+
+
+## 100 dự án sử dụng Transformers
+
+Transformers không chỉ là một bộ công cụ để sử dụng các mô hình được huấn luyện trước: đó là một cộng đồng các dự án xây dựng xung quanh nó và Hugging Face Hub. Chúng tôi muốn Transformers giúp các nhà phát triển, nhà nghiên cứu, sinh viên, giáo sư, kỹ sư và bất kỳ ai khác xây dựng những dự án mơ ước của họ.
+
+Để kỷ niệm 100.000 sao của transformers, chúng tôi đã quyết định tập trung vào cộng đồng và tạo ra trang [awesome-transformers](./awesome-transformers.md) liệt kê 100 dự án tuyệt vời được xây dựng xung quanh transformers.
+
+Nếu bạn sở hữu hoặc sử dụng một dự án mà bạn tin rằng nên được thêm vào danh sách, vui lòng mở một PR để thêm nó!
+
+## Nếu bạn đang tìm kiếm hỗ trợ tùy chỉnh từ đội ngũ Hugging Face
+
+
+
+
+
+## Hành trình nhanh
+
+Để ngay lập tức sử dụng một mô hình trên một đầu vào cụ thể (văn bản, hình ảnh, âm thanh, ...), chúng tôi cung cấp API `pipeline`. Pipelines nhóm một mô hình được huấn luyện trước với quá trình tiền xử lý đã được sử dụng trong quá trình huấn luyện của mô hình đó. Dưới đây là cách sử dụng nhanh một pipeline để phân loại văn bản tích cực so với tiêu cực:
+
+```python
+>>> from transformers import pipeline
+
+# Cấp phát một pipeline cho phân tích cảm xúc
+>>> classifier = pipeline('sentiment-analysis')
+>>> classifier('We are very happy to introduce pipeline to the transformers repository.')
+[{'label': 'POSITIVE', 'score': 0.9996980428695679}]
+```
+
+Dòng code thứ hai tải xuống và lưu trữ bộ mô hình được huấn luyện được sử dụng bởi pipeline, trong khi dòng thứ ba đánh giá nó trên văn bản đã cho. Ở đây, câu trả lời là "tích cực" với độ tin cậy là 99,97%.
+
+Nhiều nhiệm vụ có sẵn một `pipeline` được huấn luyện trước, trong NLP nhưng cũng trong thị giác máy tính và giọng nói. Ví dụ, chúng ta có thể dễ dàng trích xuất các đối tượng được phát hiện trong một hình ảnh:
+
+``` python
+>>> import requests
+>>> from PIL import Image
+>>> from transformers import pipeline
+
+# Tải xuống một hình ảnh với những con mèo dễ thương
+>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
+>>> image_data = requests.get(url, stream=True).raw
+>>> image = Image.open(image_data)
+
+# Cấp phát một pipeline cho phát hiện đối tượng
+>>> object_detector = pipeline('object-detection')
+>>> object_detector(image)
+[{'score': 0.9982201457023621,
+ 'label': 'remote',
+ 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
+ {'score': 0.9960021376609802,
+ 'label': 'remote',
+ 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},
+ {'score': 0.9954745173454285,
+ 'label': 'couch',
+ 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},
+ {'score': 0.9988006353378296,
+ 'label': 'cat',
+ 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},
+ {'score': 0.9986783862113953,
+ 'label': 'cat',
+ 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]
+```
+
+Ở đây, chúng ta nhận được một danh sách các đối tượng được phát hiện trong hình ảnh, với một hộp bao quanh đối tượng và một điểm đánh giá độ tin cậy. Đây là hình ảnh gốc ở bên trái, với các dự đoán hiển thị ở bên phải:
+
+
+
+
+
+
+Bạn có thể tìm hiểu thêm về các nhiệm vụ được hỗ trợ bởi API `pipeline` trong [hướng dẫn này](https://huggingface.co/docs/transformers/task_summary).
+
+Ngoài `pipeline`, để tải xuống và sử dụng bất kỳ mô hình được huấn luyện trước nào cho nhiệm vụ cụ thể của bạn, chỉ cần ba dòng code. Đây là phiên bản PyTorch:
+```python
+>>> from transformers import AutoTokenizer, AutoModel
+
+>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
+
+>>> inputs = tokenizer("Hello world!", return_tensors="pt")
+>>> outputs = model(**inputs)
+```
+
+Và đây là mã tương đương cho TensorFlow:
+```python
+>>> from transformers import AutoTokenizer, TFAutoModel
+
+>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased")
+
+>>> inputs = tokenizer("Hello world!", return_tensors="tf")
+>>> outputs = model(**inputs)
+```
+
+Tokenizer là thành phần chịu trách nhiệm cho việc tiền xử lý mà mô hình được huấn luyện trước mong đợi và có thể được gọi trực tiếp trên một chuỗi đơn (như trong các ví dụ trên) hoặc một danh sách. Nó sẽ xuất ra một từ điển mà bạn có thể sử dụng trong mã phụ thuộc hoặc đơn giản là truyền trực tiếp cho mô hình của bạn bằng cách sử dụng toán tử ** để giải nén đối số.
+
+Chính mô hình là một [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) thông thường hoặc một [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (tùy thuộc vào backend của bạn) mà bạn có thể sử dụng như bình thường. [Hướng dẫn này](https://huggingface.co/docs/transformers/training) giải thích cách tích hợp một mô hình như vậy vào một vòng lặp huấn luyện cổ điển PyTorch hoặc TensorFlow, hoặc cách sử dụng API `Trainer` của chúng tôi để tinh chỉnh nhanh chóng trên một bộ dữ liệu mới.
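+
+Một bản phác thảo tối giản (giả định bạn đã có `tokenized_datasets` cho bài toán phân loại văn bản) về cách tinh chỉnh nhanh với API `Trainer`:
+
+```python
+from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
+
+# Bản phác thảo tối giản: giả định tokenized_datasets đã được chuẩn bị sẵn
+model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=2)
+args = TrainingArguments(output_dir="output", num_train_epochs=1)
+trainer = Trainer(model=model, args=args, train_dataset=tokenized_datasets["train"])
+trainer.train()
+```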
+
+## Tại sao tôi nên sử dụng transformers?
+
+1. Các mô hình tiên tiến dễ sử dụng:
+ - Hiệu suất cao trong việc hiểu và tạo ra ngôn ngữ tự nhiên, thị giác máy tính và âm thanh.
+ - Ngưỡng vào thấp cho giảng viên và người thực hành.
+ - Ít trừu tượng dành cho người dùng với chỉ ba lớp học.
+ - Một API thống nhất để sử dụng tất cả các mô hình được huấn luyện trước của chúng tôi.
+
+2. Giảm chi phí tính toán, làm giảm lượng khí thải carbon:
+ - Các nhà nghiên cứu có thể chia sẻ các mô hình đã được huấn luyện thay vì luôn luôn huấn luyện lại.
+ - Người thực hành có thể giảm thời gian tính toán và chi phí sản xuất.
+ - Hàng chục kiến trúc với hơn 400.000 mô hình được huấn luyện trước trên tất cả các phương pháp.
+
+3. Lựa chọn framework phù hợp cho mọi giai đoạn của mô hình:
+ - Huấn luyện các mô hình tiên tiến chỉ trong 3 dòng code.
+ - Di chuyển một mô hình duy nhất giữa các framework TF2.0/PyTorch/JAX theo ý muốn.
+ - Dễ dàng chọn framework phù hợp cho huấn luyện, đánh giá và sản xuất.
+
+4. Dễ dàng tùy chỉnh một mô hình hoặc một ví dụ theo nhu cầu của bạn:
+ - Chúng tôi cung cấp các ví dụ cho mỗi kiến trúc để tái tạo kết quả được công bố bởi các tác giả gốc.
+ - Các thành phần nội tại của mô hình được tiết lộ một cách nhất quán nhất có thể.
+ - Các tệp mô hình có thể được sử dụng độc lập với thư viện để thực hiện các thử nghiệm nhanh chóng.
+
+## Tại sao tôi không nên sử dụng transformers?
+
+- Thư viện này không phải là một bộ công cụ mô-đun gồm các khối xây dựng mạng neural. Mã trong các tệp mô hình không được tái cấu trúc với các trừu tượng bổ sung một cách cố ý, để các nhà nghiên cứu có thể lặp nhanh trên từng mô hình mà không cần đào sâu vào các trừu tượng/tệp bổ sung.
+- API huấn luyện không được thiết kế để hoạt động trên bất kỳ mô hình nào, mà được tối ưu hóa để hoạt động với các mô hình được cung cấp bởi thư viện. Đối với vòng lặp học máy chung, bạn nên sử dụng một thư viện khác (có thể là [Accelerate](https://huggingface.co/docs/accelerate)).
+- Mặc dù chúng tôi cố gắng trình bày càng nhiều trường hợp sử dụng càng tốt, nhưng các tập lệnh trong thư mục [examples](https://github.com/huggingface/transformers/tree/main/examples) chỉ là ví dụ. Dự kiến rằng chúng sẽ không hoạt động ngay tức khắc trên vấn đề cụ thể của bạn và bạn sẽ phải thay đổi một số dòng mã để thích nghi với nhu cầu của bạn.
+
+## Cài đặt
+
+### Sử dụng pip
+
+Thư viện này được kiểm tra trên Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ và TensorFlow 2.6+.
+
+Bạn nên cài đặt 🤗 Transformers trong một [môi trường ảo Python](https://docs.python.org/3/library/venv.html). Nếu bạn chưa quen với môi trường ảo Python, hãy xem [hướng dẫn sử dụng](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
+
+Trước tiên, tạo một môi trường ảo với phiên bản Python bạn sẽ sử dụng và kích hoạt nó.
+
+Sau đó, bạn sẽ cần cài đặt ít nhất một trong số các framework Flax, PyTorch hoặc TensorFlow.
+Vui lòng tham khảo [trang cài đặt TensorFlow](https://www.tensorflow.org/install/), [trang cài đặt PyTorch](https://pytorch.org/get-started/locally/#start-locally) và/hoặc [Flax](https://github.com/google/flax#quick-install) và [Jax](https://github.com/google/jax#installation) để biết lệnh cài đặt cụ thể cho nền tảng của bạn.
+
+Khi đã cài đặt một trong các backend đó, 🤗 Transformers có thể được cài đặt bằng pip như sau:
+
+```bash
+pip install transformers
+```
+
+Nếu bạn muốn thực hiện các ví dụ hoặc cần phiên bản mới nhất của mã và không thể chờ đợi cho một phiên bản mới, bạn phải [cài đặt thư viện từ nguồn](https://huggingface.co/docs/transformers/installation#installing-from-source).
+
+### Với conda
+
+🤗 Transformers có thể được cài đặt bằng conda như sau:
+
+```shell script
+conda install conda-forge::transformers
+```
+
+> **_GHI CHÚ:_** Cài đặt `transformers` từ kênh `huggingface` đã bị lỗi thời.
+
+Hãy làm theo trang cài đặt của Flax, PyTorch hoặc TensorFlow để xem cách cài đặt chúng bằng conda.
+
+> **_GHI CHÚ:_** Trên Windows, bạn có thể được yêu cầu kích hoạt Chế độ phát triển để tận dụng việc lưu cache. Nếu điều này không phải là một lựa chọn cho bạn, hãy cho chúng tôi biết trong [vấn đề này](https://github.com/huggingface/huggingface_hub/issues/1062).
+
+## Model architectures
+
+**[All the model checkpoints](https://huggingface.co/models)** provided by 🤗 Transformers are seamlessly integrated from the huggingface.co [model hub](https://huggingface.co/models), where they are uploaded directly by [users](https://huggingface.co/users) and [organizations](https://huggingface.co/organizations).
+
+Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen)
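+
+As a quick illustration of how these checkpoints are used, the sketch below loads one of them with the Auto classes; `bert-base-uncased` is just one example checkpoint among many, and PyTorch is assumed to be installed:
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+# Download the tokenizer and weights from the hub (cached locally afterwards).
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+model = AutoModel.from_pretrained("bert-base-uncased")
+
+inputs = tokenizer("Hello world!", return_tensors="pt")
+outputs = model(**inputs)  # includes last_hidden_state, pooler_output, ...
+```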
+
+🤗 Transformers currently provides the following architectures (see [here](https://huggingface.co/docs/transformers/model_summary) for a high-level summary of each of them):
+
+1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (từ Google Research và Toyota Technological Institute tại Chicago) được phát hành với bài báo [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), của Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
+1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (từ Google Research) được phát hành với bài báo [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) của Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig.
+1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (từ BAAI) được phát hành với bài báo [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) của Chen, Zhongzhi và Liu, Guang và Zhang, Bo-Wen và Ye, Fulong và Yang, Qinghong và Wu, Ledell.
+1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (từ MIT) được phát hành với bài báo [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) của Yuan Gong, Yu-An Chung, James Glass.
+1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (từ Đại học Tsinghua) được phát hành với bài báo [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) của Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (từ Suno) được phát hành trong kho lưu trữ [suno-ai/bark](https://github.com/suno-ai/bark) bởi đội ngũ Suno AI.
+1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (từ Facebook) được phát hành với bài báo [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) của Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov và Luke Zettlemoyer.
+1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (từ École polytechnique) được phát hành với bài báo [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) của Moussa Kamal Eddine, Antoine J.-P. Tixier và Michalis Vazirgiannis.
+1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (từ VinAI Research) được phát hành với bài báo [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) của Nguyen Luong Tran, Duong Minh Le và Dat Quoc Nguyen.
+1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (từ Microsoft) được phát hành với bài báo [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) của Hangbo Bao, Li Dong, Furu Wei.
+1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (từ Google) được phát hành với bài báo [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) của Jacob Devlin, Ming-Wei Chang, Kenton Lee và Kristina Toutanova.
+1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (từ Google) được phát hành với bài báo [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) của Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (từ VinAI Research) được phát hành với bài báo [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) của Dat Quoc Nguyen, Thanh Vu và Anh Tuan Nguyen.
+1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (từ Google Research) được phát hành với bài báo [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) của Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang và Amr Ahmed.
+1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (từ Google Research) được phát hành với bài báo [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) của Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang và Amr Ahmed.
+1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (từ Microsoft Research AI4Science) được phát hành với bài báo [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.
+1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby.
+1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (từ Salesforce) được phát hành với bài báo [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) của Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
+1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (từ Salesforce) được phát hành với bài báo [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.
+1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (từ BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/).
+1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (từ Alexa) được phát hành với bài báo [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
+1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (từ Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) được phát hành với bài báo [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan.
+1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (từ NAVER CLOVA) được phát hành với bài báo [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park.
+1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (từ Google Research) được phát hành với bài báo [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
+1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (từ Inria/Facebook/Sorbonne) được phát hành với bài báo [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (từ Google Research) được phát hành với bài báo [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
+1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (từ OFA-Sys) được phát hành với bài báo [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (từ LAION-AI) được phát hành với bài báo [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
+1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (từ OpenAI) được phát hành với bài báo [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
+1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (từ University of Göttingen) được phát hành với bài báo [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
+1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** được phát hành với bài báo [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
+1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (từ Salesforce) được phát hành với bài báo [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
+1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (từ MetaAI) được phát hành với bài báo [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.
+1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (từ Microsoft Research Asia) được phát hành với bài báo [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
+1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (từ YituTech) được phát hành với bài báo [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
+1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (từ Facebook AI) được phát hành với bài báo [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
+1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (từ Facebook AI) được phát hành với bài báo [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie.
+1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (từ Tsinghua University) được phát hành với bài báo [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
+1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (từ OpenBMB) released by the [OpenBMB](https://www.openbmb.org/).
+1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (từ Salesforce) được phát hành với bài báo [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
+1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (từ Microsoft) được phát hành với bài báo [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.
+1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (từ Facebook) được phát hành với bài báo [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
+1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (từ Microsoft) được phát hành với bài báo [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (từ Microsoft) được phát hành với bài báo [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (từ Berkeley/Facebook/Google) được phát hành với bài báo [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
+1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (từ SenseTime Research) được phát hành với bài báo [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai.
+1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (từ Facebook) được phát hành với bài báo [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
+1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (từ Google AI) được phát hành với bài báo [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun.
+1. **[Depth Anything](https://huggingface.co/docs/transformers/model_doc/depth_anything)** (từ University of Hong Kong and TikTok) được phát hành với bài báo [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
+1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (từ The University of Texas at Austin) được phát hành với bài báo [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl.
+1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (từ Facebook) được phát hành với bài báo [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
+1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (từ Microsoft Research) được phát hành với bài báo [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
+1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (từ SHI Labs) được phát hành với bài báo [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi.
+1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (từ Meta AI) được phát hành với bài báo [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
+1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (từ HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
+1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (từ Microsoft Research) được phát hành với bài báo [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
+1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (từ NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
+1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (từ Facebook) được phát hành với bài báo [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
+1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (từ Intel Labs) được phát hành với bài báo [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
+1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (từ Snap Research) được phát hành với bài báo [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
+1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (từ Google Brain) được phát hành với bài báo [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
+1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (từ Google Research/Stanford University) được phát hành với bài báo [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
+1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (từ Meta AI) được phát hành với bài báo [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.
+1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (từ Google Research) được phát hành với bài báo [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (từ Baidu) được phát hành với bài báo [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu.
+1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (từ Baidu) được phát hành với bài báo [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
+1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives.
+1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (từ Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme.
+1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (từ ESPnet) được phát hành với bài báo [Recent Developments On Espnet Toolkit Boosted By Conformer](https://arxiv.org/abs/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
+1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (từ Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
+1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (từ Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
+1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (từ CNRS) được phát hành với bài báo [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (từ Facebook AI) được phát hành với bài báo [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
+1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (từ Google Research) được phát hành với bài báo [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
+1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (từ Microsoft Research) được phát hành với bài báo [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
+1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (từ CMU/Google Brain) được phát hành với bài báo [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
+1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/fuyu-8b) by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
+1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (từ Google) được phát hành với bài báo [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
+1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (từ Microsoft Research) được phát hành với bài báo [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
+1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (từ KAIST) được phát hành với bài báo [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
+1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (từ OpenAI) được phát hành với bài báo [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
+1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (từ EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
+1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (từ EleutherAI) được phát hành với bài báo [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
+1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (từ ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
+1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (từ OpenAI) được phát hành với bài báo [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
+1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (từ EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
+1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (từ AI-Sweden) được phát hành với bài báo [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
+1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (từ BigCode) được phát hành với bài báo [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
+1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama).
+1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (từ Microsoft) được phát hành với bài báo [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
+1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (từ UCSD, NVIDIA) được phát hành với bài báo [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
+1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (từ Allegro.pl, AGH University of Science and Technology) được phát hành với bài báo [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
+1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (từ Facebook) được phát hành với bài báo [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
+1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (từ Berkeley) được phát hành với bài báo [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (từ HuggingFace) được phát hành với bài báo [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh.
+1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (từ OpenAI) được phát hành với bài báo [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
+1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (từ Beihang University, UC Berkeley, Rutgers University, SEDD Company) được phát hành với bài báo [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
+1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (từ Salesforce) được phát hành với bài báo [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
+1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (từ OpenAI) được phát hành với bài báo [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
+1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (từ Microsoft Research Asia) được phát hành với bài báo [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
+1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (từ Microsoft Research Asia) được phát hành với bài báo [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
+1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (từ Microsoft Research Asia) được phát hành với bài báo [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
+1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (từ Microsoft Research Asia) được phát hành với bài báo [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
+1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (từ Microsoft Research Asia) được phát hành với bài báo [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
+1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (từ AllenAI) được phát hành với bài báo [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (từ Meta AI) được phát hành với bài báo [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze.
+1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (từ South China University of Technology) được phát hành với bài báo [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding.
+1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (từ The FAIR team of Meta AI) được phát hành với bài báo [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample.
+1. **[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (từ The FAIR team of Meta AI) được phát hành với bài báo [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom.
+1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (từ Microsoft Research & University of Wisconsin-Madison) được phát hành với bài báo [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.
+1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (từ AllenAI) được phát hành với bài báo [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (từ Google AI) được phát hành với bài báo [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
+1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (từ Studio Ousia) được phát hành với bài báo [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
+1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (từ UNC Chapel Hill) được phát hành với bài báo [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
+1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (từ Facebook) được phát hành với bài báo [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
+1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (từ Facebook) được phát hành với bài báo [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
+1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (từ Google) được phát hành với bài báo [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (từ Microsoft Research Asia) được phát hành với bài báo [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.
+1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (từ FAIR and UIUC) được phát hành với bài báo [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
+1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (từ Meta and UIUC) được phát hành với bài báo [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
+1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (từ Google AI) được phát hành với bài báo [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos.
+1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (từ Facebook) được phát hành với bài báo [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (từ Facebook) được phát hành với bài báo [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
+1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (từ Meta/USC/CMU/SJTU) được phát hành với bài báo [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer.
+1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (từ NVIDIA) được phát hành với bài báo [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (từ NVIDIA) được phát hành với bài báo [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (từ Alibaba Research) được phát hành với bài báo [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
+1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (từ Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
+1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (từ Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
+1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (từ Studio Ousia) được phát hành với bài báo [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
+1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (từ Facebook) được phát hành với bài báo [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
+1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (từ CMU/Google Brain) được phát hành với bài báo [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
+1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (từ Google Inc.) được phát hành với bài báo [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
+1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (từ Google Inc.) được phát hành với bài báo [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.
+1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (từ Apple) được phát hành với bài báo [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
+1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (từ Apple) được phát hành với bài báo [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari.
+1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (từ Microsoft Research) được phát hành với bài báo [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
+1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team.
+1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (từ the University of Wisconsin - Madison) được phát hành với bài báo [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
+1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (từ Google AI) được phát hành với bài báo [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
+1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (từ Meta) được phát hành với bài báo [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (từ RUC AI Box) được phát hành với bài báo [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
+1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (từ SHI Labs) được phát hành với bài báo [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
+1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (từ Huawei Noah’s Ark Lab) được phát hành với bài báo [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
+1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (từ Meta) được phát hành với bài báo [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
+1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (từ Meta) được phát hành với bài báo [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
+1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (từ Meta AI) được phát hành với bài báo [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
+1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (từ the University of Wisconsin - Madison) được phát hành với bài báo [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
+1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (từ SHI Labs) được phát hành với bài báo [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
+1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (từ [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed).
+1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (từ Meta AI) được phát hành với bài báo [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
+1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (từ Google AI) được phát hành với bài báo [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
+1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (từ Google AI) được phát hành với bài báo [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
+1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (từ IBM Research) được phát hành với bài báo [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
+1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (từ IBM) được phát hành với bài báo [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
+1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (từ Google) được phát hành với bài báo [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
+1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (từ Google) được phát hành với bài báo [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu.
+1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (từ Deepmind) được phát hành với bài báo [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
+1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (từ ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani.
+1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, and [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
+1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (từ VinAI Research) được phát hành với bài báo [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
+1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (từ Google) được phát hành với bài báo [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
+1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (từ UCLA NLP) được phát hành với bài báo [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
+1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (từ Sea AI Labs) được phát hành với bài báo [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
+1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** được phát hành với bài báo [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.
+1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (từ Microsoft Research) được phát hành với bài báo [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (từ Nanjing University, The University of Hong Kong etc.) được phát hành với bài báo [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
+1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (từ NVIDIA) được phát hành với bài báo [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
+1. **[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (từ the Qwen team, Alibaba Group) được phát hành với bài báo [Qwen Technical Report](https://arxiv.org/abs/2309.16609) by Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu.
+1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (từ Facebook) được phát hành với bài báo [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
+1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (từ Google Research) được phát hành với bài báo [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
+1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (từ Google Research) được phát hành với bài báo [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (từ META Platforms) được phát hành với bài báo [Designing Network Design Space](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár.
+1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (từ Google Research) được phát hành với bài báo [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
+1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (từ Microsoft Research) được phát hành với bài báo [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
+1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (từ Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (từ Facebook) được phát hành với bài báo [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli.
+1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (từ WeChatAI) được phát hành với bài báo [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
+1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (từ ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
+1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (từ Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (từ Meta AI) được phát hành với bài báo [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
+1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (từ Meta AI) được phát hành với bài báo [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
+1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (từ NVIDIA) được phát hành với bài báo [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (từ Meta AI) được phát hành với bài báo [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
+1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (từ ASAPP) được phát hành với bài báo [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
+1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (từ ASAPP) được phát hành với bài báo [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
+1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (từ Google AI) được phát hành với bài báo [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
+1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (từ Microsoft Research) được phát hành với bài báo [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
+1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (từ Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
+1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (từ Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
+1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (từ Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
+1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (từ Berkeley) được phát hành với bài báo [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
+1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (từ Stability AI) được phát hành với bài báo [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (từ MBZUAI) được phát hành với bài báo [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
+1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (từ Microsoft) được phát hành với bài báo [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
+1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (từ Microsoft) được phát hành với bài báo [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
+1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (từ University of Würzburg) được phát hành với bài báo [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
+1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (từ Google) được phát hành với bài báo [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
+1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (từ Google AI) được phát hành với bài báo [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (từ Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (từ Microsoft Research) được phát hành với bài báo [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
+1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (từ Google AI) được phát hành với bài báo [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
+1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (từ Microsoft Research) được phát hành với bài báo [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
+1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (từ HuggingFace).
+1. **[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (từ Facebook) được phát hành với bài báo [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani.
+1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (từ the University of California at Berkeley) được phát hành với bài báo [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine.
+1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (từ Google/CMU) được phát hành với bài báo [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (từ Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
+1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (từ UNC Chapel Hill) được phát hành với bài báo [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.
+1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (từ Intel) được phát hành với bài báo [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
+1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (từ Google Research) được phát hành với bài báo [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
+1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (từ Google Research) được phát hành với bài báo [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
+1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (từ Microsoft Research) được phát hành với bài báo [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
+1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (từ Microsoft Research) được phát hành với bài báo [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
+1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (từ Kakao Corporation) được phát hành với bài báo [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
+1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (từ Peking University) được phát hành với bài báo [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
+1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (từ Tsinghua University and Nankai University) được phát hành với bài báo [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
+1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (từ Multimedia Computing Group, Nanjing University) được phát hành với bài báo [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
+1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (từ NAVER AI Lab/Kakao Enterprise/Kakao Brain) được phát hành với bài báo [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
+1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (từ University of Wisconsin–Madison) được phát hành với bài báo [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
+1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (từ Google AI) được phát hành với bài báo [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (từ UCLA NLP) được phát hành với bài báo [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
+1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (từ Google AI) được phát hành với bài báo [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (từ Meta AI) được phát hành với bài báo [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
+1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (từ Meta AI) được phát hành với bài báo [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
+1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (từ HUST-VL) được phát hành với bài báo [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
+1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (từ Meta AI) được phát hành với bài báo [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
+1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (từ Kakao Enterprise) được phát hành với bài báo [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
+1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (từ Google Research) được phát hành với bài báo [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
+1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (từ Facebook AI) được phát hành với bài báo [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
+1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (từ Meta AI) được phát hành với bài báo [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
+1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (từ Facebook AI) được phát hành với bài báo [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
+1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (từ Facebook AI) được phát hành với bài báo [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
+1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (từ Microsoft Research) được phát hành với bài báo [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
+1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (từ OpenAI) được phát hành với bài báo [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
+1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (từ Microsoft Research) được phát hành với bài báo [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
+1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (từ Meta AI) được phát hành với bài báo [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe.
+1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (từ Facebook AI) được phát hành với bài báo [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
+1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (từ Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
+1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (từ Microsoft Research) được phát hành với bài báo [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (từ Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
+1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (từ Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
+1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (từ Meta AI) được phát hành với bài báo [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa.
+1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (từ Google/CMU) được phát hành với bài báo [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
+1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (từ Facebook AI) được phát hành với bài báo [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
+1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (từ Facebook AI) được phát hành với bài báo [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
+1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (từ Huazhong University of Science & Technology) được phát hành với bài báo [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
+1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (từ the University of Wisconsin - Madison) được phát hành với bài báo [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.
+1. Muốn đóng góp một mô hình mới? Chúng tôi đã thêm một **hướng dẫn chi tiết và mẫu** để hướng dẫn bạn trong quá trình thêm một mô hình mới. Bạn có thể tìm thấy chúng trong thư mục [`templates`](./templates) của kho lưu trữ. Hãy chắc chắn kiểm tra [hướng dẫn đóng góp](./CONTRIBUTING.md) và liên hệ với người duy trì hoặc mở một vấn đề để thu thập phản hồi trước khi bắt đầu PR của bạn.
+
+Để kiểm tra xem mỗi mô hình có một phiên bản thực hiện trong Flax, PyTorch hoặc TensorFlow, hoặc có một tokenizer liên quan được hỗ trợ bởi thư viện 🤗 Tokenizers, vui lòng tham khảo [bảng này](https://huggingface.co/docs/transformers/index#supported-frameworks).
+
+Những phiên bản này đã được kiểm tra trên một số tập dữ liệu (xem các tập lệnh ví dụ) và nên tương đương với hiệu suất của các phiên bản gốc. Bạn có thể tìm thấy thêm thông tin về hiệu suất trong phần Ví dụ của [tài liệu](https://github.com/huggingface/transformers/tree/main/examples).
+
+
+## Tìm hiểu thêm
+
+| Phần | Mô tả |
+|-|-|
+| [Tài liệu](https://huggingface.co/docs/transformers/) | Toàn bộ tài liệu API và hướng dẫn |
+| [Tóm tắt nhiệm vụ](https://huggingface.co/docs/transformers/task_summary) | Các nhiệm vụ được hỗ trợ bởi 🤗 Transformers |
+| [Hướng dẫn tiền xử lý](https://huggingface.co/docs/transformers/preprocessing) | Sử dụng lớp `Tokenizer` để chuẩn bị dữ liệu cho các mô hình |
+| [Huấn luyện và điều chỉnh](https://huggingface.co/docs/transformers/training) | Sử dụng các mô hình được cung cấp bởi 🤗 Transformers trong vòng lặp huấn luyện PyTorch/TensorFlow và API `Trainer` |
+| [Hướng dẫn nhanh: Điều chỉnh/sử dụng các kịch bản](https://github.com/huggingface/transformers/tree/main/examples) | Các kịch bản ví dụ để điều chỉnh mô hình trên nhiều nhiệm vụ khác nhau |
+| [Chia sẻ và tải lên mô hình](https://huggingface.co/docs/transformers/model_sharing) | Tải lên và chia sẻ các mô hình đã điều chỉnh của bạn với cộng đồng |
+
+## Trích dẫn
+
+Bây giờ chúng ta có một [bài báo](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) mà bạn có thể trích dẫn cho thư viện 🤗 Transformers:
+```bibtex
+@inproceedings{wolf-etal-2020-transformers,
+ title = "Transformers: State-of-the-Art Natural Language Processing",
+ author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
+ booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
+ month = oct,
+ year = "2020",
+ address = "Online",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
+ pages = "38--45"
+}
+```
diff --git a/README_zh-hans.md b/README_zh-hans.md
index 08007a4e110d62..a3394b00a658ea 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -77,6 +77,7 @@ checkpoint: 检查点
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
diff --git a/README_zh-hant.md b/README_zh-hant.md
index 07c3f8a40b92a6..024fecdcc6d6fc 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -89,6 +89,7 @@ user: 使用者
తెలుగు |
Français |
Deutsch |
+ Tiếng Việt |
From a44d2dc3a94cbbb44eccb1a60e1bf4a998b4d2b6 Mon Sep 17 00:00:00 2001
From: Michael
Date: Tue, 27 Feb 2024 00:53:05 +0800
Subject: [PATCH 020/549] [i18n-zh] Translated task/asr.md into Chinese
(#29233)
* [zh] Translate a task: asr.md
Signed-off-by: windsonsea
* apply suggestions from Fan-Lin
---------
Signed-off-by: windsonsea
---
docs/source/zh/_toctree.yml | 5 +
docs/source/zh/tasks/asr.md | 398 ++++++++++++++++++++++++++++++++++++
2 files changed, 403 insertions(+)
create mode 100644 docs/source/zh/tasks/asr.md
diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml
index a92074fde47571..7149e4c2f147da 100644
--- a/docs/source/zh/_toctree.yml
+++ b/docs/source/zh/_toctree.yml
@@ -28,6 +28,11 @@
- local: llm_tutorial
title: 使用LLMs进行生成
title: 教程
+- sections:
+ - isExpanded: false
+ sections:
+ - local: tasks/asr
+ title: 自动语音识别
- sections:
- local: fast_tokenizers
title: 使用 🤗 Tokenizers 中的分词器
diff --git a/docs/source/zh/tasks/asr.md b/docs/source/zh/tasks/asr.md
new file mode 100644
index 00000000000000..91fee0ab332ede
--- /dev/null
+++ b/docs/source/zh/tasks/asr.md
@@ -0,0 +1,398 @@
+
+
+# 自动语音识别
+
+[[open-in-colab]]
+
+
+
+自动语音识别(ASR)将语音信号转换为文本,将一系列音频输入映射到文本输出。
+Siri 和 Alexa 这类虚拟助手使用 ASR 模型在日常生活中为用户提供帮助;此外还有许多其他面向用户的实用应用,如会议实时字幕和会议纪要。
+
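+在开始微调之前,下面是一个最小的示意性示例(假设本地有一个名为 `audio.flac` 的音频文件,并借用公开的 `facebook/wav2vec2-base-960h` 检查点),用来说明 ASR 模型如何把音频映射为文本:
+
+```py
+>>> from transformers import pipeline
+
+>>> # 加载一个已在英语语音上微调过的公开检查点(仅用于演示)
+>>> asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
+>>> asr("audio.flac")  # 返回形如 {'text': "..."} 的字典
+```
+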
+本指南将向您展示如何:
+
+1. 在 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 数据集上对
+ [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) 进行微调,以将音频转录为文本。
+2. 使用微调后的模型进行推断。
+
+
+
+本教程中展示的任务受以下模型架构的支持:
+
+
+
+[Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [M-CTC-T](../model_doc/mctct), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-BERT](../model_doc/wav2vec2-bert), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm)
+
+
+
+
+
+在开始之前,请确保您已安装所有必要的库:
+
+```bash
+pip install transformers datasets evaluate jiwer
+```
+
+我们鼓励您登录自己的 Hugging Face 账户,这样您就可以上传并与社区分享您的模型。
+出现提示时,输入您的令牌登录:
+
+```py
+>>> from huggingface_hub import notebook_login
+
+>>> notebook_login()
+```
+
+## 加载 MInDS-14 数据集
+
+首先从🤗 Datasets 库中加载 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14)
+数据集的一个较小子集。这将让您有机会先进行实验,确保一切正常,然后再花更多时间在完整数据集上进行训练。
+
+```py
+>>> from datasets import load_dataset, Audio
+
+>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
+```
+
+使用 [`~Dataset.train_test_split`] 方法将数据集的 `train` 拆分为训练集和测试集:
+
+```py
+>>> minds = minds.train_test_split(test_size=0.2)
+```
+
+然后看看数据集:
+
+```py
+>>> minds
+DatasetDict({
+ train: Dataset({
+ features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
+ num_rows: 16
+ })
+ test: Dataset({
+ features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
+ num_rows: 4
+ })
+})
+```
+
+虽然数据集包含 `lang_id` 和 `english_transcription` 等许多有用的信息,但在本指南中,
+您将专注于 `audio` 和 `transcription`。使用 [`~datasets.Dataset.remove_columns`] 方法删除其他列:
+
+```py
+>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
+```
+
+再看看示例:
+
+```py
+>>> minds["train"][0]
+{'audio': {'array': array([-0.00024414, 0. , 0. , ..., 0.00024414,
+ 0.00024414, 0.00024414], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
+ 'sampling_rate': 8000},
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
+ 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
+```
+
+有 2 个字段:
+
+- `audio`:由语音信号形成的一维 `array`,用于加载和重新采样音频文件。
+- `transcription`:目标文本。
+
+## 预处理
+
+下一步是加载一个 Wav2Vec2 处理器来处理音频信号:
+
+```py
+>>> from transformers import AutoProcessor
+
+>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
+```
+
+MInDS-14 数据集的采样率为 8000 Hz(即 8 kHz,您可以在其[数据集卡片](https://huggingface.co/datasets/PolyAI/minds14)中找到此信息),
+这意味着您需要将数据集重新采样为 16000 Hz(16 kHz)才能使用预训练的 Wav2Vec2 模型:
+
+```py
+>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
+>>> minds["train"][0]
+{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
+ 2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
+ 'sampling_rate': 16000},
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
+ 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
+```
+
+如您在上面的 `transcription` 中所看到的,文本包含大小写字符的混合。
+Wav2Vec2 分词器仅基于大写字符训练,因此您需要确保文本与分词器的词汇表匹配:
+
+```py
+>>> def uppercase(example):
+... return {"transcription": example["transcription"].upper()}
+
+
+>>> minds = minds.map(uppercase)
+```
+
+现在创建一个预处理函数,该函数应该:
+
+1. 调用 `audio` 列以加载和重新采样音频文件。
+2. 从音频文件中提取 `input_values` 并使用处理器对 `transcription` 列执行 tokenizer 操作。
+
+```py
+>>> def prepare_dataset(batch):
+... audio = batch["audio"]
+... batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
+... batch["input_length"] = len(batch["input_values"][0])
+... return batch
+```
+
+要在整个数据集上应用预处理函数,可以使用🤗 Datasets 的 [`~datasets.Dataset.map`] 函数。
+您可以通过增加 `num_proc` 参数来提高 `map` 使用的处理进程数量,从而加快处理速度。
+使用 [`~datasets.Dataset.remove_columns`] 方法删除不需要的列:
+
+```py
+>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
+```
+
+🤗 Transformers 没有用于 ASR 的数据整理器,因此您需要调整 [`DataCollatorWithPadding`] 来创建一个示例批次。
+它还会动态地将您的文本和标签填充到其批次中最长元素的长度(而不是整个数据集),以使它们具有统一的长度。
+虽然可以通过在 `tokenizer` 函数中设置 `padding=True` 来填充文本,但动态填充更有效。
+
+与其他数据整理器不同,这个特定的数据整理器需要对 `input_values` 和 `labels` 应用不同的填充方法:
+
+```py
+>>> import torch
+
+>>> from dataclasses import dataclass, field
+>>> from typing import Any, Dict, List, Optional, Union
+
+
+>>> @dataclass
+... class DataCollatorCTCWithPadding:
+... processor: AutoProcessor
+... padding: Union[bool, str] = "longest"
+
+... def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
+... # split inputs and labels since they have to be of different lengths and need
+... # different padding methods
+... input_features = [{"input_values": feature["input_values"][0]} for feature in features]
+... label_features = [{"input_ids": feature["labels"]} for feature in features]
+
+... batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
+
+... labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")
+
+... # replace padding with -100 to ignore loss correctly
+... labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
+
+... batch["labels"] = labels
+
+... return batch
+```
+
+现在实例化您的 `DataCollatorCTCWithPadding`:
+
+```py
+>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
+```
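+
+作为一个快速的健全性检查(仅作示意,实际的键名与形状取决于您的数据和处理器配置),可以把几条预处理后的样本交给数据整理器,确认它返回的是已填充的张量批次:
+
+```py
+>>> features = [encoded_minds["train"][i] for i in range(2)]
+>>> batch = data_collator(features)
+>>> {k: v.shape for k, v in batch.items()}  # 例如 {'input_values': torch.Size([2, ...]), 'labels': torch.Size([2, ...])}
+```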
+
+## 评估
+
+在训练过程中包含一个指标通常有助于评估模型的性能。
+您可以通过🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 库快速加载一个评估方法。
+对于这个任务,加载 [word error rate](https://huggingface.co/spaces/evaluate-metric/wer)(WER)指标
+(请参阅🤗 Evaluate [快速上手](https://huggingface.co/docs/evaluate/a_quick_tour)以了解如何加载和计算指标):
+
+```py
+>>> import evaluate
+
+>>> wer = evaluate.load("wer")
+```
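+
+WER 衡量预测文本相对参考文本的词级替换、插入和删除错误比例,数值越低越好。下面用一个简单的例子说明它的用法(数值仅用于说明):
+
+```py
+>>> # 两个参考词中有一个被替换为其他词,因此 WER 为 0.5
+>>> wer.compute(predictions=["hello world"], references=["hello there"])
+0.5
+```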
+
+然后创建一个函数,将您的预测和标签传递给 [`~evaluate.EvaluationModule.compute`] 来计算 WER:
+
+```py
+>>> import numpy as np
+
+
+>>> def compute_metrics(pred):
+... pred_logits = pred.predictions
+... pred_ids = np.argmax(pred_logits, axis=-1)
+
+... pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
+
+... pred_str = processor.batch_decode(pred_ids)
+... label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
+
+...     # assign to a new name so the global `wer` metric loaded above is not shadowed
+...     wer_score = wer.compute(predictions=pred_str, references=label_str)
+
+...     return {"wer": wer_score}
+```
+
+您的 `compute_metrics` 函数现在已准备就绪,在设置训练时会再次用到它。
+
+## 训练
+
+
+
+
+
+如果您不熟悉使用 [`Trainer`] 微调模型,请先查看[这里](../training#train-with-pytorch-trainer)的基本教程!
+
+
+
+现在您已经准备好开始训练您的模型了!使用 [`AutoModelForCTC`] 加载 Wav2Vec2。
+使用 `ctc_loss_reduction` 参数指定对 CTC 损失应用的归约(reduction)方式。通常使用平均值(mean)比默认的求和(sum)效果更好:
+
+```py
+>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer
+
+>>> model = AutoModelForCTC.from_pretrained(
+... "facebook/wav2vec2-base",
+... ctc_loss_reduction="mean",
+... pad_token_id=processor.tokenizer.pad_token_id,
+... )
+```
+
+此时,只剩下 3 个步骤:
+
+1. 在 [`TrainingArguments`] 中定义您的训练参数。唯一必需的参数是 `output_dir`,用于指定保存模型的位置。
+ 您可以通过设置 `push_to_hub=True` 将此模型推送到 Hub(您需要登录到 Hugging Face 才能上传您的模型)。
+   根据下面 `eval_steps` 的设置,[`Trainer`] 会定期评估 WER 并保存训练检查点。
+2. 将训练参数与模型、数据集、分词器、数据整理器和 `compute_metrics` 函数一起传递给 [`Trainer`]。
+3. 调用 [`~Trainer.train`] 来微调您的模型。
+
+```py
+>>> training_args = TrainingArguments(
+... output_dir="my_awesome_asr_mind_model",
+... per_device_train_batch_size=8,
+... gradient_accumulation_steps=2,
+... learning_rate=1e-5,
+... warmup_steps=500,
+... max_steps=2000,
+... gradient_checkpointing=True,
+... fp16=True,
+... group_by_length=True,
+... evaluation_strategy="steps",
+... per_device_eval_batch_size=8,
+... save_steps=1000,
+... eval_steps=1000,
+... logging_steps=25,
+... load_best_model_at_end=True,
+... metric_for_best_model="wer",
+... greater_is_better=False,
+... push_to_hub=True,
+... )
+
+>>> trainer = Trainer(
+... model=model,
+... args=training_args,
+... train_dataset=encoded_minds["train"],
+... eval_dataset=encoded_minds["test"],
+... tokenizer=processor,
+... data_collator=data_collator,
+... compute_metrics=compute_metrics,
+... )
+
+>>> trainer.train()
+```
+
+训练完成后,使用 [`~transformers.Trainer.push_to_hub`] 方法将您的模型分享到 Hub,方便大家使用您的模型:
+
+```py
+>>> trainer.push_to_hub()
+```
+
+
+
+
+
+要深入了解如何微调模型进行自动语音识别,
+请查看这篇博客[文章](https://huggingface.co/blog/fine-tune-wav2vec2-english)以了解英语 ASR,
+还可以参阅[这篇文章](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)以了解多语言 ASR。
+
+
+
+## 推断
+
+很好,现在您已经微调了一个模型,您可以用它进行推断了!
+
+加载您想要运行推断的音频文件。请记住,如果需要,将音频文件的采样率重新采样为与模型匹配的采样率!
+
+```py
+>>> from datasets import load_dataset, Audio
+
+>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
+>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
+>>> sampling_rate = dataset.features["audio"].sampling_rate
+>>> audio_file = dataset[0]["audio"]["path"]
+```
+
+尝试使用微调后的模型进行推断的最简单方法是使用 [`pipeline`]。
+使用您的模型实例化一个用于自动语音识别的 `pipeline`,并将您的音频文件传递给它:
+
+```py
+>>> from transformers import pipeline
+
+>>> transcriber = pipeline("automatic-speech-recognition", model="stevhliu/my_awesome_asr_minds_model")
+>>> transcriber(audio_file)
+{'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'}
+```
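+
+顺带一提,如果音频比模型一次能处理的长度更长,`pipeline` 还支持按块转录(以下参数取值仅作示意,可按需调整):
+
+```py
+>>> # chunk_length_s 指定每个音频块的时长(秒),stride_length_s 指定块之间的重叠时长
+>>> transcriber(audio_file, chunk_length_s=10, stride_length_s=2)
+```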
+
+
+
+转录结果还不错,但可以更好!尝试用更多示例微调您的模型,以获得更好的结果!
+
+
+
+如果您愿意,您也可以手动复制 `pipeline` 的结果:
+
+
+
+
+加载一个处理器来预处理音频文件和转录文本,并将输入以 PyTorch 张量的形式返回:
+
+```py
+>>> from transformers import AutoProcessor
+
+>>> processor = AutoProcessor.from_pretrained("stevhliu/my_awesome_asr_mind_model")
+>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+```
+
+将您的输入传递给模型并返回 logits:
+
+```py
+>>> from transformers import AutoModelForCTC
+
+>>> model = AutoModelForCTC.from_pretrained("stevhliu/my_awesome_asr_mind_model")
+>>> with torch.no_grad():
+... logits = model(**inputs).logits
+```
+
+获取具有最高概率的预测 `input_ids`,并使用处理器将预测的 `input_ids` 解码回文本:
+
+```py
+>>> import torch
+
+>>> predicted_ids = torch.argmax(logits, dim=-1)
+>>> transcription = processor.batch_decode(predicted_ids)
+>>> transcription
+['I WOUL LIKE O SET UP JOINT ACOUNT WTH Y PARTNER']
+```
+
+
\ No newline at end of file
From 3b8c053631a2088d74fbb6ef4db47dbed8fa1470 Mon Sep 17 00:00:00 2001
From: Eduardo Pacheco <69953243+EduardoPach@users.noreply.github.com>
Date: Mon, 26 Feb 2024 18:24:30 +0100
Subject: [PATCH 021/549] Fixed Deformable Detr typo when loading cuda kernels
for MSDA (#29294)
---
.../models/deformable_detr/modeling_deformable_detr.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/models/deformable_detr/modeling_deformable_detr.py b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
index 640c05257cc967..e9252167e7b4b1 100755
--- a/src/transformers/models/deformable_detr/modeling_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
@@ -60,7 +60,7 @@ def load_cuda_kernels():
global MultiScaleDeformableAttention
- root = Path(__file__).resolve().parent.parent.parent / "kernels" / "deta"
+ root = Path(__file__).resolve().parent.parent.parent / "kernels" / "deformable_detr"
src_files = [
root / filename
for filename in [
From 3fcfbe7549d9694f96e1f19630add4adf99dd421 Mon Sep 17 00:00:00 2001
From: Eduardo Pacheco <69953243+EduardoPach@users.noreply.github.com>
Date: Mon, 26 Feb 2024 19:17:19 +0100
Subject: [PATCH 022/549] Adding SegGPT (#27735)
* First commit
* Improvements
* More improvements
* Converted original checkpoint to HF checkpoint
* Fix style
* Fixed forward
* More improvements
* More improvements
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Remove asserts
* Remove unnecessary attributes
* Changed model name to camel case
* Improve forward doc
* Improve tests
* More improvements
* Fix copies
* Fix doc
* Make SegGptImageProcessor more flexible
* Added few-shot test
* Fix style
* Update READMEs and docs
* Update READMEs
* Make inputs required
* Add SegGptForImageSegmentation
* Make tests pass
* Rename to out_indicies
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Fixed naming convention
* Copying SegGptMlp from modeling_sam.py
* Some minor improvements
* Remove mlp_ratio
* Fix docstrings
* Fixed docstring match
* Objects defined before use
* Storing only patch_size and beta for SegGptLoss
* removed _prepare_inputs method
* Removed modified from headers
* Renamed to output_indicies
* Removed unnecessary einsums
* Update tests/models/seggpt/test_modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/seggpt/test_modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/seggpt/test_modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fixing issues
* Raise error as soon as possible
* More fixes
* Fix merge
* Added palette to SegGptImageProcessor
* Fixed typo
* Fixed shape typo
* Added permute before doing palette to class mapping
* Fixed style
* Fixed and added tests
* Fixed docstrings
* Matching SegFormer API for post_processing_semantic_segmentation
* Fixed copies
* Fixed SegGptImageProcessor to handle both binary and RGB masks
* Updated docstrings of SegGptImageProcessor
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update docs/source/en/model_doc/seggpt.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/configuration_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/convert_seggpt_to_hf.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/seggpt/test_image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/seggpt/test_modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Object definitions above & fix style
* Renamed output_indices to intermediate_feature_indices
* Removed unnecessary check on bool_masked_pos
* Loss first in the outputs
* Added validation for do_normalize
* Improved SegGptImageProcessor and added new tests
* Added comment
* Added docstrings to SegGptLoss
* Reimplemented ensemble condition logic in SegGptEncoder
* Update src/transformers/models/seggpt/__init__.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Update src/transformers/models/seggpt/convert_seggpt_to_hf.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Update src/transformers/models/seggpt/configuration_seggpt.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Updated docstrings to use post_process_semantic_segmentation
* Fixed typo on docstrings
* moved pixel values test to test_image_processing_seggpt
* Addressed comments
* Update src/transformers/models/seggpt/configuration_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/image_processing_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/configuration_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Updated docstrings for SegGptLoss
* Address comments
* Added SegGpt example to model docs
* Update src/transformers/models/seggpt/modeling_seggpt.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* moved patchify and unpatchify
* Rename checkpoint
* Renamed intermediate_features to intermediate_hidden_states for consistency
* Update src/transformers/models/seggpt/configuration_seggpt.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Replaced post_process_masks for post_process_semantic_segmentation in the docs
---------
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Niels
Co-authored-by: Eduardo Pacheco
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---
README.md | 1 +
README_es.md | 1 +
README_fr.md | 1 +
README_hd.md | 1 +
README_ja.md | 1 +
README_ko.md | 1 +
README_zh-hans.md | 1 +
README_zh-hant.md | 1 +
docs/source/en/_toctree.yml | 2 +
docs/source/en/index.md | 1 +
docs/source/en/model_doc/seggpt.md | 90 ++
src/transformers/__init__.py | 23 +-
src/transformers/models/__init__.py | 1 +
.../models/auto/configuration_auto.py | 3 +
.../models/auto/image_processing_auto.py | 1 +
src/transformers/models/auto/modeling_auto.py | 1 +
src/transformers/models/seggpt/__init__.py | 71 ++
.../models/seggpt/configuration_seggpt.py | 145 +++
.../models/seggpt/convert_seggpt_to_hf.py | 222 ++++
.../models/seggpt/image_processing_seggpt.py | 626 ++++++++++
.../models/seggpt/modeling_seggpt.py | 1014 +++++++++++++++++
src/transformers/utils/dummy_pt_objects.py | 24 +
.../utils/dummy_vision_objects.py | 7 +
tests/models/seggpt/__init__.py | 0
.../seggpt/test_image_processing_seggpt.py | 231 ++++
tests/models/seggpt/test_modeling_seggpt.py | 339 ++++++
tests/test_modeling_common.py | 10 +
utils/check_repo.py | 1 +
28 files changed, 2816 insertions(+), 4 deletions(-)
create mode 100644 docs/source/en/model_doc/seggpt.md
create mode 100644 src/transformers/models/seggpt/__init__.py
create mode 100644 src/transformers/models/seggpt/configuration_seggpt.py
create mode 100644 src/transformers/models/seggpt/convert_seggpt_to_hf.py
create mode 100644 src/transformers/models/seggpt/image_processing_seggpt.py
create mode 100644 src/transformers/models/seggpt/modeling_seggpt.py
create mode 100644 tests/models/seggpt/__init__.py
create mode 100644 tests/models/seggpt/test_image_processing_seggpt.py
create mode 100644 tests/models/seggpt/test_modeling_seggpt.py
diff --git a/README.md b/README.md
index 8b688d8446e64e..8d9dc398573c9c 100644
--- a/README.md
+++ b/README.md
@@ -482,6 +482,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (from Beijing Academy of Artificial Intelligence (BAAI)) released with the paper [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
diff --git a/README_es.md b/README_es.md
index cebe43cb91ec7d..e8b85812f73eb4 100644
--- a/README_es.md
+++ b/README_es.md
@@ -455,6 +455,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (from Beijing Academy of Artificial Intelligence (BAAI)) released with the paper [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
diff --git a/README_fr.md b/README_fr.md
index 39bd0f8df05c4d..9ff23f6025b226 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -476,6 +476,7 @@ Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=h
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (de Meta AI) a été publié dans l'article [SeamlessM4T — Traduction multimodale et massivement multilingue](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) par l'équipe de communication transparente.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (de Meta AI) a été publié dans l'article [Seamless: Traduction de la parole multilingue, expressive et en continu](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) par l'équipe de communication transparente.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (de NVIDIA) a été publié dans l'article [SegFormer : Conception simple et efficace pour la segmentation sémantique avec des transformateurs](https://arxiv.org/abs/2105.15203) par Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (de Beijing Academy of Artificial Intelligence (BAAI)) publié dans l'article [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) par Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (de Meta AI) a été publié dans l'article [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) par Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (de ASAPP) a été publié dans l'article [Compromis entre performances et efficacité dans l'entraînement non supervisé pour la reconnaissance vocale](https://arxiv.org/abs/2109.06870) par Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (de ASAPP) a été publié dans l'article [Compromis entre performances et efficacité dans l'entraînement non supervisé pour la reconnaissance vocale](https://arxiv.org/abs/2109.06870) par Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
diff --git a/README_hd.md b/README_hd.md
index fee9a2c44bb1f0..081d2d3e206484 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -429,6 +429,7 @@ conda install conda-forge::transformers
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (Beijing Academy of Artificial Intelligence (BAAI) से) Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. द्वाराअनुसंधान पत्र [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) के साथ जारी किया गया
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI से) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. द्वाराअनुसंधान पत्र [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) के साथ जारी किया गया
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP से) साथ देने वाला पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योव आर्टज़ी द्वारा।
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP से) साथ में पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योआव आर्टज़ी द्वारा पोस्ट किया गया।
diff --git a/README_ja.md b/README_ja.md
index b350abb6eaa6af..69e8a05fe5d4bb 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -489,6 +489,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA から) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo から公開された研究論文: [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203)
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (Beijing Academy of Artificial Intelligence (BAAI) から) Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. から公開された研究論文 [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284)
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI から) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. から公開された研究論文 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf)
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870)
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870)
diff --git a/README_ko.md b/README_ko.md
index 4f714eaafbcf4c..daa13f8635a907 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -404,6 +404,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA 에서) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 의 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 논문과 함께 발표했습니다.
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (Beijing Academy of Artificial Intelligence (BAAI) 에서 제공)은 Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.의 [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284)논문과 함께 발표했습니다.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI 에서 제공)은 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.의 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf)논문과 함께 발표했습니다.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index a3394b00a658ea..8cd63a9c91c14c 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -428,6 +428,7 @@ conda install conda-forge::transformers
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (来自 NVIDIA) 伴随论文 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 由 Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 发布。
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (来自 Beijing Academy of Artificial Intelligence (BAAI)) 伴随论文 [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) 由 Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang 发布。
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (来自 Meta AI) 伴随论文 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) 由 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick 发布。
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index 024fecdcc6d6fc..ce345a702656b1 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -440,6 +440,7 @@ conda install conda-forge::transformers
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SegGPT](https://huggingface.co/docs/transformers/main/model_doc/seggpt)** (from Beijing Academy of Artificial Intelligence (BAAI)) released with the paper [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 18dad03d9b1b1d..976a104294c9c9 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -583,6 +583,8 @@
title: ResNet
- local: model_doc/segformer
title: SegFormer
+ - local: model_doc/seggpt
+ title: SegGpt
- local: model_doc/swiftformer
title: SwiftFormer
- local: model_doc/swin
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index d6b46ace97e120..ae5e21d3b59a56 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -251,6 +251,7 @@ Flax), PyTorch, and/or TensorFlow.
| [SeamlessM4T](model_doc/seamless_m4t) | ✅ | ❌ | ❌ |
| [SeamlessM4Tv2](model_doc/seamless_m4t_v2) | ✅ | ❌ | ❌ |
| [SegFormer](model_doc/segformer) | ✅ | ✅ | ❌ |
+| [SegGPT](model_doc/seggpt) | ✅ | ❌ | ❌ |
| [SEW](model_doc/sew) | ✅ | ❌ | ❌ |
| [SEW-D](model_doc/sew-d) | ✅ | ❌ | ❌ |
| [SigLIP](model_doc/siglip) | ✅ | ❌ | ❌ |
diff --git a/docs/source/en/model_doc/seggpt.md b/docs/source/en/model_doc/seggpt.md
new file mode 100644
index 00000000000000..a7f41630e408bc
--- /dev/null
+++ b/docs/source/en/model_doc/seggpt.md
@@ -0,0 +1,90 @@
+
+
+# SegGPT
+
+## Overview
+
+The SegGPT model was proposed in [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. SegGPT employs a decoder-only Transformer that can generate a segmentation mask given an input image, a prompt image and its corresponding prompt mask. The model achieves remarkable one-shot results with 56.1 mIoU on COCO-20 and 85.6 mIoU on FSS-1000.
+
+The abstract from the paper is the following:
+
+*We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of*
+
+Tips:
+- One can use [`SegGptImageProcessor`] to prepare the input image, the prompt image and the prompt mask for the model.
+- It is highly advisable to pass `num_labels` (not counting the background) during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case.
+- When doing inference with [`SegGptForImageSegmentation`], if your `batch_size` is greater than 1 you can use feature ensemble across your images by passing `feature_ensemble=True` in the forward method (see the sketch after the example below).
+
+Here's how to use the model for one-shot semantic segmentation:
+
+```python
+import torch
+from datasets import load_dataset
+from transformers import SegGptImageProcessor, SegGptForImageSegmentation
+
+model_id = "BAAI/seggpt-vit-large"
+image_processor = SegGptImageProcessor.from_pretrained(model_id)
+model = SegGptForImageSegmentation.from_pretrained(model_id)
+
+dataset_id = "EduardoPacheco/FoodSeg103"
+ds = load_dataset(dataset_id, split="train")
+# Number of labels in FoodSeg103 (not including background)
+num_labels = 103
+
+image_input = ds[4]["image"]
+ground_truth = ds[4]["label"]
+image_prompt = ds[29]["image"]
+mask_prompt = ds[29]["label"]
+
+inputs = image_processor(
+ images=image_input,
+ prompt_images=image_prompt,
+ prompt_masks=mask_prompt,
+ num_labels=num_labels,
+ return_tensors="pt"
+)
+
+with torch.no_grad():
+ outputs = model(**inputs)
+
+target_sizes = [image_input.size[::-1]]
+mask = image_processor.post_process_semantic_segmentation(outputs, target_sizes, num_labels=num_labels)[0]
+```
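+
+If several input images share the same prompt, the tips above mention that feature ensemble can be enabled by passing `feature_ensemble=True` to the forward method. Below is a minimal, illustrative sketch (not an official recipe) that reuses `image_processor`, `model`, the prompt and `num_labels` from the example above; the second query image `other_image` and the index `ds[5]` are purely hypothetical, and the sketch assumes the processor accepts lists for batched inputs:
+
+```python
+# Illustrative sketch: batched inference with feature ensemble,
+# reusing the objects defined in the previous example.
+other_image = ds[5]["image"]  # hypothetical second query image
+
+inputs = image_processor(
+ images=[image_input, other_image],
+ prompt_images=[image_prompt, image_prompt],
+ prompt_masks=[mask_prompt, mask_prompt],
+ num_labels=num_labels,
+ return_tensors="pt"
+)
+
+with torch.no_grad():
+ outputs = model(**inputs, feature_ensemble=True)
+
+target_sizes = [image_input.size[::-1], other_image.size[::-1]]
+masks = image_processor.post_process_semantic_segmentation(outputs, target_sizes, num_labels=num_labels)
+```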
+
+This model was contributed by [EduardoPacheco](https://huggingface.co/EduardoPacheco).
+The original code can be found [here](https://github.com/baaivision/Painter/tree/main).
+
+
+## SegGptConfig
+
+[[autodoc]] SegGptConfig
+
+## SegGptImageProcessor
+
+[[autodoc]] SegGptImageProcessor
+ - preprocess
+ - post_process_semantic_segmentation
+
+## SegGptModel
+
+[[autodoc]] SegGptModel
+ - forward
+
+## SegGptForImageSegmentation
+
+[[autodoc]] SegGptForImageSegmentation
+ - forward
\ No newline at end of file
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index f427c4be7b3c76..bc1be5842d0260 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -767,6 +767,7 @@
"SeamlessM4Tv2Config",
],
"models.segformer": ["SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegformerConfig"],
+ "models.seggpt": ["SEGGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegGptConfig"],
"models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
"models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
"models.siglip": [
@@ -1316,6 +1317,7 @@
_import_structure["models.pvt"].extend(["PvtImageProcessor"])
_import_structure["models.sam"].extend(["SamImageProcessor"])
_import_structure["models.segformer"].extend(["SegformerFeatureExtractor", "SegformerImageProcessor"])
+ _import_structure["models.seggpt"].extend(["SegGptImageProcessor"])
_import_structure["models.siglip"].append("SiglipImageProcessor")
_import_structure["models.swin2sr"].append("Swin2SRImageProcessor")
_import_structure["models.tvlt"].append("TvltImageProcessor")
@@ -3192,6 +3194,14 @@
"SegformerPreTrainedModel",
]
)
+ _import_structure["models.seggpt"].extend(
+ [
+ "SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST",
+ "SegGptForImageSegmentation",
+ "SegGptModel",
+ "SegGptPreTrainedModel",
+ ]
+ )
_import_structure["models.sew"].extend(
[
"SEW_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -5531,10 +5541,8 @@
SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP,
SeamlessM4Tv2Config,
)
- from .models.segformer import (
- SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
- SegformerConfig,
- )
+ from .models.segformer import SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, SegformerConfig
+ from .models.seggpt import SEGGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, SegGptConfig
from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
from .models.siglip import (
@@ -6080,6 +6088,7 @@
from .models.pvt import PvtImageProcessor
from .models.sam import SamImageProcessor
from .models.segformer import SegformerFeatureExtractor, SegformerImageProcessor
+ from .models.seggpt import SegGptImageProcessor
from .models.siglip import SiglipImageProcessor
from .models.swin2sr import Swin2SRImageProcessor
from .models.tvlt import TvltImageProcessor
@@ -7635,6 +7644,12 @@
SegformerModel,
SegformerPreTrainedModel,
)
+ from .models.seggpt import (
+ SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST,
+ SegGptForImageSegmentation,
+ SegGptModel,
+ SegGptPreTrainedModel,
+ )
from .models.sew import (
SEW_PRETRAINED_MODEL_ARCHIVE_LIST,
SEWForCTC,
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index 5d59756f91ac1b..df5496f09d01d7 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -194,6 +194,7 @@
seamless_m4t,
seamless_m4t_v2,
segformer,
+ seggpt,
sew,
sew_d,
siglip,
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index 282007836a06f2..ab24b8a332662f 100755
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -202,6 +202,7 @@
("seamless_m4t", "SeamlessM4TConfig"),
("seamless_m4t_v2", "SeamlessM4Tv2Config"),
("segformer", "SegformerConfig"),
+ ("seggpt", "SegGptConfig"),
("sew", "SEWConfig"),
("sew-d", "SEWDConfig"),
("siglip", "SiglipConfig"),
@@ -428,6 +429,7 @@
("seamless_m4t", "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("seamless_m4t_v2", "SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+ ("seggpt", "SEGGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("siglip", "SIGLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -680,6 +682,7 @@
("seamless_m4t", "SeamlessM4T"),
("seamless_m4t_v2", "SeamlessM4Tv2"),
("segformer", "SegFormer"),
+ ("seggpt", "SegGPT"),
("sew", "SEW"),
("sew-d", "SEW-D"),
("siglip", "SigLIP"),
diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py
index c9cd6fca69d661..aef894a425bae1 100644
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -98,6 +98,7 @@
("resnet", "ConvNextImageProcessor"),
("sam", "SamImageProcessor"),
("segformer", "SegformerImageProcessor"),
+ ("seggpt", "SegGptImageProcessor"),
("siglip", "SiglipImageProcessor"),
("swiftformer", "ViTImageProcessor"),
("swin", "ViTImageProcessor"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 50534c58e8aaf4..9a2aaaca01dbc5 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -193,6 +193,7 @@
("seamless_m4t", "SeamlessM4TModel"),
("seamless_m4t_v2", "SeamlessM4Tv2Model"),
("segformer", "SegformerModel"),
+ ("seggpt", "SegGptModel"),
("sew", "SEWModel"),
("sew-d", "SEWDModel"),
("siglip", "SiglipModel"),
diff --git a/src/transformers/models/seggpt/__init__.py b/src/transformers/models/seggpt/__init__.py
new file mode 100644
index 00000000000000..49649c92865da6
--- /dev/null
+++ b/src/transformers/models/seggpt/__init__.py
@@ -0,0 +1,71 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+
+
+_import_structure = {
+ "configuration_seggpt": ["SEGGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegGptConfig", "SegGptOnnxConfig"]
+}
+
+try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["modeling_seggpt"] = [
+ "SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST",
+ "SegGptModel",
+ "SegGptPreTrainedModel",
+ "SegGptForImageSegmentation",
+ ]
+
+try:
+ if not is_vision_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["image_processing_seggpt"] = ["SegGptImageProcessor"]
+
+if TYPE_CHECKING:
+ from .configuration_seggpt import SEGGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, SegGptConfig
+
+ try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .modeling_seggpt import (
+ SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST,
+ SegGptForImageSegmentation,
+ SegGptModel,
+ SegGptPreTrainedModel,
+ )
+
+ try:
+ if not is_vision_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .image_processing_seggpt import SegGptImageProcessor
+
+else:
+ import sys
+
+ sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/seggpt/configuration_seggpt.py b/src/transformers/models/seggpt/configuration_seggpt.py
new file mode 100644
index 00000000000000..37c81f10323a2f
--- /dev/null
+++ b/src/transformers/models/seggpt/configuration_seggpt.py
@@ -0,0 +1,145 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" SegGpt model configuration"""
+
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+SEGGPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+ "BAAI/seggpt-vit-large": "https://huggingface.co/BAAI/seggpt-vit-large/resolve/main/config.json",
+}
+
+
+class SegGptConfig(PretrainedConfig):
+ r"""
+ This is the configuration class to store the configuration of a [`SegGptModel`]. It is used to instantiate a SegGPT
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+ defaults will yield a similar configuration to that of the SegGPT
+ [BAAI/seggpt-vit-large](https://huggingface.co/BAAI/seggpt-vit-large) architecture.
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+ Args:
+ hidden_size (`int`, *optional*, defaults to 1024):
+ Dimensionality of the encoder layers and the pooler layer.
+ num_hidden_layers (`int`, *optional*, defaults to 24):
+ Number of hidden layers in the Transformer encoder.
+ num_attention_heads (`int`, *optional*, defaults to 16):
+ Number of attention heads for each attention layer in the Transformer encoder.
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+ `"relu"`, `"selu"` and `"gelu_new"` are supported.
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+ initializer_range (`float`, *optional*, defaults to 0.02):
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+ layer_norm_eps (`float`, *optional*, defaults to 1e-06):
+ The epsilon used by the layer normalization layers.
+ image_size (`List[int]`, *optional*, defaults to `[896, 448]`):
+ The size (resolution) of each image.
+ patch_size (`int`, *optional*, defaults to 16):
+ The size (resolution) of each patch.
+ num_channels (`int`, *optional*, defaults to 3):
+ The number of input channels.
+ qkv_bias (`bool`, *optional*, defaults to `True`):
+ Whether to add a bias to the queries, keys and values.
+ mlp_dim (`int`, *optional*):
+ The dimensionality of the MLP layer in the Transformer encoder. If unset, defaults to
+ `hidden_size` * 4.
+ drop_path_rate (`float`, *optional*, defaults to 0.1):
+ The drop path rate for the dropout layers.
+ pretrain_image_size (`int`, *optional*, defaults to 224):
+ The pretrained size of the absolute position embeddings.
+ decoder_hidden_size (`int`, *optional*, defaults to 64):
+ Hidden size for decoder.
+ use_relative_position_embeddings (`bool`, *optional*, defaults to `True`):
+ Whether to use relative position embeddings in the attention layers.
+ merge_index (`int`, *optional*, defaults to 2):
+ The index of the encoder layer to merge the embeddings.
+ intermediate_hidden_state_indices (`List[int]`, *optional*, defaults to `[5, 11, 17, 23]`):
+ The indices of the encoder layers which we store as features for the decoder.
+ beta (`float`, *optional*, defaults to 0.01):
+ Regularization factor for SegGptLoss (smooth-l1 loss).
+
+ Example:
+
+ ```python
+ >>> from transformers import SegGptConfig, SegGptModel
+
+ >>> # Initializing a SegGPT seggpt-vit-large style configuration
+ >>> configuration = SegGptConfig()
+
+ >>> # Initializing a model (with random weights) from the seggpt-vit-large style configuration
+ >>> model = SegGptModel(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```"""
+
+ model_type = "seggpt"
+
+ def __init__(
+ self,
+ hidden_size=1024,
+ num_hidden_layers=24,
+ num_attention_heads=16,
+ hidden_act="gelu",
+ hidden_dropout_prob=0.0,
+ initializer_range=0.02,
+ layer_norm_eps=1e-6,
+ image_size=[896, 448],
+ patch_size=16,
+ num_channels=3,
+ qkv_bias=True,
+ mlp_dim=None,
+ drop_path_rate=0.1,
+ pretrain_image_size=224,
+ decoder_hidden_size=64,
+ use_relative_position_embeddings=True,
+ merge_index=2,
+ intermediate_hidden_state_indices=[5, 11, 17, 23],
+ beta=0.01,
+ **kwargs,
+ ):
+ super().__init__(**kwargs)
+
+ if merge_index > min(intermediate_hidden_state_indices):
+ raise ValueError(
+ f"Merge index must be less than the minimum encoder output index, but got {merge_index=} and {intermediate_hidden_state_indices=}"
+ )
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.hidden_act = hidden_act
+ self.hidden_dropout_prob = hidden_dropout_prob
+ self.initializer_range = initializer_range
+ self.layer_norm_eps = layer_norm_eps
+ self.image_size = image_size
+ self.patch_size = patch_size
+ self.num_channels = num_channels
+ self.qkv_bias = qkv_bias
+ self.drop_path_rate = drop_path_rate
+ self.pretrain_image_size = pretrain_image_size
+ self.decoder_hidden_size = decoder_hidden_size
+ self.use_relative_position_embeddings = use_relative_position_embeddings
+ self.merge_index = merge_index
+ self.intermediate_hidden_state_indices = intermediate_hidden_state_indices
+ self.beta = beta
+ self.mlp_dim = int(hidden_size * 4) if mlp_dim is None else mlp_dim
diff --git a/src/transformers/models/seggpt/convert_seggpt_to_hf.py b/src/transformers/models/seggpt/convert_seggpt_to_hf.py
new file mode 100644
index 00000000000000..a13372dfbb1db1
--- /dev/null
+++ b/src/transformers/models/seggpt/convert_seggpt_to_hf.py
@@ -0,0 +1,222 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert SegGPT checkpoints from the original repository.
+
+URL: https://github.com/baaivision/Painter/tree/main/SegGPT
+"""
+
+
+import argparse
+
+import requests
+import torch
+from PIL import Image
+
+from transformers import SegGptConfig, SegGptForImageSegmentation, SegGptImageProcessor
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+# here we list all keys to be renamed (original name on the left, our name on the right)
+def create_rename_keys(config):
+ rename_keys = []
+
+ # fmt: off
+
+ # rename embedding and its parameters
+ rename_keys.append(("patch_embed.proj.weight", "model.embeddings.patch_embeddings.projection.weight"))
+ rename_keys.append(("patch_embed.proj.bias", "model.embeddings.patch_embeddings.projection.bias"))
+ rename_keys.append(("mask_token", "model.embeddings.mask_token"))
+ rename_keys.append(("segment_token_x", "model.embeddings.segment_token_input"))
+ rename_keys.append(("segment_token_y", "model.embeddings.segment_token_prompt"))
+ rename_keys.append(("type_token_cls", "model.embeddings.type_token_semantic"))
+ rename_keys.append(("type_token_ins", "model.embeddings.type_token_instance"))
+ rename_keys.append(("pos_embed", "model.embeddings.position_embeddings"))
+
+ # rename decoder and other
+ rename_keys.append(("norm.weight", "model.encoder.layernorm.weight"))
+ rename_keys.append(("norm.bias", "model.encoder.layernorm.bias"))
+ rename_keys.append(("decoder_embed.weight", "decoder.decoder_embed.weight"))
+ rename_keys.append(("decoder_embed.bias", "decoder.decoder_embed.bias"))
+ rename_keys.append(("decoder_pred.0.weight", "decoder.decoder_pred.conv.weight"))
+ rename_keys.append(("decoder_pred.0.bias", "decoder.decoder_pred.conv.bias"))
+ rename_keys.append(("decoder_pred.1.weight", "decoder.decoder_pred.layernorm.weight"))
+ rename_keys.append(("decoder_pred.1.bias", "decoder.decoder_pred.layernorm.bias"))
+ rename_keys.append(("decoder_pred.3.weight", "decoder.decoder_pred.head.weight"))
+ rename_keys.append(("decoder_pred.3.bias", "decoder.decoder_pred.head.bias"))
+
+ # rename blocks
+ for i in range(config.num_hidden_layers):
+ rename_keys.append((f"blocks.{i}.attn.qkv.weight", f"model.encoder.layers.{i}.attention.qkv.weight"))
+ rename_keys.append((f"blocks.{i}.attn.qkv.bias", f"model.encoder.layers.{i}.attention.qkv.bias"))
+ rename_keys.append((f"blocks.{i}.attn.proj.weight", f"model.encoder.layers.{i}.attention.proj.weight"))
+ rename_keys.append((f"blocks.{i}.attn.proj.bias", f"model.encoder.layers.{i}.attention.proj.bias"))
+ rename_keys.append((f"blocks.{i}.attn.rel_pos_h", f"model.encoder.layers.{i}.attention.rel_pos_h"))
+ rename_keys.append((f"blocks.{i}.attn.rel_pos_w", f"model.encoder.layers.{i}.attention.rel_pos_w"))
+
+ rename_keys.append((f"blocks.{i}.mlp.fc1.weight", f"model.encoder.layers.{i}.mlp.lin1.weight"))
+ rename_keys.append((f"blocks.{i}.mlp.fc1.bias", f"model.encoder.layers.{i}.mlp.lin1.bias"))
+ rename_keys.append((f"blocks.{i}.mlp.fc2.weight", f"model.encoder.layers.{i}.mlp.lin2.weight"))
+ rename_keys.append((f"blocks.{i}.mlp.fc2.bias", f"model.encoder.layers.{i}.mlp.lin2.bias"))
+
+ rename_keys.append((f"blocks.{i}.norm1.weight", f"model.encoder.layers.{i}.layernorm_before.weight"))
+ rename_keys.append((f"blocks.{i}.norm1.bias", f"model.encoder.layers.{i}.layernorm_before.bias"))
+ rename_keys.append((f"blocks.{i}.norm2.weight", f"model.encoder.layers.{i}.layernorm_after.weight"))
+ rename_keys.append((f"blocks.{i}.norm2.bias", f"model.encoder.layers.{i}.layernorm_after.bias"))
+
+ # fmt: on
+
+ return rename_keys
+
+
+def rename_key(dct, old, new):
+ val = dct.pop(old)
+ dct[new] = val
+
+
+# We will verify our results on example images from the original SegGPT repository
+def prepare_input():
+ image_input_url = (
+ "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_2.jpg"
+ )
+ image_prompt_url = (
+ "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_1.jpg"
+ )
+ mask_prompt_url = (
+ "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_1_target.png"
+ )
+
+ image_input = Image.open(requests.get(image_input_url, stream=True).raw)
+ image_prompt = Image.open(requests.get(image_prompt_url, stream=True).raw)
+ mask_prompt = Image.open(requests.get(mask_prompt_url, stream=True).raw)
+
+ return image_input, image_prompt, mask_prompt
+
+
+@torch.no_grad()
+def convert_seggpt_checkpoint(args):
+ model_name = args.model_name
+ pytorch_dump_folder_path = args.pytorch_dump_folder_path
+ verify_logits = args.verify_logits
+ push_to_hub = args.push_to_hub
+
+ # Define default SegGpt configuration
+ config = SegGptConfig()
+
+ # Load original checkpoint
+ checkpoint_url = "https://huggingface.co/BAAI/SegGpt/resolve/main/seggpt_vit_large.pth"
+ original_state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")["model"]
+
+ # Rename keys
+ new_state_dict = original_state_dict.copy()
+ rename_keys = create_rename_keys(config)
+
+ for src, dest in rename_keys:
+ rename_key(new_state_dict, src, dest)
+
+ # Load HF model
+ model = SegGptForImageSegmentation(config)
+ model.eval()
+ missing_keys, unexpected_keys = model.load_state_dict(new_state_dict, strict=False)
+ print("Missing keys:", missing_keys)
+ print("Unexpected keys:", unexpected_keys)
+
+ input_img, prompt_img, prompt_mask = prepare_input()
+ image_processor = SegGptImageProcessor()
+ inputs = image_processor(images=input_img, prompt_images=prompt_img, prompt_masks=prompt_mask, return_tensors="pt")
+
+ expected_prompt_pixel_values = torch.tensor(
+ [
+ [[-0.6965, -0.6965, -0.6965], [-0.6965, -0.6965, -0.6965], [-0.6965, -0.6965, -0.6965]],
+ [[1.6583, 1.6583, 1.6583], [1.6583, 1.6583, 1.6583], [1.6583, 1.6583, 1.6583]],
+ [[2.3088, 2.3088, 2.3088], [2.3088, 2.3088, 2.3088], [2.3088, 2.3088, 2.3088]],
+ ]
+ )
+
+ expected_pixel_values = torch.tensor(
+ [
+ [[1.6324, 1.6153, 1.5810], [1.6153, 1.5982, 1.5810], [1.5810, 1.5639, 1.5639]],
+ [[1.2731, 1.2556, 1.2206], [1.2556, 1.2381, 1.2031], [1.2206, 1.2031, 1.1681]],
+ [[1.6465, 1.6465, 1.6465], [1.6465, 1.6465, 1.6465], [1.6291, 1.6291, 1.6291]],
+ ]
+ )
+
+ expected_prompt_masks = torch.tensor(
+ [
+ [[-2.1179, -2.1179, -2.1179], [-2.1179, -2.1179, -2.1179], [-2.1179, -2.1179, -2.1179]],
+ [[-2.0357, -2.0357, -2.0357], [-2.0357, -2.0357, -2.0357], [-2.0357, -2.0357, -2.0357]],
+ [[-1.8044, -1.8044, -1.8044], [-1.8044, -1.8044, -1.8044], [-1.8044, -1.8044, -1.8044]],
+ ]
+ )
+
+ assert torch.allclose(inputs.pixel_values[0, :, :3, :3], expected_pixel_values, atol=1e-4)
+ assert torch.allclose(inputs.prompt_pixel_values[0, :, :3, :3], expected_prompt_pixel_values, atol=1e-4)
+ assert torch.allclose(inputs.prompt_masks[0, :, :3, :3], expected_prompt_masks, atol=1e-4)
+
+ torch.manual_seed(2)
+ outputs = model(**inputs)
+ print(outputs)
+
+ if verify_logits:
+ expected_output = torch.tensor(
+ [
+ [[-2.1208, -2.1190, -2.1198], [-2.1237, -2.1228, -2.1227], [-2.1232, -2.1226, -2.1228]],
+ [[-2.0405, -2.0396, -2.0403], [-2.0434, -2.0434, -2.0433], [-2.0428, -2.0432, -2.0434]],
+ [[-1.8102, -1.8088, -1.8099], [-1.8131, -1.8126, -1.8129], [-1.8130, -1.8128, -1.8131]],
+ ]
+ )
+ assert torch.allclose(outputs.pred_masks[0, :, :3, :3], expected_output, atol=1e-4)
+ print("Looks good!")
+ else:
+ print("Converted without verifying logits")
+
+ if pytorch_dump_folder_path is not None:
+ print(f"Saving model and processor for {model_name} to {pytorch_dump_folder_path}")
+ model.save_pretrained(pytorch_dump_folder_path)
+ image_processor.save_pretrained(pytorch_dump_folder_path)
+
+ if push_to_hub:
+ print(f"Pushing model and processor for {model_name} to hub")
+ model.push_to_hub(f"EduardoPacheco/{model_name}")
+ image_processor.push_to_hub(f"EduardoPacheco/{model_name}")
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ # Required parameters
+ parser.add_argument(
+ "--model_name",
+ default="seggpt-vit-large",
+ type=str,
+ choices=["seggpt-vit-large"],
+ help="Name of the SegGpt model you'd like to convert.",
+ )
+ parser.add_argument(
+ "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
+ )
+ parser.add_argument(
+ "--verify_logits",
+ action="store_false",
+ help="Whether or not to verify the logits against the original implementation.",
+ )
+ parser.add_argument(
+ "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+ )
+
+ args = parser.parse_args()
+ convert_seggpt_checkpoint(args)
diff --git a/src/transformers/models/seggpt/image_processing_seggpt.py b/src/transformers/models/seggpt/image_processing_seggpt.py
new file mode 100644
index 00000000000000..80fb94cdc7aaf4
--- /dev/null
+++ b/src/transformers/models/seggpt/image_processing_seggpt.py
@@ -0,0 +1,626 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Image processor class for SegGPT."""
+
+from typing import Dict, List, Optional, Tuple, Union
+
+import numpy as np
+
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import resize, to_channel_dimension_format
+from ...image_utils import (
+ IMAGENET_DEFAULT_MEAN,
+ IMAGENET_DEFAULT_STD,
+ ChannelDimension,
+ ImageInput,
+ PILImageResampling,
+ get_channel_dimension_axis,
+ infer_channel_dimension_format,
+ is_scaled_image,
+ make_list_of_images,
+ to_numpy_array,
+ valid_images,
+)
+from ...utils import TensorType, is_torch_available, logging, requires_backends
+
+
+if is_torch_available():
+ import torch
+
+
+logger = logging.get_logger(__name__)
+
+
+# See https://arxiv.org/pdf/2212.02499.pdf at 3.1 Redefining Output Spaces as "Images" - Semantic Segmentation from PAINTER paper
+# Taken from https://github.com/Abdullah-Meda/Painter/blob/main/Painter/data/coco_semseg/gen_color_coco_panoptic_segm.py#L31
+def build_palette(num_labels: int) -> List[Tuple[int, int, int]]:
+ base = int(num_labels ** (1 / 3)) + 1
+ margin = 256 // base
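+ # For illustration: num_labels=103 gives base = int(103 ** (1 / 3)) + 1 = 5 and margin = 256 // 5 = 51,
+ # so each R/G/B channel value below steps down from 255 in multiples of `margin`.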
+
+ # we assume that class_idx 0 is the background which is mapped to black
+ color_list = [(0, 0, 0)]
+ for location in range(num_labels):
+ num_seq_r = location // base**2
+ num_seq_g = (location % base**2) // base
+ num_seq_b = location % base
+
+ R = 255 - num_seq_r * margin
+ G = 255 - num_seq_g * margin
+ B = 255 - num_seq_b * margin
+
+ color_list.append((R, G, B))
+
+ return color_list
+
+
+def get_num_channels(image: np.ndarray, input_data_format: ChannelDimension) -> int:
+ if image.ndim == 2:
+ return 0
+
+ channel_idx = get_channel_dimension_axis(image, input_data_format)
+ return image.shape[channel_idx]
+
+
+def mask_to_rgb(
+ mask: np.ndarray,
+ palette: Optional[List[Tuple[int, int, int]]] = None,
+ input_data_format: Optional[ChannelDimension] = None,
+ data_format: Optional[ChannelDimension] = None,
+) -> np.ndarray:
+ if input_data_format is None and mask.ndim > 2:
+ input_data_format = infer_channel_dimension_format(mask)
+
+ data_format = data_format if data_format is not None else input_data_format
+
+ num_channels = get_num_channels(mask, input_data_format)
+
+ if num_channels == 3:
+ return to_channel_dimension_format(mask, data_format, input_data_format) if data_format is not None else mask
+
+ if palette is not None:
+ height, width = mask.shape
+
+ rgb_mask = np.zeros((3, height, width), dtype=np.uint8)
+
+ classes_in_mask = np.unique(mask)
+
+ for class_idx in classes_in_mask:
+ rgb_value = palette[class_idx]
+ class_mask = (mask == class_idx).astype(np.uint8)
+ class_mask = np.expand_dims(class_mask, axis=-1)
+ class_rgb_mask = class_mask * np.array(rgb_value)
+ class_rgb_mask = np.moveaxis(class_rgb_mask, -1, 0)
+ rgb_mask += class_rgb_mask.astype(np.uint8)
+
+ rgb_mask = np.clip(rgb_mask, 0, 255).astype(np.uint8)
+
+ else:
+ rgb_mask = np.repeat(mask[None, ...], 3, axis=0)
+
+ return (
+ to_channel_dimension_format(rgb_mask, data_format, input_data_format) if data_format is not None else rgb_mask
+ )
+
+
+class SegGptImageProcessor(BaseImageProcessor):
+ r"""
+ Constructs a SegGpt image processor.
+
+ Args:
+ do_resize (`bool`, *optional*, defaults to `True`):
+ Whether to resize the image's (height, width) dimensions to the specified `(size["height"],
+ size["width"])`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
+ size (`dict`, *optional*, defaults to `{"height": 448, "width": 448}`):
+ Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
+ method.
+ resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
+ Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the
+ `preprocess` method.
+ do_rescale (`bool`, *optional*, defaults to `True`):
+ Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
+ parameter in the `preprocess` method.
+ rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
+ Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
+ `preprocess` method.
+ do_normalize (`bool`, *optional*, defaults to `True`):
+ Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
+ method.
+ image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_MEAN`):
+ Mean to use if normalizing the image. This is a float or list of floats the length of the number of
+ channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
+ image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
+ Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
+ number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
+ """
+
+ model_input_names = ["pixel_values"]
+
+ def __init__(
+ self,
+ do_resize: bool = True,
+ size: Optional[Dict[str, int]] = None,
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
+ do_rescale: bool = True,
+ rescale_factor: Union[int, float] = 1 / 255,
+ do_normalize: bool = True,
+ image_mean: Optional[Union[float, List[float]]] = None,
+ image_std: Optional[Union[float, List[float]]] = None,
+ **kwargs,
+ ) -> None:
+ super().__init__(**kwargs)
+ size = size if size is not None else {"height": 448, "width": 448}
+ size = get_size_dict(size)
+ self.do_resize = do_resize
+ self.do_rescale = do_rescale
+ self.do_normalize = do_normalize
+ self.size = size
+ self.resample = resample
+ self.rescale_factor = rescale_factor
+ self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
+ self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
+
+ def get_palette(self, num_labels: int) -> List[Tuple[int, int, int]]:
+ """Build a palette to map the prompt mask from a single channel to a 3 channel RGB.
+
+ Args:
+ num_labels (`int`):
+ Number of classes in the segmentation task (excluding the background).
+
+ Returns:
+ `List[Tuple[int, int, int]]`: Palette to map the prompt mask from a single channel to a 3 channel RGB.
+ """
+ return build_palette(num_labels)
+
+ def mask_to_rgb(
+ self,
+ image: np.ndarray,
+ palette: Optional[List[Tuple[int, int, int]]] = None,
+ data_format: Optional[Union[str, ChannelDimension]] = None,
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
+ ) -> np.ndarray:
+ """Convert a mask to RGB format.
+
+ Args:
+ image (`np.ndarray`):
+ Mask to convert to RGB format. If the mask is already in RGB format, it will be passed through.
+ palette (`List[Tuple[int, int, int]]`, *optional*, defaults to `None`):
+ Palette to use to convert the mask to RGB format. If unset, the mask is duplicated across the channel
+ dimension.
+ data_format (`ChannelDimension` or `str`, *optional*):
+ The channel dimension format for the output image. If unset, the channel dimension format of the input
+ image is used. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ input_data_format (`ChannelDimension` or `str`, *optional*):
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
+ from the input image. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+
+ Returns:
+ `np.ndarray`: The mask in RGB format.
+ """
+ return mask_to_rgb(
+ image,
+ palette=palette,
+ data_format=data_format,
+ input_data_format=input_data_format,
+ )
+
+ # Copied from transformers.models.vit.image_processing_vit.ViTImageProcessor.resize with PILImageResampling.BILINEAR->PILImageResampling.BICUBIC
+ def resize(
+ self,
+ image: np.ndarray,
+ size: Dict[str, int],
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
+ data_format: Optional[Union[str, ChannelDimension]] = None,
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
+ **kwargs,
+ ) -> np.ndarray:
+ """
+ Resize an image to `(size["height"], size["width"])`.
+
+ Args:
+ image (`np.ndarray`):
+ Image to resize.
+ size (`Dict[str, int]`):
+ Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
+ resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+ `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
+ data_format (`ChannelDimension` or `str`, *optional*):
+ The channel dimension format for the output image. If unset, the channel dimension format of the input
+ image is used. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+ input_data_format (`ChannelDimension` or `str`, *optional*):
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
+ from the input image. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+
+ Returns:
+ `np.ndarray`: The resized image.
+ """
+ size = get_size_dict(size)
+ if "height" not in size or "width" not in size:
+ raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}")
+ output_size = (size["height"], size["width"])
+ return resize(
+ image,
+ size=output_size,
+ resample=resample,
+ data_format=data_format,
+ input_data_format=input_data_format,
+ **kwargs,
+ )
+
+ def _preprocess_step(
+ self,
+ images: ImageInput,
+ is_mask: bool = False,
+ do_resize: Optional[bool] = None,
+ size: Dict[str, int] = None,
+ resample: PILImageResampling = None,
+ do_rescale: Optional[bool] = None,
+ rescale_factor: Optional[float] = None,
+ do_normalize: Optional[bool] = None,
+ image_mean: Optional[Union[float, List[float]]] = None,
+ image_std: Optional[Union[float, List[float]]] = None,
+ data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
+ num_labels: Optional[int] = None,
+ **kwargs,
+ ):
+ """
+ Preprocess an image or batch of images.
+
+ Args:
+ images (`ImageInput`):
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+ is_mask (`bool`, *optional*, defaults to `False`):
+ Whether the image is a mask. If `True`, the image is converted to RGB using the palette if
+ `num_labels` is specified; otherwise, RGB is obtained by repeating the channel.
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+ Whether to resize the image.
+ size (`Dict[str, int]`, *optional*, defaults to `self.size`):
+ Dictionary in the format `{"height": h, "width": w}` specifying the size of the output image after
+ resizing.
+ resample (`PILImageResampling` filter, *optional*, defaults to `self.resample`):
+ `PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BICUBIC`. Only has
+ an effect if `do_resize` is set to `True`.
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+ Whether to rescale the image values between [0 - 1].
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+ Rescale factor to rescale the image by if `do_rescale` is set to `True`.
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+ Whether to normalize the image.
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
+ Image mean to use if `do_normalize` is set to `True`.
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
+ Image standard deviation to use if `do_normalize` is set to `True`.
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+ The channel dimension format for the output image. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ - Unset: Use the channel dimension format of the input image.
+ input_data_format (`ChannelDimension` or `str`, *optional*):
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
+ from the input image. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+ num_labels (`int`, *optional*):
+ Number of classes in the segmentation task (excluding the background). If specified, a palette will be
+ built, assuming that class_idx 0 is the background, to map the prompt mask from a single class_idx
+ channel to a 3-channel RGB image. If not specified, the prompt mask is either passed through as is (if
+ already in RGB format) or repeated across the channel dimension.
+ """
+ do_resize = do_resize if do_resize is not None else self.do_resize
+ do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+ do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+ resample = resample if resample is not None else self.resample
+ rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+ image_mean = image_mean if image_mean is not None else self.image_mean
+ image_std = image_std if image_std is not None else self.image_std
+
+ size = size if size is not None else self.size
+ size_dict = get_size_dict(size)
+
+ images = make_list_of_images(images)
+
+ if not valid_images(images):
+ raise ValueError(
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+ "torch.Tensor, tf.Tensor or jax.ndarray."
+ )
+
+ if do_resize and size is None:
+ raise ValueError("Size must be specified if do_resize is True.")
+
+ if do_rescale and rescale_factor is None:
+ raise ValueError("Rescale factor must be specified if do_rescale is True.")
+
+ if do_normalize and (image_mean is None or image_std is None):
+ raise ValueError("Image mean and std must be specified if do_normalize is True.")
+
+ # All transformations expect numpy arrays.
+ images = [to_numpy_array(image) for image in images]
+
+ if is_scaled_image(images[0]) and do_rescale:
+ logger.warning_once(
+ "It looks like you are trying to rescale already rescaled images. If the input"
+ " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+ )
+
+ if input_data_format is None and not is_mask:
+ # We assume that all images have the same channel dimension format.
+ input_data_format = infer_channel_dimension_format(images[0])
+
+ if is_mask:
+ palette = self.get_palette(num_labels) if num_labels is not None else None
+ # Since this is the input for the next transformations, its format should be the same as `input_data_format`
+ images = [
+ self.mask_to_rgb(image=image, palette=palette, data_format=ChannelDimension.FIRST) for image in images
+ ]
+ input_data_format = ChannelDimension.FIRST
+
+ if do_resize:
+ images = [
+ self.resize(image=image, size=size_dict, resample=resample, input_data_format=input_data_format)
+ for image in images
+ ]
+
+ if do_rescale:
+ images = [
+ self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+ for image in images
+ ]
+
+ if do_normalize:
+ images = [
+ self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+ for image in images
+ ]
+
+ images = [
+ to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+ ]
+
+ return images
+
+ def preprocess(
+ self,
+ images: Optional[ImageInput] = None,
+ prompt_images: Optional[ImageInput] = None,
+ prompt_masks: Optional[ImageInput] = None,
+ do_resize: Optional[bool] = None,
+ size: Dict[str, int] = None,
+ resample: PILImageResampling = None,
+ do_rescale: Optional[bool] = None,
+ rescale_factor: Optional[float] = None,
+ do_normalize: Optional[bool] = None,
+ image_mean: Optional[Union[float, List[float]]] = None,
+ image_std: Optional[Union[float, List[float]]] = None,
+ num_labels: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
+ **kwargs,
+ ):
+ """
+ Preprocess an image or batch of images.
+
+ Args:
+ images (`ImageInput`):
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+ prompt_images (`ImageInput`):
+ Prompt image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+ prompt_masks (`ImageInput`):
+ Prompt mask from the prompt image to preprocess. Expects a single mask or a batch of masks. If a mask has
+ a single channel, it is converted to RGB using the palette if `num_labels` is specified, or by repeating
+ the channel otherwise. If the mask is already in RGB format, it is passed through.
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+ Whether to resize the image.
+ size (`Dict[str, int]`, *optional*, defaults to `self.size`):
+ Dictionary in the format `{"height": h, "width": w}` specifying the size of the output image after
+ resizing.
+ resample (`PILImageResampling` filter, *optional*, defaults to `self.resample`):
+ `PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BICUBIC`. Only has
+ an effect if `do_resize` is set to `True`. Does not apply to the prompt mask, which is resized using
+ nearest-neighbor resampling.
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+ Whether to rescale the image values between [0 - 1].
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+ Rescale factor to rescale the image by if `do_rescale` is set to `True`.
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+ Whether to normalize the image.
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
+ Image mean to use if `do_normalize` is set to `True`.
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
+ Image standard deviation to use if `do_normalize` is set to `True`.
+ return_tensors (`str` or `TensorType`, *optional*):
+ The type of tensors to return. Can be one of:
+ - Unset: Return a list of `np.ndarray`.
+ - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+ - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+ - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+ - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+ The channel dimension format for the output image. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ - Unset: Use the channel dimension format of the input image.
+ input_data_format (`ChannelDimension` or `str`, *optional*):
+ The channel dimension format for the input image. If unset, the channel dimension format is inferred
+ from the input image. Can be one of:
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+ num_labels (`int`, *optional*):
+ Number of classes in the segmentation task (excluding the background). If specified, a palette will be
+ built, assuming that class_idx 0 is the background, to map the prompt mask from a single class_idx
+ channel to a 3-channel RGB image. If not specified, the prompt mask is either passed through as is (if
+ already in RGB format) or repeated across the channel dimension.
+ """
+ if all(v is None for v in [images, prompt_images, prompt_masks]):
+ raise ValueError("At least one of images, prompt_images, prompt_masks must be specified.")
+
+ data = {}
+
+ if images is not None:
+ images = self._preprocess_step(
+ images,
+ is_mask=False,
+ do_resize=do_resize,
+ size=size,
+ resample=resample,
+ do_rescale=do_rescale,
+ rescale_factor=rescale_factor,
+ do_normalize=do_normalize,
+ image_mean=image_mean,
+ image_std=image_std,
+ data_format=data_format,
+ input_data_format=input_data_format,
+ **kwargs,
+ )
+
+ data["pixel_values"] = images
+
+ if prompt_images is not None:
+ prompt_images = self._preprocess_step(
+ prompt_images,
+ is_mask=False,
+ do_resize=do_resize,
+ size=size,
+ resample=resample,
+ do_rescale=do_rescale,
+ rescale_factor=rescale_factor,
+ do_normalize=do_normalize,
+ image_mean=image_mean,
+ image_std=image_std,
+ data_format=data_format,
+ input_data_format=input_data_format,
+ **kwargs,
+ )
+
+ data["prompt_pixel_values"] = prompt_images
+
+ if prompt_masks is not None:
+ prompt_masks = self._preprocess_step(
+ prompt_masks,
+ is_mask=True,
+ do_resize=do_resize,
+ size=size,
+ resample=PILImageResampling.NEAREST,
+ do_rescale=do_rescale,
+ rescale_factor=rescale_factor,
+ do_normalize=do_normalize,
+ image_mean=image_mean,
+ image_std=image_std,
+ data_format=data_format,
+ input_data_format=input_data_format,
+ num_labels=num_labels,
+ **kwargs,
+ )
+
+ data["prompt_masks"] = prompt_masks
+
+ return BatchFeature(data=data, tensor_type=return_tensors)
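+
+ # Illustrative usage sketch (hypothetical `image`, `prompt_image` and `prompt_mask` inputs):
+ #
+ #   processor = SegGptImageProcessor()
+ #   inputs = processor(images=image, prompt_images=prompt_image, prompt_masks=prompt_mask, return_tensors="pt")
+ #   # `inputs` is a `BatchFeature` with "pixel_values", "prompt_pixel_values" and "prompt_masks"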
+
+ def post_process_semantic_segmentation(
+ self, outputs, target_sizes: Optional[List[Tuple[int, int]]] = None, num_labels: Optional[int] = None
+ ):
+ """
+ Converts the output of [`SegGptImageSegmentationOutput`] into segmentation maps. Only supports
+ PyTorch.
+
+ Args:
+ outputs ([`SegGptImageSegmentationOutput`]):
+ Raw outputs of the model.
+ target_sizes (`List[Tuple[int, int]]`, *optional*):
+ List of length (batch_size), where each list item (`Tuple[int, int]`) corresponds to the requested
+ final size (height, width) of each prediction. If left to None, predictions will not be resized.
+ num_labels (`int`, *optional*):
+ Number of classes in the segmentation task (excluding the background). If specified, a palette will be
+ built, assuming that class_idx 0 is the background, to map prediction masks from RGB values to class
+ indices. This value should be the same used when preprocessing inputs.
+ Returns:
+ semantic_segmentation: `List[torch.Tensor]` of length `batch_size`, where each item is a semantic
+ segmentation map of shape (height, width) corresponding to the target_sizes entry (if `target_sizes` is
+ specified). Each entry of each `torch.Tensor` corresponds to a semantic class id.
+ """
+ requires_backends(self, ["torch"])
+ # batch_size x num_channels x 2*height x width
+ masks = outputs.pred_masks
+
+ # Predicted mask and prompt are concatenated in the height dimension
+ # batch_size x num_channels x height x width
+ masks = masks[:, :, masks.shape[2] // 2 :, :]
+
+ # To unnormalize we need to permute to channels-last format
+ # batch_size x height x width x num_channels
+ std = torch.tensor(self.image_std).to(masks.device)
+ mean = torch.tensor(self.image_mean).to(masks.device)
+
+ masks = masks.permute(0, 2, 3, 1) * std + mean
+
+ # batch_size x num_channels x height x width
+ masks = masks.permute(0, 3, 1, 2)
+
+ # Clip to match with palette if specified
+ masks = torch.clip(masks * 255, 0, 255)
+
+ semantic_segmentation = []
+ palette_tensor = None
+ palette = self.get_palette(num_labels) if num_labels is not None else None
+ if palette is not None:
+ palette_tensor = torch.tensor(palette).float().to(masks.device)
+ _, num_channels, _, _ = masks.shape
+ palette_tensor = palette_tensor.view(1, 1, num_labels + 1, num_channels)
+
+ for idx, mask in enumerate(masks):
+ if target_sizes is not None:
+ mask = torch.nn.functional.interpolate(
+ mask.unsqueeze(0),
+ size=target_sizes[idx],
+ mode="nearest",
+ )[0]
+
+ if num_labels is not None:
+ channels, height, width = mask.shape
+ dist = mask.permute(1, 2, 0).view(height, width, 1, channels)
+ dist = dist - palette_tensor
+ dist = torch.pow(dist, 2)
+ dist = torch.sum(dist, dim=-1)
+ pred = dist.argmin(dim=-1)
+
+ else:
+ # If no palette is specified, SegGpt will try to paint using the mask class index as RGB
+ pred = mask.mean(dim=0).int()
+
+ semantic_segmentation.append(pred)
+
+ return semantic_segmentation
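+
+ # Illustrative sketch of the palette matching above (hypothetical values): each RGB pixel is
+ # assigned the class whose palette color is closest in squared Euclidean distance.
+ #
+ #   pixel = torch.tensor([250.0, 3.0, 1.0])
+ #   palette = torch.tensor([[0.0, 0.0, 0.0], [255.0, 0.0, 0.0]])
+ #   class_idx = ((pixel - palette) ** 2).sum(dim=-1).argmin()  # tensor(1)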
diff --git a/src/transformers/models/seggpt/modeling_seggpt.py b/src/transformers/models/seggpt/modeling_seggpt.py
new file mode 100644
index 00000000000000..87175fdf38ce6e
--- /dev/null
+++ b/src/transformers/models/seggpt/modeling_seggpt.py
@@ -0,0 +1,1014 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch SegGpt model."""
+
+
+import collections.abc
+import math
+from dataclasses import dataclass
+from typing import Dict, List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import functional as F
+
+from ...activations import ACT2FN
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+ ModelOutput,
+ add_start_docstrings,
+ add_start_docstrings_to_model_forward,
+ logging,
+ replace_return_docstrings,
+)
+from .configuration_seggpt import SegGptConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "SegGptConfig"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "BAAI/seggpt-vit-large"
+_EXPECTED_OUTPUT_SHAPE = [3, 896, 448]
+
+
+SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST = [
+ "BAAI/seggpt-vit-large",
+ # See all SegGpt models at https://huggingface.co/models?filter=seggpt
+]
+
+
+@dataclass
+class SegGptEncoderOutput(ModelOutput):
+ """
+ Output type of [`SegGptEncoder`].
+
+ Args:
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, patch_height, patch_width, hidden_size)`):
+ Sequence of hidden-states at the output of the last layer of the model.
+ hidden_states (`Tuple[torch.FloatTensor]`, *optional*, returned when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+ of shape `(batch_size, patch_height, patch_width, hidden_size)`.
+ attentions (`Tuple[torch.FloatTensor]`, *optional*, returned when `config.output_attentions=True`):
+ Tuple of `torch.FloatTensor` (one for each layer) of shape
+ `(batch_size, num_heads, seq_len, seq_len)`.
+ intermediate_hidden_states (`Tuple[torch.FloatTensor]`, *optional*, returned when `config.intermediate_hidden_state_indices` is set):
+ Tuple of `torch.FloatTensor` of shape `(batch_size, patch_height, patch_width, hidden_size)`.
+ Each element in the tuple corresponds to the output of the layer specified in `config.intermediate_hidden_state_indices`.
+ Additionally, each feature passes through a LayerNorm.
+ """
+
+ last_hidden_state: torch.FloatTensor
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
+ intermediate_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+
+
+@dataclass
+class SegGptImageSegmentationOutput(ModelOutput):
+ """
+ Output type of [`SegGptForImageSegmentation`].
+
+ Args:
+ loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided):
+ The loss value.
+ pred_masks (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ The predicted masks.
+ hidden_states (`Tuple[torch.FloatTensor]`, *optional*, returned when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+ of shape `(batch_size, patch_height, patch_width, hidden_size)`.
+ attentions (`Tuple[torch.FloatTensor]`, *optional*, returned when `config.output_attentions=True`):
+ Tuple of `torch.FloatTensor` (one for each layer) of shape
+ `(batch_size, num_heads, seq_len, seq_len)`.
+ """
+
+ loss: Optional[torch.FloatTensor] = None
+ pred_masks: Optional[torch.FloatTensor] = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
+
+
+# Copied from transformers.models.sam.modeling_sam.SamPatchEmbeddings with Sam->SegGpt
+class SegGptPatchEmbeddings(nn.Module):
+ """
+ This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial
+ `hidden_states` (patch embeddings) of shape `(batch_size, seq_length, hidden_size)` to be consumed by a
+ Transformer.
+ """
+
+ def __init__(self, config):
+ super().__init__()
+ image_size, patch_size = config.image_size, config.patch_size
+ num_channels, hidden_size = config.num_channels, config.hidden_size
+ image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size)
+ patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
+ num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
+ self.image_size = image_size
+ self.patch_size = patch_size
+ self.num_channels = num_channels
+ self.num_patches = num_patches
+
+ self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)
+
+ def forward(self, pixel_values):
+ batch_size, num_channels, height, width = pixel_values.shape
+ if num_channels != self.num_channels:
+ raise ValueError(
+ "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
+ )
+ if height != self.image_size[0] or width != self.image_size[1]:
+ raise ValueError(
+ f"Input image size ({height}*{width}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})."
+ )
+ embeddings = self.projection(pixel_values).permute(0, 2, 3, 1)
+ return embeddings
+
+
+class SegGptEmbeddings(nn.Module):
+ """
+ Construct the embeddings from patch and position embeddings, for both the input and the prompt.
+ """
+
+ def __init__(self, config: SegGptConfig) -> None:
+ super().__init__()
+
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, 1, config.hidden_size))
+ self.segment_token_input = nn.Parameter(torch.zeros(1, 1, 1, config.hidden_size))
+ self.segment_token_prompt = nn.Parameter(torch.zeros(1, 1, 1, config.hidden_size))
+ # token for seg types
+ self.type_token_semantic = nn.Parameter(torch.zeros(1, 1, 1, config.hidden_size))
+ self.type_token_instance = nn.Parameter(torch.zeros(1, 1, 1, config.hidden_size))
+
+ self.patch_embeddings = SegGptPatchEmbeddings(config)
+
+ num_positions = (config.pretrain_image_size // config.patch_size) ** 2 + 1
+ self.position_embeddings = nn.Parameter(torch.randn(1, num_positions, config.hidden_size))
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+ def interpolate_pos_encoding(self, height: int, width: int) -> torch.Tensor:
+ patch_pos_embed = self.position_embeddings[:, 1:]
+ num_patches = patch_pos_embed.shape[1]
+ pretrain_patch_size = int(math.sqrt(num_patches))
+
+ if pretrain_patch_size != height or pretrain_patch_size != width:
+ patch_pos_embed = F.interpolate(
+ patch_pos_embed.reshape(1, pretrain_patch_size, pretrain_patch_size, -1).permute(0, 3, 1, 2),
+ size=(height, width),
+ mode="bicubic",
+ align_corners=False,
+ )
+
+ return patch_pos_embed.permute(0, 2, 3, 1)
+ else:
+ return patch_pos_embed.reshape(1, height, width, -1)
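+
+ # Illustrative note (hypothetical values): with `pretrain_image_size=224` and `patch_size=16` the stored
+ # position grid is 14 x 14; `interpolate_pos_encoding(56, 28)` bicubically resizes it to a
+ # (1, 56, 28, hidden_size) grid matching the patch grid of the concatenated prompt + input image.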
+
+ def forward(
+ self,
+ pixel_values: torch.Tensor,
+ prompt_pixel_values: torch.Tensor,
+ bool_masked_pos: Optional[torch.BoolTensor] = None,
+ embedding_type: Optional[str] = None,
+ ) -> torch.Tensor:
+ input_embeddings = self.patch_embeddings(pixel_values)
+ prompt_embeddings = self.patch_embeddings(prompt_pixel_values)
+
+ batch_size, patch_height, patch_width, _ = input_embeddings.shape
+
+ mask_token = self.mask_token.expand(batch_size, patch_height, patch_width, -1)
+ # replace the masked visual tokens by mask_token
+ w = bool_masked_pos.unsqueeze(-1).type_as(mask_token).reshape(-1, patch_height, patch_width, 1)
+ prompt_embeddings = prompt_embeddings * (1 - w) + mask_token * w
+
+ embedding_type = embedding_type if embedding_type is not None else "instance"
+
+ # add positional encoding to each token
+ pos_embed = self.interpolate_pos_encoding(patch_height, patch_width)
+
+ # add segment token
+ input_embeddings = input_embeddings + self.segment_token_input
+ prompt_embeddings = prompt_embeddings + self.segment_token_prompt
+
+ # add position embedding skipping CLS
+ input_embeddings = input_embeddings + pos_embed
+ prompt_embeddings = prompt_embeddings + pos_embed
+
+ # add type embedding to each token
+ if embedding_type == "semantic":
+ type_embedding = self.type_token_semantic
+ elif embedding_type == "instance":
+ type_embedding = self.type_token_instance
+ else:
+ raise ValueError(f"Embedding type should be either 'semantic' or 'instance', but got {embedding_type}")
+
+ input_embeddings = input_embeddings + type_embedding
+ prompt_embeddings = prompt_embeddings + type_embedding
+
+ embeddings = torch.cat((input_embeddings, prompt_embeddings), dim=0)
+
+ return embeddings
+
+
+class SegGptAttention(nn.Module):
+ """Multi-head Attention block with relative position embeddings."""
+
+ def __init__(self, config):
+ super().__init__()
+ image_size, patch_size = config.image_size, config.patch_size
+ image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size)
+ patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
+
+ input_size = (image_size[0] // config.patch_size, image_size[1] // config.patch_size)
+ head_dim = config.hidden_size // config.num_attention_heads
+
+ self.num_attention_heads = config.num_attention_heads
+ self.scale = head_dim**-0.5
+
+ self.qkv = nn.Linear(config.hidden_size, config.hidden_size * 3, bias=config.qkv_bias)
+ self.proj = nn.Linear(config.hidden_size, config.hidden_size)
+
+ self.use_relative_position_embeddings = config.use_relative_position_embeddings
+ if self.use_relative_position_embeddings:
+ if input_size is None:
+ raise ValueError("Input size must be provided if using relative positional encoding.")
+
+ # initialize relative positional embeddings
+ self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))
+ self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))
+
+ def get_rel_pos(self, q_size: int, k_size: int, rel_pos: torch.Tensor) -> torch.Tensor:
+ """
+ Get relative positional embeddings according to the relative positions of
+ query and key sizes.
+
+ Args:
+ q_size (int):
+ size of the query.
+ k_size (int):
+ size of key k.
+ rel_pos (`torch.Tensor`):
+ relative position embeddings (L, channel).
+
+ Returns:
+ Extracted positional embeddings according to relative positions.
+ """
+ max_rel_dist = int(2 * max(q_size, k_size) - 1)
+ # Interpolate rel pos.
+ rel_pos_resized = F.interpolate(
+ rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),
+ size=max_rel_dist,
+ mode="linear",
+ )
+ rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)
+
+ # Scale the coords with short length if shapes for q and k are different.
+ q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
+ k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
+ relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)
+
+ return rel_pos_resized[relative_coords.long()]
+
+ def add_decomposed_rel_pos(
+ self,
+ attn: torch.Tensor,
+ query: torch.Tensor,
+ rel_pos_h: torch.Tensor,
+ rel_pos_w: torch.Tensor,
+ q_size: Tuple[int, int],
+ k_size: Tuple[int, int],
+ ) -> torch.Tensor:
+ """
+ Calculate decomposed Relative Positional Embeddings from the MViTv2 paper.
+ https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py
+
+ Args:
+ attn (`torch.Tensor`):
+ attention map.
+ query (`torch.Tensor`):
+ query q in the attention layer with shape (batch_size, query_height * query_width, channel).
+ rel_pos_h (`torch.Tensor`):
+ relative position embeddings (Lh, channel) for height axis.
+ rel_pos_w (`torch.Tensor`):
+ relative position embeddings (Lw, channel) for width axis.
+ q_size (tuple):
+ spatial sequence size of query q with (query_height, query_width).
+ k_size (tuple):
+ spatial sequence size of key k with (key_height, key_width).
+
+ Returns:
+ attn (`torch.Tensor`):
+ attention map with added relative positional embeddings.
+ """
+ query_height, query_width = q_size
+ key_height, key_width = k_size
+ relative_position_height = self.get_rel_pos(query_height, key_height, rel_pos_h)
+ relative_position_width = self.get_rel_pos(query_width, key_width, rel_pos_w)
+
+ batch_size, _, dim = query.shape
+ reshaped_query = query.reshape(batch_size, query_height, query_width, dim)
+ rel_h = torch.einsum("bhwc,hkc->bhwk", reshaped_query, relative_position_height)
+ rel_w = torch.einsum("bhwc,wkc->bhwk", reshaped_query, relative_position_width)
+ attn = attn.reshape(batch_size, query_height, query_width, key_height, key_width)
+ attn = attn + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]
+ attn = attn.reshape(batch_size, query_height * query_width, key_height * key_width)
+ return attn
+
+ def forward(self, hidden_states: torch.Tensor, output_attentions=False) -> torch.Tensor:
+ batch_size, height, width, _ = hidden_states.shape
+ # qkv with shape (3, batch_size, nHead, height * width, channel)
+ qkv = (
+ self.qkv(hidden_states)
+ .reshape(batch_size, height * width, 3, self.num_attention_heads, -1)
+ .permute(2, 0, 3, 1, 4)
+ )
+ # q, k, v with shape (batch_size * nHead, height * width, channel)
+ query, key, value = qkv.reshape(3, batch_size * self.num_attention_heads, height * width, -1).unbind(0)
+
+ attn_weights = (query * self.scale) @ key.transpose(-2, -1)
+
+ if self.use_relative_position_embeddings:
+ attn_weights = self.add_decomposed_rel_pos(
+ attn_weights, query, self.rel_pos_h, self.rel_pos_w, (height, width), (height, width)
+ )
+
+ attn_weights = torch.nn.functional.softmax(attn_weights, dtype=torch.float32, dim=-1).to(query.dtype)
+
+ if output_attentions:
+ # this operation is a bit awkward, but it's required to
+ # make sure that attn_weights keeps its gradient.
+ # In order to do so, attn_weights have to be reshaped
+ # twice and have to be reused in the following computation
+ attn_weights_reshaped = attn_weights.view(batch_size, self.num_attention_heads, height * width, -1)
+ attn_weights = attn_weights_reshaped.view(batch_size * self.num_attention_heads, height * width, -1)
+ else:
+ attn_weights_reshaped = None
+
+ attn_output = (attn_weights @ value).reshape(batch_size, self.num_attention_heads, height, width, -1)
+ attn_output = attn_output.permute(0, 2, 3, 1, 4).reshape(batch_size, height, width, -1)
+
+ attn_output = self.proj(attn_output)
+
+ return (attn_output, attn_weights_reshaped)
+
+
+# Copied from transformers.models.sam.modeling_sam.SamMLPBlock with SamMLPBlock->SegGptMlp
+class SegGptMlp(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.lin1 = nn.Linear(config.hidden_size, config.mlp_dim)
+ self.lin2 = nn.Linear(config.mlp_dim, config.hidden_size)
+ self.act = ACT2FN[config.hidden_act]
+
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+ hidden_states = self.lin1(hidden_states)
+ hidden_states = self.act(hidden_states)
+ hidden_states = self.lin2(hidden_states)
+ return hidden_states
+
+
+# Copied from transformers.models.beit.modeling_beit.drop_path
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
+ """
+ Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+
+ Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
+ however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
+ See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
+ layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
+ argument.
+ """
+ if drop_prob == 0.0 or not training:
+ return input
+ keep_prob = 1 - drop_prob
+ shape = (input.shape[0],) + (1,) * (input.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
+ random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
+ random_tensor.floor_() # binarize
+ output = input.div(keep_prob) * random_tensor
+ return output
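+
+
+# Illustrative sketch (hypothetical values): with `drop_prob=0.5` and `training=True`, roughly half of the
+# samples are zeroed and the survivors are scaled by 1 / keep_prob, so the expected output matches the input:
+#
+#   x = torch.ones(1000, 16)
+#   y = drop_path(x, drop_prob=0.5, training=True)
+#   y.mean()  # expected to be close to 1.0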
+
+
+# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->SegGpt
+class SegGptDropPath(nn.Module):
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+
+ def __init__(self, drop_prob: Optional[float] = None) -> None:
+ super().__init__()
+ self.drop_prob = drop_prob
+
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+ return drop_path(hidden_states, self.drop_prob, self.training)
+
+ def extra_repr(self) -> str:
+ return "p={}".format(self.drop_prob)
+
+
+class SegGptLayer(nn.Module):
+ def __init__(self, config: SegGptConfig, drop_path_rate: float) -> None:
+ super().__init__()
+ self.attention = SegGptAttention(config)
+ self.mlp = SegGptMlp(config)
+ self.drop_path = SegGptDropPath(drop_path_rate) if drop_path_rate > 0.0 else nn.Identity()
+ self.layernorm_before = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ self.layernorm_after = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ ensemble_cond: int,
+ feature_ensemble: bool = False,
+ output_attentions: bool = False,
+ ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
+ self_attention_outputs = self.attention(
+ self.layernorm_before(hidden_states), # in SegGpt, layernorm is applied before self-attention
+ output_attentions=output_attentions,
+ )
+ attention_output = self_attention_outputs[0]
+ outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
+
+ if feature_ensemble and attention_output.shape[0] // 2 >= ensemble_cond:
+ prompt, inputs = attention_output.split(attention_output.shape[1] // 2, dim=1)
+ if ensemble_cond == 2:
+ num_prompts = attention_output.shape[0] // 2
+ inputs = inputs.reshape(2, num_prompts, -1)
+ inputs = inputs.mean(dim=1, keepdim=True).expand_as(inputs)
+ inputs = inputs.reshape(*prompt.shape)
+ else:
+ inputs = inputs.mean(dim=0, keepdim=True).expand_as(inputs)
+ attention_output = torch.cat([prompt, inputs], dim=1)
+
+ # first residual connection
+ hidden_states = self.drop_path(attention_output) + hidden_states
+ residual = hidden_states
+
+ hidden_states = self.layernorm_after(hidden_states)
+ hidden_states = self.mlp(hidden_states)
+ hidden_states = residual + self.drop_path(hidden_states)
+
+ outputs = (hidden_states,) + outputs
+
+ return outputs
+
+
+class SegGptEncoder(nn.Module):
+ def __init__(self, config: SegGptConfig) -> None:
+ super().__init__()
+ self.config = config
+ dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+ self.layers = nn.ModuleList([SegGptLayer(config, dpr[i]) for i in range(config.num_hidden_layers)])
+ self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ self.gradient_checkpointing = False
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ feature_ensemble: bool = False,
+ output_attentions: bool = False,
+ output_hidden_states: bool = False,
+ return_dict: bool = True,
+ ) -> Union[tuple, SegGptEncoderOutput]:
+ all_hidden_states = () if output_hidden_states else None
+ all_self_attentions = () if output_attentions else None
+ intermediate_hidden_states = []
+
+ for i, layer_module in enumerate(self.layers):
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (hidden_states,)
+
+ # Condition to check if we have the appropriate number of prompts to ensemble
+ ensemble_cond = 2 if self.config.merge_index > i else 1
+
+ if self.gradient_checkpointing and self.training:
+ layer_outputs = self._gradient_checkpointing_func(
+ layer_module.__call__,
+ hidden_states,
+ ensemble_cond,
+ feature_ensemble,
+ output_attentions,
+ )
+ else:
+ layer_outputs = layer_module(hidden_states, ensemble_cond, feature_ensemble, output_attentions)
+
+ hidden_states = layer_outputs[0]
+
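+ # At `merge_index`, the prompt half and the input half of the batch (stacked along the batch
+ # dimension in `SegGptEmbeddings.forward`) are averaged into a single representation.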
+ if i == self.config.merge_index:
+ hidden_states = (
+ hidden_states[: hidden_states.shape[0] // 2] + hidden_states[hidden_states.shape[0] // 2 :]
+ ) * 0.5
+
+ if i in self.config.intermediate_hidden_state_indices:
+ intermediate_hidden_states.append(self.layernorm(hidden_states))
+
+ if output_attentions:
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
+
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (hidden_states,)
+
+ if not return_dict:
+ return tuple(
+ v
+ for v in [hidden_states, all_hidden_states, all_self_attentions, intermediate_hidden_states]
+ if v is not None
+ )
+ return SegGptEncoderOutput(
+ last_hidden_state=hidden_states,
+ hidden_states=all_hidden_states,
+ attentions=all_self_attentions,
+ intermediate_hidden_states=intermediate_hidden_states,
+ )
+
+
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextLayerNorm with ConvNext->SegGpt
+class SegGptLayerNorm(nn.Module):
+ r"""LayerNorm that supports two data formats: channels_last (default) or channels_first.
+ The ordering of the dimensions in the inputs. channels_last corresponds to inputs with shape (batch_size, height,
+ width, channels) while channels_first corresponds to inputs with shape (batch_size, channels, height, width).
+ """
+
+ def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
+ super().__init__()
+ self.weight = nn.Parameter(torch.ones(normalized_shape))
+ self.bias = nn.Parameter(torch.zeros(normalized_shape))
+ self.eps = eps
+ self.data_format = data_format
+ if self.data_format not in ["channels_last", "channels_first"]:
+ raise NotImplementedError(f"Unsupported data format: {self.data_format}")
+ self.normalized_shape = (normalized_shape,)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ if self.data_format == "channels_last":
+ x = torch.nn.functional.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
+ elif self.data_format == "channels_first":
+ input_dtype = x.dtype
+ x = x.float()
+ u = x.mean(1, keepdim=True)
+ s = (x - u).pow(2).mean(1, keepdim=True)
+ x = (x - u) / torch.sqrt(s + self.eps)
+ x = x.to(dtype=input_dtype)
+ x = self.weight[:, None, None] * x + self.bias[:, None, None]
+ return x
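+
+
+# Illustrative sketch (hypothetical tensor `x` of shape (2, 8, 4, 4)): for `data_format="channels_first"`,
+# the normalization above matches permuting to channels-last and calling `torch.nn.functional.layer_norm`:
+#
+#   ln = SegGptLayerNorm(8, data_format="channels_first")
+#   reference = F.layer_norm(x.permute(0, 2, 3, 1), (8,), ln.weight, ln.bias, ln.eps)
+#   torch.allclose(ln(x), reference.permute(0, 3, 1, 2), atol=1e-5)  # expected True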
+
+
+class SegGptDecoderHead(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.conv = nn.Conv2d(
+ config.decoder_hidden_size,
+ config.decoder_hidden_size,
+ kernel_size=3,
+ padding=1,
+ )
+ self.layernorm = SegGptLayerNorm(
+ normalized_shape=config.decoder_hidden_size, eps=config.layer_norm_eps, data_format="channels_first"
+ )
+ self.act_fct = ACT2FN[config.hidden_act]
+ self.head = nn.Conv2d(config.decoder_hidden_size, 3, kernel_size=1, bias=True) # decoder to patch
+
+ def forward(self, hidden_states: torch.FloatTensor):
+ hidden_states = self.conv(hidden_states)
+ hidden_states = self.layernorm(hidden_states)
+ hidden_states = self.act_fct(hidden_states)
+ hidden_states = self.head(hidden_states)
+
+ return hidden_states
+
+
+class SegGptDecoder(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.decoder_embed = nn.Linear(
+ config.hidden_size * len(config.intermediate_hidden_state_indices),
+ config.patch_size**2 * config.decoder_hidden_size,
+ bias=True,
+ )
+ self.decoder_pred = SegGptDecoderHead(config)
+ self.patch_size = config.patch_size
+ self.decoder_hidden_size = config.decoder_hidden_size
+ self.config = config
+
+ def _reshape_hidden_states(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+ batch_size, patch_height, patch_width, _ = hidden_states.shape
+ hidden_states = hidden_states.reshape(
+ batch_size, patch_height, patch_width, self.patch_size, self.patch_size, self.decoder_hidden_size
+ )
+ hidden_states = hidden_states.permute(0, 5, 1, 3, 2, 4)
+ hidden_states = hidden_states.reshape(
+ shape=(batch_size, -1, patch_height * self.patch_size, patch_width * self.patch_size)
+ )
+
+ return hidden_states
+
+ def forward(self, hidden_states: torch.FloatTensor):
+ hidden_states = self.decoder_embed(hidden_states)
+ hidden_states = self._reshape_hidden_states(hidden_states)
+ hidden_states = self.decoder_pred(hidden_states)
+
+ return hidden_states
+
+
+class SegGptPreTrainedModel(PreTrainedModel):
+ """
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+ models.
+ """
+
+ config_class = SegGptConfig
+ base_model_prefix = "model"
+ main_input_name = "pixel_values"
+ supports_gradient_checkpointing = True
+ _no_split_modules = ["SegGptEmbeddings", "SegGptLayer"]
+
+ def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
+ """Initialize the weights"""
+ std = self.config.initializer_range
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
+ # Upcast the input in `fp32` and cast it back to desired `dtype` to avoid
+ # `trunc_normal_cpu` not implemented in `half` issues
+ module.weight.data = nn.init.trunc_normal_(module.weight.data.to(torch.float32), mean=0.0, std=std).to(
+ module.weight.dtype
+ )
+ if module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.LayerNorm):
+ module.bias.data.zero_()
+ module.weight.data.fill_(1.0)
+ elif isinstance(module, SegGptAttention):
+ module.rel_pos_h.data = nn.init.trunc_normal_(
+ module.rel_pos_h.data.to(torch.float32),
+ mean=0.0,
+ std=std,
+ ).to(module.rel_pos_h.dtype)
+
+ module.rel_pos_w.data = nn.init.trunc_normal_(
+ module.rel_pos_w.data.to(torch.float32),
+ mean=0.0,
+ std=std,
+ ).to(module.rel_pos_w.dtype)
+
+ elif isinstance(module, SegGptEmbeddings):
+ module.position_embeddings.data = nn.init.trunc_normal_(
+ module.position_embeddings.data.to(torch.float32),
+ mean=0.0,
+ std=std,
+ ).to(module.position_embeddings.dtype)
+
+ torch.nn.init.normal_(module.mask_token, std=std)
+ torch.nn.init.normal_(module.segment_token_input, std=std)
+ torch.nn.init.normal_(module.segment_token_prompt, std=std)
+ torch.nn.init.normal_(module.type_token_semantic, std=std)
+ torch.nn.init.normal_(module.type_token_instance, std=std)
+
+
+SEGGPT_START_DOCSTRING = r"""
+ This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+ as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
+ behavior.
+
+ Parameters:
+ config ([`SegGptConfig`]): Model configuration class with all the parameters of the model.
+ Initializing with a config file does not load the weights associated with the model, only the
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+SEGGPT_INPUTS_DOCSTRING = r"""
+ Args:
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`SegGptImageProcessor.__call__`]
+ for details.
+
+ prompt_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Prompt pixel values. Prompt pixel values can be obtained using [`AutoImageProcessor`]. See
+ [`SegGptImageProcessor.__call__`] for details.
+
+ prompt_masks (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Prompt mask. Prompt mask can be obtained using [`AutoImageProcessor`]. See [`SegGptImageProcessor.__call__`] for
+ details.
+
+ bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`, *optional*):
+ Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
+
+ feature_ensemble (`bool`, *optional*):
+ Boolean indicating whether to use feature ensemble or not. If `True`, the model will use feature ensemble
+ if we have at least two prompts. If `False`, the model will not use feature ensemble. This argument should
+ be considered when doing few-shot inference on an input image, i.e. when more than one prompt is used for the same image.
+
+ embedding_type (`str`, *optional*):
+ Embedding type, indicating whether the prompt is a semantic or instance embedding. Can be either
+ `"instance"` or `"semantic"`.
+
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+ tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+ "The bare SegGpt Model transformer outputting raw hidden-states without any specific head on top.",
+ SEGGPT_START_DOCSTRING,
+)
+class SegGptModel(SegGptPreTrainedModel):
+ def __init__(self, config: SegGptConfig):
+ super().__init__(config)
+ self.config = config
+
+ self.embeddings = SegGptEmbeddings(config)
+ self.encoder = SegGptEncoder(config)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self) -> SegGptPatchEmbeddings:
+ return self.embeddings.patch_embeddings
+
+ def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
+ """
+ Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
+ class PreTrainedModel
+ """
+ for layer, heads in heads_to_prune.items():
+ self.encoder.layer[layer].attention.prune_heads(heads)
+
+ @add_start_docstrings_to_model_forward(SEGGPT_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=SegGptEncoderOutput, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ pixel_values: torch.Tensor,
+ prompt_pixel_values: torch.Tensor,
+ prompt_masks: torch.Tensor,
+ bool_masked_pos: Optional[torch.BoolTensor] = None,
+ feature_ensemble: Optional[bool] = None,
+ embedding_type: Optional[str] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, SegGptEncoderOutput]:
+ r"""
+ Returns:
+
+ Examples:
+
+ ```python
+ >>> from transformers import SegGptImageProcessor, SegGptModel
+ >>> from PIL import Image
+ >>> import requests
+
+ >>> image_input_url = "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_2.jpg"
+ >>> image_prompt_url = "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_1.jpg"
+ >>> mask_prompt_url = "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_1_target.png"
+
+ >>> image_input = Image.open(requests.get(image_input_url, stream=True).raw)
+ >>> image_prompt = Image.open(requests.get(image_prompt_url, stream=True).raw)
+ >>> mask_prompt = Image.open(requests.get(mask_prompt_url, stream=True).raw).convert("L")
+
+ >>> checkpoint = "BAAI/seggpt-vit-large"
+ >>> model = SegGptModel.from_pretrained(checkpoint)
+ >>> image_processor = SegGptImageProcessor.from_pretrained(checkpoint)
+
+ >>> inputs = image_processor(images=image_input, prompt_images=image_prompt, prompt_masks=mask_prompt, return_tensors="pt")
+
+ >>> outputs = model(**inputs)
+ >>> list(outputs.last_hidden_state.shape)
+ [1, 56, 28, 1024]
+ ```
+ """
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ feature_ensemble = feature_ensemble if feature_ensemble is not None else False
+
+ expected_dtype = self.embeddings.patch_embeddings.projection.weight.dtype
+ pixel_values = pixel_values.to(expected_dtype)
+ prompt_pixel_values = prompt_pixel_values.to(expected_dtype)
+
+ # Prepare inputs
+ pixel_values = torch.cat((prompt_pixel_values, pixel_values), dim=2)
+ prompt_pixel_values = torch.cat((prompt_masks, prompt_masks), dim=2)
+
+ # We concatenate on the height axis so SegGPT can handle them as a single image, hence we need to mask the portion
+ # of the prompt pixels that is reserved for the prediction, as they don't add any information.
+ if bool_masked_pos is None:
+ num_patches = self.embeddings.patch_embeddings.num_patches
+ bool_masked_pos = torch.zeros(num_patches, dtype=torch.bool).to(pixel_values.device)
+ bool_masked_pos[num_patches // 2 :] = 1
+ bool_masked_pos = bool_masked_pos.unsqueeze(0)
+
+ embedding_output = self.embeddings(
+ pixel_values, prompt_pixel_values, embedding_type=embedding_type, bool_masked_pos=bool_masked_pos
+ )
+
+ encoder_outputs = self.encoder(
+ embedding_output,
+ feature_ensemble=feature_ensemble,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ return encoder_outputs
+
+
+def patchify(tensor: torch.Tensor, patch_size: int) -> torch.Tensor:
+ batch_size, num_channels, height, width = tensor.shape
+ patch_height = height // patch_size
+ patch_width = width // patch_size
+
+ tensor = tensor.reshape(shape=(batch_size, num_channels, patch_height, patch_size, patch_width, patch_size))
+ tensor = tensor.permute(0, 2, 4, 3, 5, 1)
+ tensor = tensor.reshape(shape=(batch_size, patch_height * patch_width, patch_size**2 * 3))
+
+ return tensor
+
+
+def unpatchify(tensor: torch.Tensor, patch_height: int, patch_width: int) -> torch.Tensor:
+ batch_size = tensor.shape[0]
+ patch_size = int((tensor.shape[-1] / 3) ** 0.5)
+ if patch_height * patch_width != tensor.shape[1]:
+ raise ValueError(f"Number of patches {tensor.shape[1]} does not match patch height and width.")
+
+ tensor = tensor.reshape(shape=(batch_size, patch_height, patch_width, patch_size, patch_size, 3))
+ tensor = tensor.permute(0, 5, 1, 3, 2, 4)
+ tensor = tensor.reshape(shape=(batch_size, 3, patch_height * patch_size, patch_width * patch_size))
+
+ return tensor
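+
+
+# Shape sketch (illustrative only): `patchify` and `unpatchify` are inverses for 3-channel tensors.
+# With `patch_size=16`:
+#
+#   images = torch.rand(1, 3, 896, 448)
+#   patches = patchify(images, patch_size=16)                        # (1, 56 * 28, 16 * 16 * 3)
+#   restored = unpatchify(patches, patch_height=56, patch_width=28)  # (1, 3, 896, 448)
+#   torch.equal(images, restored)  # True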
+
+
+class SegGptLoss(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.beta = config.beta
+ self.patch_size = config.patch_size
+
+ def forward(
+ self,
+ pixel_values: torch.FloatTensor,
+ prompt_pixel_values: torch.FloatTensor,
+ pred_masks: torch.FloatTensor,
+ labels: torch.FloatTensor,
+ bool_masked_pos: torch.BoolTensor,
+ ):
+ """Computes the L1 loss between the predicted masks and the ground truth masks.
+
+ Args:
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, 2*height, width)`):
+ Concatenated pixel values from prompt and input images.
+
+ prompt_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, 2*height, width)`):
+ Concatenated pixel values from mask prompt.
+
+ pred_masks (`torch.FloatTensor` of shape `(batch_size, num_channels, 2*height, width)`):
+ Predicted masks.
+
+ labels (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Ground truth mask for input images.
+
+ bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`):
+ Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
+
+ Returns:
+ `torch.FloatTensor`: The mean L1 loss between the predicted masks and the ground truth masks.
+ """
+ mask = bool_masked_pos[:, :, None].repeat(1, 1, self.patch_size**2 * 3)
+ mask = unpatchify(mask, pixel_values.shape[2] // self.patch_size, pixel_values.shape[3] // self.patch_size)
+ # Change the dummy mask in prompt_pixel_values to the label values
+ prompt_pixel_values = prompt_pixel_values.clone()
+ prompt_pixel_values[:, :, prompt_pixel_values.shape[2] // 2 :, :] = labels
+ loss = F.smooth_l1_loss(pred_masks, prompt_pixel_values, reduction="none", beta=self.beta)
+ loss = (loss * mask).sum() / mask.sum() # mean loss on removed patches
+
+ return loss
+
+
+@add_start_docstrings(
+ "SegGpt model with a decoder on top for one-shot image segmentation.",
+ SEGGPT_START_DOCSTRING,
+)
+class SegGptForImageSegmentation(SegGptPreTrainedModel):
+ def __init__(self, config: SegGptConfig):
+ super().__init__(config)
+ self.config = config
+
+ self.model = SegGptModel(config)
+ self.decoder = SegGptDecoder(config)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ @add_start_docstrings_to_model_forward(SEGGPT_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=SegGptImageSegmentationOutput, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ pixel_values: torch.Tensor,
+ prompt_pixel_values: torch.Tensor,
+ prompt_masks: torch.Tensor,
+ bool_masked_pos: Optional[torch.BoolTensor] = None,
+ feature_ensemble: Optional[bool] = None,
+ embedding_type: Optional[str] = None,
+ labels: Optional[torch.FloatTensor] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, SegGptImageSegmentationOutput]:
+ r"""
+ labels (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`, *optional*):
+ Ground truth mask for input images.
+
+ Returns:
+
+ Examples:
+
+ ```python
+ >>> from transformers import SegGptImageProcessor, SegGptForImageSegmentation
+ >>> from PIL import Image
+ >>> import requests
+
+ >>> image_input_url = "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_2.jpg"
+ >>> image_prompt_url = "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_1.jpg"
+ >>> mask_prompt_url = "https://raw.githubusercontent.com/baaivision/Painter/main/SegGPT/SegGPT_inference/examples/hmbb_1_target.png"
+
+ >>> image_input = Image.open(requests.get(image_input_url, stream=True).raw)
+ >>> image_prompt = Image.open(requests.get(image_prompt_url, stream=True).raw)
+ >>> mask_prompt = Image.open(requests.get(mask_prompt_url, stream=True).raw).convert("L")
+
+ >>> checkpoint = "BAAI/seggpt-vit-large"
+ >>> model = SegGptForImageSegmentation.from_pretrained(checkpoint)
+ >>> image_processor = SegGptImageProcessor.from_pretrained(checkpoint)
+
+ >>> inputs = image_processor(images=image_input, prompt_images=image_prompt, prompt_masks=mask_prompt, return_tensors="pt")
+ >>> outputs = model(**inputs)
+ >>> result = image_processor.post_process_semantic_segmentation(outputs, target_sizes=[image_input.size[::-1]])[0]
+ >>> print(list(result.shape))
+ [170, 297]
+ ```
+ """
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ if bool_masked_pos is None:
+ num_patches = self.model.embeddings.patch_embeddings.num_patches
+ bool_masked_pos = torch.zeros(num_patches, dtype=torch.bool).to(pixel_values.device)
+ bool_masked_pos[num_patches // 2 :] = 1
+ bool_masked_pos = bool_masked_pos.unsqueeze(0)
+
+ outputs = self.model(
+ pixel_values=pixel_values,
+ prompt_pixel_values=prompt_pixel_values,
+ prompt_masks=prompt_masks,
+ bool_masked_pos=bool_masked_pos,
+ feature_ensemble=feature_ensemble,
+ embedding_type=embedding_type,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ intermediate_hidden_states = outputs.intermediate_hidden_states if return_dict else outputs[-1]
+ intermediate_hidden_states = torch.cat(intermediate_hidden_states, dim=-1)
+ pred_masks = self.decoder(intermediate_hidden_states)
+
+ loss = None
+ if labels is not None:
+ loss_fn = SegGptLoss(self.config)
+ loss = loss_fn(pixel_values, prompt_pixel_values, pred_masks, labels, bool_masked_pos)
+
+ if not return_dict:
+ output = (pred_masks,)
+ if output_hidden_states:
+ output = output + (outputs[1],)
+
+ if output_attentions:
+ idx = 2 if output_hidden_states else 1
+ output = output + (outputs[idx],)
+
+ if loss is not None:
+ output = (loss,) + output
+ return output
+
+ return SegGptImageSegmentationOutput(
+ loss=loss,
+ pred_masks=pred_masks,
+ hidden_states=outputs.hidden_states,
+ attentions=outputs.attentions,
+ )
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index dd2e50c67d0e3f..3ba08016855cb3 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -7556,6 +7556,30 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
+SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class SegGptForImageSegmentation(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class SegGptModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class SegGptPreTrainedModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
SEW_PRETRAINED_MODEL_ARCHIVE_LIST = None
diff --git a/src/transformers/utils/dummy_vision_objects.py b/src/transformers/utils/dummy_vision_objects.py
index 89366aba5081cd..25a35558fe9c63 100644
--- a/src/transformers/utils/dummy_vision_objects.py
+++ b/src/transformers/utils/dummy_vision_objects.py
@@ -471,6 +471,13 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
+class SegGptImageProcessor(metaclass=DummyObject):
+ _backends = ["vision"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["vision"])
+
+
class SiglipImageProcessor(metaclass=DummyObject):
_backends = ["vision"]
diff --git a/tests/models/seggpt/__init__.py b/tests/models/seggpt/__init__.py
new file mode 100644
index 00000000000000..e69de29bb2d1d6
diff --git a/tests/models/seggpt/test_image_processing_seggpt.py b/tests/models/seggpt/test_image_processing_seggpt.py
new file mode 100644
index 00000000000000..46694d6636ea05
--- /dev/null
+++ b/tests/models/seggpt/test_image_processing_seggpt.py
@@ -0,0 +1,231 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+from datasets import load_dataset
+
+from transformers.testing_utils import require_torch, require_vision, slow
+from transformers.utils import is_torch_available, is_vision_available
+
+from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
+
+
+if is_torch_available():
+ import torch
+
+ from transformers.models.seggpt.modeling_seggpt import SegGptImageSegmentationOutput
+
+if is_vision_available():
+ from transformers import SegGptImageProcessor
+
+
+class SegGptImageProcessingTester(unittest.TestCase):
+ def __init__(
+ self,
+ parent,
+ batch_size=7,
+ num_channels=3,
+ image_size=18,
+ min_resolution=30,
+ max_resolution=400,
+ do_resize=True,
+ size=None,
+ do_normalize=True,
+ image_mean=[0.5, 0.5, 0.5],
+ image_std=[0.5, 0.5, 0.5],
+ ):
+ size = size if size is not None else {"height": 18, "width": 18}
+ self.parent = parent
+ self.batch_size = batch_size
+ self.num_channels = num_channels
+ self.image_size = image_size
+ self.min_resolution = min_resolution
+ self.max_resolution = max_resolution
+ self.do_resize = do_resize
+ self.size = size
+ self.do_normalize = do_normalize
+ self.image_mean = image_mean
+ self.image_std = image_std
+
+ def prepare_image_processor_dict(self):
+ return {
+ "image_mean": self.image_mean,
+ "image_std": self.image_std,
+ "do_normalize": self.do_normalize,
+ "do_resize": self.do_resize,
+ "size": self.size,
+ }
+
+ def expected_output_image_shape(self, images):
+ return self.num_channels, self.size["height"], self.size["width"]
+
+ def expected_post_processed_shape(self):
+ return self.size["height"] // 2, self.size["width"]
+
+ def get_fake_image_segmentation_output(self):
+ torch.manual_seed(42)
+ return SegGptImageSegmentationOutput(
+ pred_masks=torch.rand(self.batch_size, self.num_channels, self.size["height"], self.size["width"])
+ )
+
+ def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
+ return prepare_image_inputs(
+ batch_size=self.batch_size,
+ num_channels=self.num_channels,
+ min_resolution=self.min_resolution,
+ max_resolution=self.max_resolution,
+ equal_resolution=equal_resolution,
+ numpify=numpify,
+ torchify=torchify,
+ )
+
+
+def prepare_mask():
+ ds = load_dataset("EduardoPacheco/seggpt-example-data")["train"]
+ return ds[0]["mask"].convert("L")
+
+
+def prepare_img():
+ ds = load_dataset("EduardoPacheco/seggpt-example-data")["train"]
+ images = [image.convert("RGB") for image in ds["image"]]
+ masks = [image.convert("RGB") for image in ds["mask"]]
+ return images, masks
+
+
+@require_torch
+@require_vision
+class SegGptImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
+ image_processing_class = SegGptImageProcessor if is_vision_available() else None
+
+ def setUp(self):
+ self.image_processor_tester = SegGptImageProcessingTester(self)
+
+ @property
+ def image_processor_dict(self):
+ return self.image_processor_tester.prepare_image_processor_dict()
+
+ def test_image_processor_properties(self):
+ image_processing = self.image_processing_class(**self.image_processor_dict)
+ self.assertTrue(hasattr(image_processing, "image_mean"))
+ self.assertTrue(hasattr(image_processing, "image_std"))
+ self.assertTrue(hasattr(image_processing, "do_normalize"))
+ self.assertTrue(hasattr(image_processing, "do_resize"))
+ self.assertTrue(hasattr(image_processing, "size"))
+
+ def test_image_processor_from_dict_with_kwargs(self):
+ image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
+ self.assertEqual(image_processor.size, {"height": 18, "width": 18})
+
+ image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
+ self.assertEqual(image_processor.size, {"height": 42, "width": 42})
+
+ def test_image_processor_palette(self):
+ num_labels = 3
+ image_processing = self.image_processing_class(**self.image_processor_dict)
+ palette = image_processing.get_palette(num_labels)
+ self.assertEqual(len(palette), num_labels + 1)
+ self.assertEqual(palette[0], (0, 0, 0))
+
+ def test_mask_equivalence(self):
+ image_processor = SegGptImageProcessor()
+
+ mask_binary = prepare_mask()
+ mask_rgb = mask_binary.convert("RGB")
+
+ inputs_binary = image_processor(images=None, prompt_masks=mask_binary, return_tensors="pt")
+ inputs_rgb = image_processor(images=None, prompt_masks=mask_rgb, return_tensors="pt")
+
+ self.assertTrue((inputs_binary["prompt_masks"] == inputs_rgb["prompt_masks"]).all().item())
+
+ def test_mask_to_rgb(self):
+ image_processing = self.image_processing_class(**self.image_processor_dict)
+ mask = prepare_mask()
+ mask = np.array(mask)
+ mask = (mask > 0).astype(np.uint8)
+
+ def check_two_colors(image, color1=(0, 0, 0), color2=(255, 255, 255)):
+ pixels = image.transpose(1, 2, 0).reshape(-1, 3)
+ unique_colors = np.unique(pixels, axis=0)
+ if len(unique_colors) == 2 and (color1 in unique_colors) and (color2 in unique_colors):
+ return True
+ else:
+ return False
+
+ num_labels = 1
+ palette = image_processing.get_palette(num_labels)
+
+        # Without a palette, mask_to_rgb should only repeat the class indices map across channels, hence only (0,0,0) and (1,1,1)
+ mask_duplicated = image_processing.mask_to_rgb(mask)
+ # Mask using palette, since only 1 class is present we have colors (0,0,0) and (255,255,255)
+ mask_painted = image_processing.mask_to_rgb(mask, palette=palette)
+
+ self.assertTrue(check_two_colors(mask_duplicated, color2=(1, 1, 1)))
+ self.assertTrue(check_two_colors(mask_painted, color2=(255, 255, 255)))
+
+ def test_post_processing_semantic_segmentation(self):
+ image_processor = self.image_processing_class(**self.image_processor_dict)
+ outputs = self.image_processor_tester.get_fake_image_segmentation_output()
+ post_processed = image_processor.post_process_semantic_segmentation(outputs)
+
+ self.assertEqual(len(post_processed), self.image_processor_tester.batch_size)
+
+ expected_semantic_map_shape = self.image_processor_tester.expected_post_processed_shape()
+ self.assertEqual(post_processed[0].shape, expected_semantic_map_shape)
+
+ @slow
+ def test_pixel_values(self):
+ images, masks = prepare_img()
+ input_image = images[1]
+ prompt_image = images[0]
+ prompt_mask = masks[0]
+
+ image_processor = SegGptImageProcessor.from_pretrained("BAAI/seggpt-vit-large")
+
+ inputs = image_processor(
+ images=input_image, prompt_images=prompt_image, prompt_masks=prompt_mask, return_tensors="pt"
+ )
+
+ # Verify pixel values
+ expected_prompt_pixel_values = torch.tensor(
+ [
+ [[-0.6965, -0.6965, -0.6965], [-0.6965, -0.6965, -0.6965], [-0.6965, -0.6965, -0.6965]],
+ [[1.6583, 1.6583, 1.6583], [1.6583, 1.6583, 1.6583], [1.6583, 1.6583, 1.6583]],
+ [[2.3088, 2.3088, 2.3088], [2.3088, 2.3088, 2.3088], [2.3088, 2.3088, 2.3088]],
+ ]
+ )
+
+ expected_pixel_values = torch.tensor(
+ [
+ [[1.6324, 1.6153, 1.5810], [1.6153, 1.5982, 1.5810], [1.5810, 1.5639, 1.5639]],
+ [[1.2731, 1.2556, 1.2206], [1.2556, 1.2381, 1.2031], [1.2206, 1.2031, 1.1681]],
+ [[1.6465, 1.6465, 1.6465], [1.6465, 1.6465, 1.6465], [1.6291, 1.6291, 1.6291]],
+ ]
+ )
+
+ expected_prompt_masks = torch.tensor(
+ [
+ [[-2.1179, -2.1179, -2.1179], [-2.1179, -2.1179, -2.1179], [-2.1179, -2.1179, -2.1179]],
+ [[-2.0357, -2.0357, -2.0357], [-2.0357, -2.0357, -2.0357], [-2.0357, -2.0357, -2.0357]],
+ [[-1.8044, -1.8044, -1.8044], [-1.8044, -1.8044, -1.8044], [-1.8044, -1.8044, -1.8044]],
+ ]
+ )
+
+ self.assertTrue(torch.allclose(inputs.pixel_values[0, :, :3, :3], expected_pixel_values, atol=1e-4))
+ self.assertTrue(
+ torch.allclose(inputs.prompt_pixel_values[0, :, :3, :3], expected_prompt_pixel_values, atol=1e-4)
+ )
+ self.assertTrue(torch.allclose(inputs.prompt_masks[0, :, :3, :3], expected_prompt_masks, atol=1e-4))
diff --git a/tests/models/seggpt/test_modeling_seggpt.py b/tests/models/seggpt/test_modeling_seggpt.py
new file mode 100644
index 00000000000000..0cb36ea534a7f0
--- /dev/null
+++ b/tests/models/seggpt/test_modeling_seggpt.py
@@ -0,0 +1,339 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch SegGpt model. """
+
+
+import inspect
+import unittest
+
+from datasets import load_dataset
+
+from transformers import SegGptConfig
+from transformers.testing_utils import (
+ require_torch,
+ require_vision,
+ slow,
+ torch_device,
+)
+from transformers.utils import cached_property, is_torch_available, is_vision_available
+
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, floats_tensor
+from ...test_pipeline_mixin import PipelineTesterMixin
+
+
+if is_torch_available():
+ import torch
+ from torch import nn
+
+ from transformers import SegGptForImageSegmentation, SegGptModel
+ from transformers.models.seggpt.modeling_seggpt import SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST
+
+
+if is_vision_available():
+ from transformers import SegGptImageProcessor
+
+
+class SegGptModelTester:
+ def __init__(
+ self,
+ parent,
+ batch_size=2,
+ image_size=30,
+ patch_size=2,
+ num_channels=3,
+ is_training=False,
+ use_labels=True,
+ hidden_size=32,
+ num_hidden_layers=2,
+ num_attention_heads=4,
+ hidden_act="gelu",
+ hidden_dropout_prob=0.1,
+ attention_probs_dropout_prob=0.1,
+ initializer_range=0.02,
+ mlp_ratio=2.0,
+ merge_index=0,
+ intermediate_hidden_state_indices=[1],
+ pretrain_image_size=10,
+ decoder_hidden_size=10,
+ ):
+ self.parent = parent
+ self.batch_size = batch_size
+ self.image_size = image_size
+ self.patch_size = patch_size
+ self.num_channels = num_channels
+ self.is_training = is_training
+ self.use_labels = use_labels
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.hidden_act = hidden_act
+ self.hidden_dropout_prob = hidden_dropout_prob
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
+ self.initializer_range = initializer_range
+ self.mlp_ratio = mlp_ratio
+ self.merge_index = merge_index
+ self.intermediate_hidden_state_indices = intermediate_hidden_state_indices
+ self.pretrain_image_size = pretrain_image_size
+ self.decoder_hidden_size = decoder_hidden_size
+
+ # in SegGpt, the seq length equals the number of patches (we don't use the [CLS] token)
+ num_patches = (image_size // patch_size) ** 2
+ self.seq_length = num_patches
+
+ def prepare_config_and_inputs(self):
+ pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size // 2, self.image_size])
+ prompt_pixel_values = floats_tensor(
+ [self.batch_size, self.num_channels, self.image_size // 2, self.image_size]
+ )
+ prompt_masks = floats_tensor([self.batch_size, self.num_channels, self.image_size // 2, self.image_size])
+
+ labels = None
+ if self.use_labels:
+ labels = floats_tensor([self.batch_size, self.num_channels, self.image_size // 2, self.image_size])
+
+ config = self.get_config()
+
+ return config, pixel_values, prompt_pixel_values, prompt_masks, labels
+
+ def get_config(self):
+ return SegGptConfig(
+ image_size=self.image_size,
+ patch_size=self.patch_size,
+ num_channels=self.num_channels,
+ hidden_size=self.hidden_size,
+ num_hidden_layers=self.num_hidden_layers,
+ num_attention_heads=self.num_attention_heads,
+ hidden_act=self.hidden_act,
+ hidden_dropout_prob=self.hidden_dropout_prob,
+ initializer_range=self.initializer_range,
+ mlp_ratio=self.mlp_ratio,
+ merge_index=self.merge_index,
+ intermediate_hidden_state_indices=self.intermediate_hidden_state_indices,
+ pretrain_image_size=self.pretrain_image_size,
+ decoder_hidden_size=self.decoder_hidden_size,
+ )
+
+ def create_and_check_model(self, config, pixel_values, prompt_pixel_values, prompt_masks, labels):
+ model = SegGptModel(config=config)
+ model.to(torch_device)
+ model.eval()
+ result = model(pixel_values, prompt_pixel_values, prompt_masks)
+ self.parent.assertEqual(
+ result.last_hidden_state.shape,
+ (
+ self.batch_size,
+ self.image_size // self.patch_size,
+ self.image_size // self.patch_size,
+ self.hidden_size,
+ ),
+ )
+
+ def prepare_config_and_inputs_for_common(self):
+ config_and_inputs = self.prepare_config_and_inputs()
+ (
+ config,
+ pixel_values,
+ prompt_pixel_values,
+ prompt_masks,
+ labels,
+ ) = config_and_inputs
+ inputs_dict = {
+ "pixel_values": pixel_values,
+ "prompt_pixel_values": prompt_pixel_values,
+ "prompt_masks": prompt_masks,
+ }
+ return config, inputs_dict
+
+
+@require_torch
+class SegGptModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ """
+ Here we also overwrite some of the tests of test_modeling_common.py, as SegGpt does not use input_ids, inputs_embeds,
+ attention_mask and seq_length.
+ """
+
+ all_model_classes = (SegGptModel, SegGptForImageSegmentation) if is_torch_available() else ()
+ fx_compatible = False
+
+ test_pruning = False
+ test_resize_embeddings = False
+ test_head_masking = False
+ test_torchscript = False
+ pipeline_model_mapping = (
+ {"feature-extraction": SegGptModel, "mask-generation": SegGptModel} if is_torch_available() else {}
+ )
+
+ def setUp(self):
+ self.model_tester = SegGptModelTester(self)
+ self.config_tester = ConfigTester(self, config_class=SegGptConfig, has_text_modality=False)
+
+ def test_config(self):
+ self.config_tester.run_common_tests()
+
+ @unittest.skip(reason="SegGpt does not use inputs_embeds")
+ def test_inputs_embeds(self):
+ pass
+
+ def test_model_common_attributes(self):
+ config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+ for model_class in self.all_model_classes:
+ model = model_class(config)
+ self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
+
+ def test_forward_signature(self):
+ config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+ for model_class in self.all_model_classes:
+ model = model_class(config)
+ signature = inspect.signature(model.forward)
+ # signature.parameters is an OrderedDict => so arg_names order is deterministic
+ arg_names = [*signature.parameters.keys()]
+
+ expected_arg_names = ["pixel_values", "prompt_pixel_values", "prompt_masks"]
+ self.assertListEqual(arg_names[:3], expected_arg_names)
+
+ def test_model(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_model(*config_and_inputs)
+
+ def test_hidden_states_output(self):
+ def check_hidden_states_output(inputs_dict, config, model_class):
+ model = model_class(config)
+ model.to(torch_device)
+ model.eval()
+
+ with torch.no_grad():
+ outputs = model(**self._prepare_for_class(inputs_dict, model_class))
+
+ hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states
+
+ expected_num_layers = getattr(
+ self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
+ )
+ self.assertEqual(len(hidden_states), expected_num_layers)
+
+ patch_height = patch_width = config.image_size // config.patch_size
+
+ self.assertListEqual(
+ list(hidden_states[0].shape[-3:]),
+ [patch_height, patch_width, self.model_tester.hidden_size],
+ )
+
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ for model_class in self.all_model_classes:
+ inputs_dict["output_hidden_states"] = True
+ check_hidden_states_output(inputs_dict, config, model_class)
+
+ # check that output_hidden_states also work using config
+ del inputs_dict["output_hidden_states"]
+ config.output_hidden_states = True
+
+ check_hidden_states_output(inputs_dict, config, model_class)
+
+ @slow
+ def test_model_from_pretrained(self):
+ for model_name in SEGGPT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
+ model = SegGptModel.from_pretrained(model_name)
+ self.assertIsNotNone(model)
+
+
+def prepare_img():
+ ds = load_dataset("EduardoPacheco/seggpt-example-data")["train"]
+ images = [image.convert("RGB") for image in ds["image"]]
+ masks = [image.convert("RGB") for image in ds["mask"]]
+ return images, masks
+
+
+@require_torch
+@require_vision
+class SegGptModelIntegrationTest(unittest.TestCase):
+ @cached_property
+ def default_image_processor(self):
+ return SegGptImageProcessor.from_pretrained("BAAI/seggpt-vit-large") if is_vision_available() else None
+
+ @slow
+ def test_one_shot_inference(self):
+ model = SegGptForImageSegmentation.from_pretrained("BAAI/seggpt-vit-large").to(torch_device)
+
+ image_processor = self.default_image_processor
+
+ images, masks = prepare_img()
+ input_image = images[1]
+ prompt_image = images[0]
+ prompt_mask = masks[0]
+
+ inputs = image_processor(
+ images=input_image, prompt_images=prompt_image, prompt_masks=prompt_mask, return_tensors="pt"
+ )
+
+ inputs = inputs.to(torch_device)
+ # forward pass
+ with torch.no_grad():
+ outputs = model(**inputs)
+
+ # verify the logits
+ expected_shape = torch.Size((1, 3, 896, 448))
+ self.assertEqual(outputs.pred_masks.shape, expected_shape)
+
+ expected_slice = torch.tensor(
+ [
+ [[-2.1208, -2.1190, -2.1198], [-2.1237, -2.1228, -2.1227], [-2.1232, -2.1226, -2.1228]],
+ [[-2.0405, -2.0396, -2.0403], [-2.0434, -2.0434, -2.0433], [-2.0428, -2.0432, -2.0434]],
+ [[-1.8102, -1.8088, -1.8099], [-1.8131, -1.8126, -1.8129], [-1.8130, -1.8128, -1.8131]],
+ ]
+ ).to(torch_device)
+
+ self.assertTrue(torch.allclose(outputs.pred_masks[0, :, :3, :3], expected_slice, atol=1e-4))
+
+ result = image_processor.post_process_semantic_segmentation(outputs, [input_image.size[::-1]])[0]
+
+ result_expected_shape = torch.Size((170, 297))
+ expected_area = 1082
+ area = (result > 0).sum().item()
+ self.assertEqual(result.shape, result_expected_shape)
+ self.assertEqual(area, expected_area)
+
+ @slow
+ def test_few_shot_inference(self):
+ model = SegGptForImageSegmentation.from_pretrained("BAAI/seggpt-vit-large").to(torch_device)
+ image_processor = self.default_image_processor
+
+ images, masks = prepare_img()
+ input_images = [images[1]] * 2
+ prompt_images = [images[0], images[2]]
+ prompt_masks = [masks[0], masks[2]]
+
+ inputs = image_processor(
+ images=input_images, prompt_images=prompt_images, prompt_masks=prompt_masks, return_tensors="pt"
+ )
+
+ inputs = {k: v.to(torch_device) for k, v in inputs.items()}
+ with torch.no_grad():
+ outputs = model(**inputs, feature_ensemble=True)
+
+ expected_shape = torch.Size((2, 3, 896, 448))
+ expected_slice = torch.tensor(
+ [
+ [[-2.1201, -2.1192, -2.1189], [-2.1217, -2.1210, -2.1204], [-2.1216, -2.1202, -2.1194]],
+ [[-2.0393, -2.0390, -2.0387], [-2.0402, -2.0402, -2.0397], [-2.0400, -2.0394, -2.0388]],
+ [[-1.8083, -1.8076, -1.8077], [-1.8105, -1.8102, -1.8099], [-1.8105, -1.8095, -1.8090]],
+ ]
+ ).to(torch_device)
+
+ self.assertEqual(outputs.pred_masks.shape, expected_shape)
+ self.assertTrue(torch.allclose(outputs.pred_masks[0, :, 448:451, :3], expected_slice, atol=4e-4))
diff --git a/tests/test_modeling_common.py b/tests/test_modeling_common.py
index a2a16a1400069c..6d4f0734cbc74a 100755
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -958,6 +958,16 @@ def _create_and_check_torchscript(self, config, inputs_dict):
traced_model = torch.jit.trace(
model, (input_ids, bbox), check_trace=False
) # when traced model is checked, an error is produced due to name mangling
+ elif (
+ "pixel_values" in inputs and "prompt_pixel_values" in inputs and "prompt_masks" in inputs
+ ): # SegGpt requires additional inputs
+ pixel_values = inputs["pixel_values"]
+ prompt_pixel_values = inputs["prompt_pixel_values"]
+ prompt_masks = inputs["prompt_masks"]
+ model(pixel_values, prompt_pixel_values, prompt_masks)
+ traced_model = torch.jit.trace(
+ model, (pixel_values, prompt_pixel_values, prompt_masks), check_trace=False
+ ) # when traced model is checked, an error is produced due to name mangling
else:
main_input = inputs[main_input_name]
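For reference, the new branch traces the model with three positional pixel-level inputs instead of a single `main_input`. A standalone sketch of tracing a multi-input module this way (the module here is a toy stand-in, not SegGpt):

```python
import torch

class TwoStreamModel(torch.nn.Module):
    # Toy stand-in for a model that takes several pixel-level inputs.
    def forward(self, pixel_values, prompt_pixel_values, prompt_masks):
        return pixel_values + prompt_pixel_values * prompt_masks

model = TwoStreamModel().eval()
example_inputs = tuple(torch.rand(1, 3, 4, 4) for _ in range(3))

# check_trace=False mirrors the test above, where name mangling trips the trace check.
traced_model = torch.jit.trace(model, example_inputs, check_trace=False)
print(traced_model(*example_inputs).shape)  # torch.Size([1, 3, 4, 4])
```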
diff --git a/utils/check_repo.py b/utils/check_repo.py
index ca25d7d9e32bf1..7cc06c6781164c 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -308,6 +308,7 @@
"SeamlessM4Tv2NARTextToUnitForConditionalGeneration",
"SeamlessM4Tv2CodeHifiGan",
"SeamlessM4Tv2ForSpeechToSpeech", # no auto class for speech-to-speech
+ "SegGptForImageSegmentation",
"SiglipVisionModel",
"SiglipTextModel",
]
From 871ba71dfa04f9d37a4f32e1f962a1199a5cf51a Mon Sep 17 00:00:00 2001
From: FredericOdermatt <50372080+FredericOdermatt@users.noreply.github.com>
Date: Tue, 27 Feb 2024 09:43:52 +0900
Subject: [PATCH 023/549] GenerationConfig validate both constraints and
force_words_ids (#29163)
GenerationConfig validate both options for constrained decoding: constraints and force_words_ids
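In practice this means both ways of requesting constrained beam search are now validated identically. A short sketch of the behaviour covered by the test below (parameter values are illustrative):

```python
from transformers import GenerationConfig

# Constrained beam search is incompatible with sampling, so both of these raise a
# ValueError; previously only the `constraints` case was caught.
for kwargs in ({"constraints": ["dummy"]}, {"force_words_ids": [[[1, 2, 3]]]}):
    try:
        GenerationConfig(do_sample=True, num_beams=2, **kwargs)
    except ValueError as err:
        print(type(err).__name__, "->", err)
```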
---
src/transformers/generation/configuration_utils.py | 8 ++++----
tests/generation/test_configuration_utils.py | 5 +++++
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/src/transformers/generation/configuration_utils.py b/src/transformers/generation/configuration_utils.py
index 87335b2667b23d..f6d9c8f52c0066 100644
--- a/src/transformers/generation/configuration_utils.py
+++ b/src/transformers/generation/configuration_utils.py
@@ -482,11 +482,11 @@ def validate(self, is_init=False):
# 3. detect incorrect paramaterization specific to advanced beam modes
else:
# constrained beam search
- if self.constraints is not None:
+ if self.constraints is not None or self.force_words_ids is not None:
constrained_wrong_parameter_msg = (
- "`constraints` is not `None`, triggering constrained beam search. However, `{flag_name}` is set "
- "to `{flag_value}`, which is incompatible with this generation mode. Set `constraints=None` or "
- "unset `{flag_name}` to continue." + fix_location
+ "one of `constraints`, `force_words_ids` is not `None`, triggering constrained beam search. However, "
+ "`{flag_name}` is set to `{flag_value}`, which is incompatible with this generation mode. Set "
+ "`constraints` and `force_words_ids` to `None` or unset `{flag_name}` to continue." + fix_location
)
if self.do_sample is True:
raise ValueError(
diff --git a/tests/generation/test_configuration_utils.py b/tests/generation/test_configuration_utils.py
index 4ff9d35aa0d2dc..a86dd31440487d 100644
--- a/tests/generation/test_configuration_utils.py
+++ b/tests/generation/test_configuration_utils.py
@@ -156,6 +156,11 @@ def test_validate(self):
# Impossible sets of contraints/parameters will raise an exception
with self.assertRaises(ValueError):
GenerationConfig(do_sample=False, num_beams=1, num_return_sequences=2)
+ with self.assertRaises(ValueError):
+ # dummy constraint
+ GenerationConfig(do_sample=True, num_beams=2, constraints=["dummy"])
+ with self.assertRaises(ValueError):
+ GenerationConfig(do_sample=True, num_beams=2, force_words_ids=[[[1, 2, 3]]])
# Passing `generate()`-only flags to `validate` will raise an exception
with self.assertRaises(ValueError):
From a3f9221a449e9b949e71d9b047c66186f023481f Mon Sep 17 00:00:00 2001
From: regisss <15324346+regisss@users.noreply.github.com>
Date: Tue, 27 Feb 2024 03:03:00 +0100
Subject: [PATCH 024/549] Add generate kwargs to VQA pipeline (#29134)
---
src/transformers/pipelines/visual_question_answering.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/transformers/pipelines/visual_question_answering.py b/src/transformers/pipelines/visual_question_answering.py
index f456835d70904e..9106b19d33671a 100644
--- a/src/transformers/pipelines/visual_question_answering.py
+++ b/src/transformers/pipelines/visual_question_answering.py
@@ -123,9 +123,9 @@ def preprocess(self, inputs, padding=False, truncation=False, timeout=None):
model_inputs.update(image_features)
return model_inputs
- def _forward(self, model_inputs):
+ def _forward(self, model_inputs, **generate_kwargs):
if self.model.can_generate():
- model_outputs = self.model.generate(**model_inputs)
+ model_outputs = self.model.generate(**model_inputs, **generate_kwargs)
else:
model_outputs = self.model(**model_inputs)
return model_outputs
From e3fc90ae680becbe90eb5bbf58b409cae76ff8c8 Mon Sep 17 00:00:00 2001
From: Andrei Panferov
Date: Tue, 27 Feb 2024 09:32:39 +0100
Subject: [PATCH 025/549] Cleaner Cache `dtype` and `device` extraction for
CUDA graph generation for quantizers compatibility (#29079)
* input_layernorm as the beacon of hope
* cleaner dtype extraction
* AQLM + CUDA graph test
* is available check
* shorter text test
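The core of the fix is where the static cache's `device` and `dtype` come from: the device is read from a module that quantizers never touch (the input layer norm), and the dtype falls back to the pre-quantization dtype recorded on the config when present. A minimal sketch of that extraction, using stand-in objects rather than a real Llama layer:

```python
import torch
from torch import nn

class DummyConfig:
    # Quantizers record the original compute dtype here; full-precision models don't have it.
    _pre_quantization_dtype = torch.float16

config = DummyConfig()
input_layernorm = nn.LayerNorm(16)  # never quantized, so its weight device is reliable
o_proj = nn.Linear(16, 16)          # may be quantized, so its weight dtype can be misleading

device = input_layernorm.weight.device
if hasattr(config, "_pre_quantization_dtype"):
    dtype = config._pre_quantization_dtype
else:
    dtype = o_proj.weight.dtype

print(device, dtype)  # e.g. cpu torch.float16
```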
---
.../models/llama/modeling_llama.py | 8 ++-
.../aqlm_integration/test_aqlm.py | 68 +++++++++++++++++--
2 files changed, 67 insertions(+), 9 deletions(-)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 8b55b4f7a3f78c..399cfec4ffc7de 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -817,9 +817,13 @@ def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] =
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
for layer in self.model.layers:
- weights = layer.self_attn.o_proj.weight
+ device = layer.input_layernorm.weight.device
+ if hasattr(self.config, "_pre_quantization_dtype"):
+ dtype = self.config._pre_quantization_dtype
+ else:
+ dtype = layer.self_attn.o_proj.weight.dtype
layer.self_attn.past_key_value = cache_cls(
- self.config, max_batch_size, max_cache_len, device=weights.device, dtype=weights.dtype
+ self.config, max_batch_size, max_cache_len, device=device, dtype=dtype
)
def _reset_cache(self):
diff --git a/tests/quantization/aqlm_integration/test_aqlm.py b/tests/quantization/aqlm_integration/test_aqlm.py
index 6a5cefea2fb177..46b64573b93802 100644
--- a/tests/quantization/aqlm_integration/test_aqlm.py
+++ b/tests/quantization/aqlm_integration/test_aqlm.py
@@ -14,10 +14,13 @@
# limitations under the License.
import gc
+import importlib
import tempfile
import unittest
-from transformers import AqlmConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer, OPTForCausalLM
+from packaging import version
+
+from transformers import AqlmConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer, OPTForCausalLM, StaticCache
from transformers.testing_utils import (
require_accelerate,
require_aqlm,
@@ -26,7 +29,7 @@
slow,
torch_device,
)
-from transformers.utils import is_accelerate_available, is_torch_available
+from transformers.utils import is_accelerate_available, is_aqlm_available, is_torch_available
if is_torch_available():
@@ -71,11 +74,12 @@ def test_from_dict(self):
@require_aqlm
@require_accelerate
class AqlmTest(unittest.TestCase):
- model_name = "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch"
+ model_name = "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf"
input_text = "Hello my name is"
+ max_new_tokens = 32
- EXPECTED_OUTPUT = "Hello my name is Katie and I am a 20 year old student at the University of North Carolina at Chapel Hill. I am currently a sophomore and am majoring in Psychology. I am"
+ EXPECTED_OUTPUT = "Hello my name is Katie. I am a 20 year old college student. I am a very outgoing person. I love to have fun and be active. I"
device_map = "cuda"
@@ -144,7 +148,7 @@ def test_quantized_model(self):
"""
input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
- output = self.quantized_model.generate(**input_ids, max_new_tokens=40)
+ output = self.quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens)
self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
def test_raise_if_non_quantized(self):
@@ -164,7 +168,7 @@ def test_save_pretrained(self):
input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
- output = model.generate(**input_ids, max_new_tokens=40)
+ output = model.generate(**input_ids, max_new_tokens=self.max_new_tokens)
self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
@require_torch_multi_gpu
@@ -178,6 +182,56 @@ def test_quantized_model_multi_gpu(self):
self.assertTrue(set(quantized_model.hf_device_map.values()) == {0, 1})
- output = quantized_model.generate(**input_ids, max_new_tokens=40)
+ output = quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens)
self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
+
+ @unittest.skipUnless(
+ is_aqlm_available() and version.parse(importlib.metadata.version("aqlm")) >= version.parse("1.0.3"),
+ "test requires `aqlm>=1.0.3`",
+ )
+ def test_quantized_model_compile(self):
+ """
+ Simple test that checks if the quantized model is working properly
+ """
+
+ # Sample tokens greedily
+ def decode_one_tokens(model, cur_token, input_pos, cache_position):
+ logits = model(
+ cur_token, position_ids=input_pos, cache_position=cache_position, return_dict=False, use_cache=True
+ )[0]
+ new_token = torch.argmax(logits[:, [-1]], dim=-1).to(torch.int)
+
+ return new_token
+
+ # Tokenize the test input
+ input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)["input_ids"]
+ seq_length = input_ids.shape[1]
+
+ # Setup static KV cache for generation
+ self.quantized_model._setup_cache(StaticCache, 1, max_cache_len=seq_length + self.max_new_tokens + 1)
+
+ # Allocate token ids to be generated and copy prefix ids
+ cache_position = torch.arange(seq_length, device=torch_device)
+ generated_ids = torch.zeros(1, seq_length + self.max_new_tokens, dtype=torch.int, device=torch_device)
+ generated_ids[:, cache_position] = input_ids.to(torch_device).to(torch.int)
+
+ # Do a forward pass to fill the prefix cache and compile the kernels if necessary
+ logits = self.quantized_model(input_ids, cache_position=cache_position, return_dict=False, use_cache=True)[0]
+ next_token = torch.argmax(logits[:, [-1]], dim=-1).to(torch.int)
+ generated_ids[:, [seq_length]] = next_token
+
+ with torch.no_grad():
+ # Compile the CUDA graph
+ decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
+
+ # Generate tokens one by one
+ cache_position = torch.tensor([seq_length + 1], device=torch_device)
+ for _ in range(1, self.max_new_tokens):
+ with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
+ next_token = decode_one_tokens(self.quantized_model, next_token.clone(), None, cache_position)
+ generated_ids.index_copy_(1, cache_position, next_token)
+ cache_position += 1
+
+ # Check generated text
+ self.assertEqual(self.tokenizer.decode(generated_ids[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
From 83e366bfd49708796e2c6461d3988d23d008502a Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 27 Feb 2024 12:39:58 +0300
Subject: [PATCH 026/549] Image Feature Extraction docs (#28973)
* Image Feature Extraction docs
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update image_feature_extraction.md
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Address comments
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update docs/source/en/tasks/image_feature_extraction.md
Co-authored-by: Maria Khalusova
* Update image_feature_extraction.md
* Update image_feature_extraction.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Maria Khalusova
---
docs/source/en/_toctree.yml | 2 +
.../en/tasks/image_feature_extraction.md | 134 ++++++++++++++++++
2 files changed, 136 insertions(+)
create mode 100644 docs/source/en/tasks/image_feature_extraction.md
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 976a104294c9c9..d1748d7d43c576 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -73,6 +73,8 @@
title: Depth estimation
- local: tasks/image_to_image
title: Image-to-Image
+ - local: tasks/image_feature_extraction
+ title: Image Feature Extraction
- local: tasks/mask_generation
title: Mask Generation
- local: tasks/knowledge_distillation_for_image_classification
diff --git a/docs/source/en/tasks/image_feature_extraction.md b/docs/source/en/tasks/image_feature_extraction.md
new file mode 100644
index 00000000000000..f924247d261592
--- /dev/null
+++ b/docs/source/en/tasks/image_feature_extraction.md
@@ -0,0 +1,134 @@
+
+
+# Image Feature Extraction
+
+[[open-in-colab]]
+
+Image feature extraction is the task of extracting semantically meaningful features from an image. It has many use cases, including image similarity and image retrieval. Moreover, most computer vision models can be used for image feature extraction: remove the task-specific head (image classification, object detection, etc.) and take the features. Depending on how deep the model is, these features can support lower-level tasks such as edge and corner detection, and they may also capture information about the real world (e.g. what a cat looks like). These outputs can therefore be used to train new classifiers on a specific dataset.
+
+In this guide, you will:
+
+- Learn to build a simple image similarity system on top of the `image-feature-extraction` pipeline.
+- Accomplish the same task with bare model inference.
+
+## Image Similarity using `image-feature-extraction` Pipeline
+
+We have two images of cats sitting on top of fish nets, one of which is generated.
+
+```python
+from PIL import Image
+import requests
+
+img_urls = ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.jpeg"]
+image_real = Image.open(requests.get(img_urls[0], stream=True).raw).convert("RGB")
+image_gen = Image.open(requests.get(img_urls[1], stream=True).raw).convert("RGB")
+```
+
+Let's see the pipeline in action. First, initialize the pipeline. If you don't pass any model to it, the pipeline will be automatically initialized with [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224). If you'd like to calculate similarity, set `pool` to `True`.
+
+```python
+import torch
+from transformers import pipeline
+
+DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+pipe = pipeline(task="image-feature-extraction", model="google/vit-base-patch16-384", device=DEVICE, pool=True)
+```
+
+To infer with `pipe`, pass both images to it.
+
+```python
+outputs = pipe([image_real, image_gen])
+```
+
+The output contains pooled embeddings of those two images.
+
+```python
+# get the length of a single output
+print(len(outputs[0][0]))
+# show outputs
+print(outputs)
+
+# 768
+# [[[-0.03909236937761307, 0.43381670117378235, -0.06913255900144577,
+```
+
+To get the similarity score, we need to pass them to a similarity function.
+
+```python
+from torch.nn.functional import cosine_similarity
+
+similarity_score = cosine_similarity(torch.Tensor(outputs[0]),
+ torch.Tensor(outputs[1]), dim=1)
+
+print(similarity_score)
+
+# tensor([0.6043])
+```
+
+If you want to get the last hidden states before pooling, avoid passing any value for the `pool` parameter, as it is set to `False` by default. These hidden states are useful for training new classifiers or models based on the features from the model.
+
+```python
+pipe = pipeline(task="image-feature-extraction", model="google/vit-base-patch16-224", device=DEVICE)
+output = pipe(image_real)
+```
+
+Since the output is unpooled, we get the last hidden states, where the first dimension is the batch size and the last two are the sequence length and the hidden size.
+
+```python
+import numpy as np
+print(np.array(output).shape)
+# (1, 197, 768)
+```
+
+## Getting Features and Similarities using `AutoModel`
+
+We can also use the `AutoModel` class of Transformers to get the features. `AutoModel` loads any Transformers model without a task-specific head, and we can use it to get the features.
+
+```python
+from transformers import AutoImageProcessor, AutoModel
+
+processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
+model = AutoModel.from_pretrained("google/vit-base-patch16-224").to(DEVICE)
+```
+
+Let's write a simple function for inference. We will pass the inputs to the `processor` first and pass its outputs to the `model`.
+
+```python
+def infer(image):
+ inputs = processor(image, return_tensors="pt").to(DEVICE)
+ outputs = model(**inputs)
+ return outputs.pooler_output
+```
+
+We can pass the images directly to this function and get the embeddings.
+
+```python
+embed_real = infer(image_real)
+embed_gen = infer(image_gen)
+```
+
+We can again compute the cosine similarity between the embeddings.
+
+```python
+from torch.nn.functional import cosine_similarity
+
+similarity_score = cosine_similarity(embed_real, embed_gen, dim=1)
+print(similarity_score)
+
+# tensor([0.6061], device='cuda:0', grad_fn=)
+```
+
From 6d3b643e2ae2763c484c6232691810f647095e03 Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Tue, 27 Feb 2024 10:43:01 +0100
Subject: [PATCH 027/549] Fix `attn_implementation` documentation (#29295)
fix
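For readers looking for where this argument is actually consumed: it is a `from_pretrained` (and auto-class) argument rather than a config field. A typical call looks like the following; the checkpoint name is only illustrative:

```python
from transformers import AutoModelForCausalLM

# Any of "eager", "sdpa", or "flash_attention_2" can be requested, subject to availability.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    attn_implementation="sdpa",
)
```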
---
src/transformers/configuration_utils.py | 2 --
src/transformers/modeling_utils.py | 2 ++
src/transformers/models/auto/auto_factory.py | 2 ++
3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/transformers/configuration_utils.py b/src/transformers/configuration_utils.py
index 819fe5fcf288be..dd2ed9d695e73b 100755
--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -236,8 +236,6 @@ class PretrainedConfig(PushToHubMixin):
This attribute is currently not being used during model loading time, but this may change in the future
versions. But we can already start preparing for the future by saving the dtype with save_pretrained.
- attn_implementation (`str`, *optional*):
- The attention implementation to use in the model. Can be any of `"eager"` (manual implementation of the attention), `"sdpa"` (attention using [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html)), or `"flash_attention_2"` (attention using [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention)). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` implementation.
> TensorFlow specific parameters
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index b3102a37d37f31..38dde4ec91e267 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -2696,6 +2696,8 @@ def from_pretrained(
[pull request 11471](https://github.com/huggingface/transformers/pull/11471) for more information.
+ attn_implementation (`str`, *optional*):
+ The attention implementation to use in the model (if relevant). Can be any of `"eager"` (manual implementation of the attention), `"sdpa"` (using [`F.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html)), or `"flash_attention_2"` (using [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention)). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` implementation.
> Parameters for big model inference
diff --git a/src/transformers/models/auto/auto_factory.py b/src/transformers/models/auto/auto_factory.py
index ce7884d2ef120e..98c0e851bcc22d 100644
--- a/src/transformers/models/auto/auto_factory.py
+++ b/src/transformers/models/auto/auto_factory.py
@@ -58,6 +58,8 @@
The model class to instantiate is selected based on the configuration class:
List options
+ attn_implementation (`str`, *optional*):
+ The attention implementation to use in the model (if relevant). Can be any of `"eager"` (manual implementation of the attention), `"sdpa"` (using [`F.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html)), or `"flash_attention_2"` (using [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention)). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual `"eager"` implementation.
Examples:
From 63a0c8f1cb8c5434297c213471e4ec467ae81d47 Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Tue, 27 Feb 2024 17:44:48 +0800
Subject: [PATCH 028/549] [tests] enable benchmark unit tests on XPU (#29284)
* add xpu for benchmark
* no auto_map
* use require_torch_gpu
* use gpu
* revert
* revert
* fix style
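The benchmark device selection now also considers XPU before falling back to CUDA or CPU. A simplified mirror of the branch added below (the TPU case is omitted for brevity):

```python
import torch
from transformers.utils import is_torch_xpu_available

if is_torch_xpu_available():
    device = torch.device("xpu")
    n_gpu = torch.xpu.device_count()
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpu = torch.cuda.device_count()

print(device, n_gpu)
```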
---
src/transformers/benchmark/benchmark_args.py | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/src/transformers/benchmark/benchmark_args.py b/src/transformers/benchmark/benchmark_args.py
index b5887e4a9bcb4b..c20683e416843b 100644
--- a/src/transformers/benchmark/benchmark_args.py
+++ b/src/transformers/benchmark/benchmark_args.py
@@ -17,7 +17,14 @@
from dataclasses import dataclass, field
from typing import Tuple
-from ..utils import cached_property, is_torch_available, is_torch_tpu_available, logging, requires_backends
+from ..utils import (
+ cached_property,
+ is_torch_available,
+ is_torch_tpu_available,
+ is_torch_xpu_available,
+ logging,
+ requires_backends,
+)
from .benchmark_args_utils import BenchmarkArguments
@@ -84,6 +91,9 @@ def _setup_devices(self) -> Tuple["torch.device", int]:
elif is_torch_tpu_available():
device = xm.xla_device()
n_gpu = 0
+ elif is_torch_xpu_available():
+ device = torch.device("xpu")
+ n_gpu = torch.xpu.device_count()
else:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
From 5c341d4555ba3e4b656053317e372ebed0c5af37 Mon Sep 17 00:00:00 2001
From: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Date: Tue, 27 Feb 2024 17:51:37 +0800
Subject: [PATCH 029/549] Use torch 2.2 for deepspeed CI (#29246)
update
Co-authored-by: ydshieh
---
.../Dockerfile | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile b/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile
index a7b08a8c60d31d..648aaa189d859e 100644
--- a/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile
+++ b/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile
@@ -1,10 +1,10 @@
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-11.html#rel-23-11
-FROM nvcr.io/nvidia/pytorch:23.11-py3
+FROM nvcr.io/nvidia/pytorch:23.04-py3
LABEL maintainer="Hugging Face"
ARG DEBIAN_FRONTEND=noninteractive
-ARG PYTORCH='2.1.0'
+ARG PYTORCH='2.2.0'
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu121'
@@ -15,14 +15,12 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip
ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
-RUN python3 -m pip uninstall -y torch torchvision torchaudio
+RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
# Install latest release PyTorch
# (PyTorch must be installed before pre-compiling any DeepSpeed c++/cuda ops.)
# (https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops)
-RUN python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
-
-RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing]
+RUN python3 -m pip uninstall -y torch torchvision torchaudio && python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
From 8a1faf2803f987803025453851564e7f55bbeff2 Mon Sep 17 00:00:00 2001
From: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Date: Tue, 27 Feb 2024 09:58:43 -0500
Subject: [PATCH 030/549] Add compatibility with skip_memory_metrics for mps
device (#29264)
* Add compatibility with mps device
* fix
* typo and style
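The memory tracker now uses the MPS backend's allocator counters where available; since `torch.mps` has no `max_memory_allocated()` yet, the "peaked" metric is reported as unavailable. A hedged sketch of the calls involved (only meaningful on Apple Silicon):

```python
import torch

if torch.backends.mps.is_available():
    torch.mps.empty_cache()
    used_now = torch.mps.current_allocated_memory()
    print(f"MPS memory currently allocated: {used_now} bytes")
    # No torch.mps.max_memory_allocated() yet, so no peak figure can be reported.
else:
    print("MPS backend not available on this machine.")
```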
---
src/transformers/trainer_utils.py | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/transformers/trainer_utils.py b/src/transformers/trainer_utils.py
index fc16a9a827a920..803f6fe840e7d0 100644
--- a/src/transformers/trainer_utils.py
+++ b/src/transformers/trainer_utils.py
@@ -526,6 +526,8 @@ def start(self):
elif is_torch_npu_available():
self.torch.npu.reset_peak_memory_stats()
self.torch.npu.empty_cache()
+ elif is_torch_mps_available():
+ self.torch.mps.empty_cache()
# gpu
if self.torch is not None:
@@ -535,6 +537,8 @@ def start(self):
self.gpu_mem_used_at_start = self.torch.xpu.memory_allocated()
elif is_torch_npu_available():
self.gpu_mem_used_at_start = self.torch.npu.memory_allocated()
+ elif is_torch_mps_available():
+ self.gpu_mem_used_at_start = self.torch.mps.current_allocated_memory()
# cpu
self.cpu_mem_used_at_start = self.cpu_mem_used()
@@ -564,6 +568,8 @@ def stop(self, stage):
self.torch.xpu.empty_cache()
elif is_torch_npu_available():
self.torch.npu.empty_cache()
+ elif is_torch_mps_available():
+ self.torch.mps.empty_cache()
# concepts:
# - alloc_delta: the difference of allocated memory between the end and the start
@@ -581,6 +587,11 @@ def stop(self, stage):
elif is_torch_npu_available():
self.gpu_mem_used_now = self.torch.npu.memory_allocated()
self.gpu_mem_used_peak = self.torch.npu.max_memory_allocated()
+ elif is_torch_mps_available():
+ self.gpu_mem_used_now = self.torch.mps.current_allocated_memory()
+ # self.torch.mps.max_memory_allocated() does not exist yet
+ self.gpu_mem_used_peak = None
+
else:
raise ValueError("No available GPU device found!")
@@ -588,8 +599,11 @@ def stop(self, stage):
"begin": self.gpu_mem_used_at_start,
"end": self.gpu_mem_used_now,
"alloc": (self.gpu_mem_used_now - self.gpu_mem_used_at_start),
- "peaked": max(0, self.gpu_mem_used_peak - self.gpu_mem_used_now),
}
+ if self.gpu_mem_used_peak is not None:
+ self.gpu[self.cur_stage]["peaked"] = max(0, self.gpu_mem_used_peak - self.gpu_mem_used_now)
+ else:
+ self.gpu[self.cur_stage]["peaked"] = "Not available"
# cpu
self.cpu_mem_used_now = self.cpu_mem_used()
From ddf7ac4237cfa08c50e65c297f7afa97a093fa91 Mon Sep 17 00:00:00 2001
From: Raushan Turganbay
Date: Tue, 27 Feb 2024 23:15:26 +0500
Subject: [PATCH 031/549] Token level timestamps for long-form generation in
Whisper (#29148)
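With this change, word-level timestamps also work when Whisper takes its long-form (beyond 30 seconds) generation path, where results are stored per segment. A hedged usage sketch through the ASR pipeline; the audio is just a few short test samples concatenated so that the input is likely to exceed 30 seconds:

```python
import numpy as np
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

# Concatenate several 16 kHz samples so the long-form generation path is taken.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = np.concatenate([ds[i]["audio"]["array"] for i in range(8)])

out = pipe(audio, return_timestamps="word")
print(out["chunks"][:5])  # word-level chunks with (start, end) timestamps
```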
---
.../models/whisper/generation_whisper.py | 19 +++++-
.../pipelines/automatic_speech_recognition.py | 11 +++-
tests/models/whisper/test_modeling_whisper.py | 50 +++++++++++++++
..._pipelines_automatic_speech_recognition.py | 64 +++++++++++++++++++
4 files changed, 141 insertions(+), 3 deletions(-)
diff --git a/src/transformers/models/whisper/generation_whisper.py b/src/transformers/models/whisper/generation_whisper.py
index 0d6addb5631bec..5b5957d53478ec 100644
--- a/src/transformers/models/whisper/generation_whisper.py
+++ b/src/transformers/models/whisper/generation_whisper.py
@@ -720,6 +720,7 @@ def generate(
input_stride=input_stride,
prev_idx=prev_i,
idx=i,
+ return_token_timestamps=return_token_timestamps,
)
current_segments[prev_i] += segments
@@ -809,11 +810,15 @@ def generate_with_fallback(
# remove eos token id
if is_not_final and seek_sequence[-1] == generation_config.eos_token_id:
seek_sequence = seek_sequence[:-1]
+ if return_token_timestamps:
+ seek_outputs[i]["token_timestamps"] = seek_outputs[i]["token_timestamps"][:-1]
# remove all padding tokens
if seek_sequence[-1] == generation_config.pad_token_id:
num_paddings = (seek_sequence == generation_config.pad_token_id).sum()
seek_sequence = seek_sequence[:-num_paddings]
+ if return_token_timestamps:
+ seek_outputs[i]["token_timestamps"] = seek_outputs[i]["token_timestamps"][:-num_paddings]
# check which sequences in batch need fallback & which should be skipped
needs_fallback[i], should_skip[i] = self._need_fallback(
@@ -878,15 +883,18 @@ def _postprocess_outputs(self, seek_outputs, decoder_input_ids, return_token_tim
seek_outputs["token_timestamps"] = self._extract_token_timestamps(
seek_outputs, generation_config.alignment_heads, num_frames=num_frames
)
+ seek_outputs["token_timestamps"] = seek_outputs["token_timestamps"][:, decoder_input_ids.shape[-1] :]
seek_outputs["sequences"] = seek_outputs["sequences"][:, decoder_input_ids.shape[-1] :]
def split_by_batch_index(values, key, batch_idx):
if key == "scores":
return [v[batch_idx].cpu() for v in values]
- if key == "past_key_values":
+ elif key == "past_key_values":
# we don't save `past_key_values` as this is too costly
return None
+ elif isinstance(values[batch_idx], tuple) and torch.is_tensor(values[batch_idx][0]):
+ return tuple(tuple(w[batch_idx][None].cpu() for w in v) for v in values)
return values[batch_idx].cpu()
sequence_tokens = seek_outputs["sequences"]
@@ -1611,6 +1619,7 @@ def _retrieve_segment(
input_stride,
prev_idx,
idx,
+ return_token_timestamps,
):
# find the predicted "end of segment" predictions of Whisper
# "end of segment" predictions occur whenever Whisper predicts a timestamp token
@@ -1618,6 +1627,7 @@ def _retrieve_segment(
single_timestamp_ending = timestamp_tokens[-2:].tolist() == [False, True]
timestamp_segment_indices = torch.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0]
timestamp_segment_indices.add_(1)
+ token_timestamps = seek_outputs[idx]["token_timestamps"] if return_token_timestamps else []
# If whisper predicted a "end of segment" via a timestep token, let's go ever each
# "end of segment" prediction and slice the decoding into segments accordingly
@@ -1642,6 +1652,10 @@ def _retrieve_segment(
"result": seek_outputs[idx],
}
)
+ if return_token_timestamps:
+ segments[-1]["token_timestamps"] = (
+ token_timestamps[last_slice:current_slice] + time_offset[prev_idx]
+ )
last_slice = current_slice
if single_timestamp_ending:
@@ -1661,7 +1675,6 @@ def _retrieve_segment(
if timestamps.numel() > 0 and timestamps[-1].item() != timestamp_begin:
# no consecutive timestamps but it has a timestamp; use the last one.
last_timestamp_pos = timestamps[-1].item() - timestamp_begin
-
segments = [
{
"start": time_offset[prev_idx],
@@ -1670,6 +1683,8 @@ def _retrieve_segment(
"result": seek_outputs[idx],
}
]
+ if return_token_timestamps:
+ segments[-1]["token_timestamps"] = token_timestamps + time_offset[prev_idx]
segment_offset = seek_num_frames[prev_idx]
return segments, segment_offset
diff --git a/src/transformers/pipelines/automatic_speech_recognition.py b/src/transformers/pipelines/automatic_speech_recognition.py
index 5e392502c92a33..ee976e9ece0a6c 100644
--- a/src/transformers/pipelines/automatic_speech_recognition.py
+++ b/src/transformers/pipelines/automatic_speech_recognition.py
@@ -483,6 +483,7 @@ def _forward(self, model_inputs, return_timestamps=False, generate_kwargs=None):
generate_kwargs["return_timestamps"] = return_timestamps
if return_timestamps == "word":
generate_kwargs["return_token_timestamps"] = True
+ generate_kwargs["return_segments"] = True
if stride is not None:
if isinstance(stride, tuple):
@@ -499,8 +500,16 @@ def _forward(self, model_inputs, return_timestamps=False, generate_kwargs=None):
attention_mask=attention_mask,
**generate_kwargs,
)
+ # whisper longform generation stores timestamps in "segments"
if return_timestamps == "word" and self.type == "seq2seq_whisper":
- out = {"tokens": tokens["sequences"], "token_timestamps": tokens["token_timestamps"]}
+ if "segments" not in tokens:
+ out = {"tokens": tokens["sequences"], "token_timestamps": tokens["token_timestamps"]}
+ else:
+ token_timestamps = [
+ torch.cat([segment["token_timestamps"] for segment in segment_list])
+ for segment_list in tokens["segments"]
+ ]
+ out = {"tokens": tokens["sequences"], "token_timestamps": token_timestamps}
else:
out = {"tokens": tokens}
if self.type == "seq2seq_whisper":
diff --git a/tests/models/whisper/test_modeling_whisper.py b/tests/models/whisper/test_modeling_whisper.py
index 1f92f1523dbbde..dc24a5bc34794b 100644
--- a/tests/models/whisper/test_modeling_whisper.py
+++ b/tests/models/whisper/test_modeling_whisper.py
@@ -1969,6 +1969,56 @@ def test_tiny_token_timestamp_batch_generation(self):
self.assertEqual(len(generate_outputs.sequences), num_return_sequences * num_samples)
+ @slow
+ def test_tiny_token_timestamp_generation_longform(self):
+ set_seed(0)
+ processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
+ model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
+ model.to(torch_device)
+ model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
+
+ input_speech = self._load_datasamples(5)
+ long_input_speech = np.concatenate(input_speech, dtype=np.float32)
+ inputs = processor.feature_extractor(
+ raw_speech=long_input_speech,
+ return_tensors="pt",
+ truncation=False, # False so the audio isn't truncated and whole audio is sent to the model
+ return_attention_mask=True,
+ padding=True,
+ )
+
+ inputs = inputs.to(torch_device)
+ generate_outputs = model.generate(**inputs, return_segments=True, return_token_timestamps=True)
+
+ token_timestamps_shape = [
+ [segment["token_timestamps"].shape for segment in segment_list]
+ for segment_list in generate_outputs["segments"]
+ ]
+ tokens_shape = [
+ [segment["tokens"].shape for segment in segment_list] for segment_list in generate_outputs["segments"]
+ ]
+ self.assertListEqual(tokens_shape, token_timestamps_shape)
+
+ # fmt: off
+ EXPECTED_OUTPUT = [
+ torch.tensor([0.0000, 0.4200, 0.8200, 0.9400, 1.1200, 1.1200, 1.2200, 1.5000, 1.7200, 2.0400, 2.3400, 2.5200, 2.6600, 3.2000, 3.4400, 3.5600, 3.6800, 3.8200, 4.1000, 4.3000, 4.5800, 4.9400, 5.4000, 6.3600]),
+ torch.tensor([ 6.5400, 6.5400, 6.7400, 6.9600, 7.2600, 7.3400, 7.5800, 7.5800, 7.6400, 7.8400, 8.1000, 8.5000, 9.0000, 9.4800, 9.7200, 10.2600, 11.1000]),
+ torch.tensor([11.2200, 11.2200, 11.4200, 11.6600, 12.0800, 12.4400, 12.5800, 12.8400, 13.1800, 13.6800, 14.0000, 14.2200, 14.6200, 14.9800, 15.2200, 15.6000, 15.9400, 16.2000, 16.5600, 16.8400, 16.9800]),
+ torch.tensor([16.9800, 16.9800, 17.3200, 18.1600, 18.6400, 18.8600, 19.2800, 19.5600, 19.8800, 20.1800, 20.3800, 20.7200, 21.1600, 21.5400, 21.9000, 22.2000, 22.4200, 22.8600, 23.7000]),
+ torch.tensor([23.7000, 23.7000, 23.9400, 24.1800, 24.3800, 24.8400, 25.2800, 25.6600, 25.9200, 26.2600, 26.4000, 26.5800, 26.7600, 27.1400, 27.3800, 28.0400, 28.3800, 28.8200, 29.3400, 29.5200]),
+ torch.tensor([29.4400, 29.4400, 29.7000, 30.0800, 30.3800, 30.5400, 30.8200, 31.0600, 31.6600, 31.9200, 32.3000, 32.4800, 32.6200, 33.6800]),
+ torch.tensor([33.8000, 33.8000, 33.9800, 33.9800, 34.1800, 34.4400, 34.6200, 35.0000, 35.2200, 35.3200, 35.5600, 35.9200, 36.3800, 36.6200, 36.6600, 36.9600, 37.3400, 37.9800, 38.5800, 38.7200, 38.9800, 39.4400, 39.5800, 39.8000, 40.1200, 40.2600]),
+ torch.tensor([40.5200, 40.5200, 40.6200, 41.1000, 41.5400, 41.9200, 42.1000, 42.3200, 42.3200, 43.0600, 44.6000]),
+ torch.tensor([44.7000, 44.7000, 44.8600, 44.9400, 45.1400, 45.1400, 45.2800, 45.6200, 45.9000, 46.2600, 47.1600, 47.4800, 47.7400, 48.1000, 48.2800, 48.4000, 48.6200, 48.8400, 49.0400, 49.2800, 49.4800, 49.6600, 49.9400, 50.5400]),
+ torch.tensor([50.5400, 50.5400, 50.6600, 50.8800, 51.2400, 51.7200, 52.8400]),
+ torch.tensor([52.9600, 52.9600, 53.0400, 53.2600, 53.4200, 53.5800, 53.9200, 54.1200, 54.7200, 54.9400, 55.2600, 55.6200, 55.9800, 56.5600, 56.8000, 56.9200, 57.3600, 57.9200, 58.1800, 58.5000, 58.6400, 58.8200]),
+ torch.tensor([58.6800, 58.6800, 59.1400, 59.5400, 59.9200, 60.1600, 60.3800, 60.8200, 61.6200, 62.2600, 75.2000]),
+ ]
+ # fmt: on
+
+ for segment, exp_segment in zip(generate_outputs["segments"][0], EXPECTED_OUTPUT):
+ self.assertTrue(torch.allclose(segment["token_timestamps"], exp_segment))
+
@slow
def test_tiny_specaugment_librispeech(self):
torch_device = "cpu"
diff --git a/tests/pipelines/test_pipelines_automatic_speech_recognition.py b/tests/pipelines/test_pipelines_automatic_speech_recognition.py
index 42cb7e50c2e1ac..d2af7e44687fbc 100644
--- a/tests/pipelines/test_pipelines_automatic_speech_recognition.py
+++ b/tests/pipelines/test_pipelines_automatic_speech_recognition.py
@@ -361,6 +361,70 @@ def test_return_timestamps_in_preprocess(self):
)
# fmt: on
+ @slow
+ @require_torch
+ def test_return_timestamps_in_preprocess_longform(self):
+ pipe = pipeline(
+ task="automatic-speech-recognition",
+ model="openai/whisper-tiny.en",
+ )
+ data = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
+ samples = [next(iter(data)) for _ in range(8)]
+ audio = np.concatenate([sample["audio"]["array"] for sample in samples])
+
+ res = pipe(audio)
+ expected_output = {
+ "text": " Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst "
+ "the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst "
+ "the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst "
+ "the tents. Concord returned to its place amidst the tents."
+ }
+ self.assertEqual(res, expected_output)
+ res = pipe(audio, return_timestamps=True)
+ self.assertEqual(
+ res,
+ {
+ "text": " Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents. Concord returned to its place amidst the tents.",
+ "chunks": [
+ {"timestamp": (0.0, 3.22), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (3.22, 6.74), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (6.74, 10.26), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (10.26, 13.78), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (13.78, 17.3), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (17.3, 20.82), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (20.82, 24.34), "text": " Concord returned to its place amidst the tents."},
+ {"timestamp": (24.34, 27.86), "text": " Concord returned to its place amidst the tents."},
+ ],
+ },
+ )
+ pipe.model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
+ res = pipe(audio, return_timestamps="word")
+
+ # fmt: off
+ self.assertEqual(
+ res["chunks"][:15],
+ [
+ {"text": " Concord", "timestamp": (0.5, 0.94)},
+ {"text": " returned", "timestamp": (0.94, 1.52)},
+ {"text": " to", "timestamp": (1.52, 1.78)},
+ {"text": " its", "timestamp": (1.78, 1.98)},
+ {"text": " place", "timestamp": (1.98, 2.16)},
+ {"text": " amidst", "timestamp": (2.16, 2.5)},
+ {"text": " the", "timestamp": (2.5, 2.9)},
+ {"text": " tents.", "timestamp": (2.9, 4.2)},
+ {"text": " Concord", "timestamp": (4.2, 4.5)},
+ {"text": " returned", "timestamp": (4.5, 5.0)},
+ {"text": " to", "timestamp": (5.0, 5.28)},
+ {"text": " its", "timestamp": (5.28, 5.48)},
+ {"text": " place", "timestamp": (5.48, 5.7)},
+ {"text": " amidst", "timestamp": (5.7, 6.02)},
+ {"text": " the", "timestamp": (6.02, 6.4)}
+            ],
+ )
+ # fmt: on
+
@require_torch
def test_return_timestamps_in_init(self):
# segment-level timestamps are accepted
From 227cd54aa51280086d97c6d8463541d76b0b075f Mon Sep 17 00:00:00 2001
From: Sadra Barikbin
Date: Tue, 27 Feb 2024 21:45:43 +0330
Subject: [PATCH 032/549] Fix a few typos in `GenerationMixin`'s docstring
(#29277)
Co-authored-by: Joao Gante
---
src/transformers/generation/utils.py | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index ff5421ad4832a5..5b7d18e06c1d10 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -143,7 +143,7 @@ class GenerateEncoderDecoderOutput(ModelOutput):
Outputs of encoder-decoder generation models, when using non-beam methods.
Args:
- sequences (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ sequences (`torch.LongTensor` of shape `(batch_size*num_return_sequences, sequence_length)`):
The generated sequences. The second dimension (sequence_length) is either equal to `max_length` or shorter
if all batches finished early due to the `eos_token_id`.
scores (`tuple(torch.FloatTensor)` *optional*, returned when `output_scores=True` is passed or when `config.output_scores=True`):
@@ -204,7 +204,7 @@ class GenerateBeamDecoderOnlyOutput(ModelOutput):
Beam transition scores for each vocabulary token at each generation step. Beam transition scores consisting
of log probabilities of tokens conditioned on log softmax of previously generated tokens in this beam.
Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token),
- with each tensor of shape `(batch_size*num_beams*num_return_sequences, config.vocab_size)`.
+ with each tensor of shape `(batch_size*num_beams, config.vocab_size)`.
logits (`tuple(torch.FloatTensor)` *optional*, returned when `output_logits=True` is passed or when `config.output_logits=True`):
Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax)
at each generation step. Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for
@@ -981,9 +981,9 @@ def compute_transition_scores(
shorter if all batches finished early due to the `eos_token_id`.
scores (`tuple(torch.FloatTensor)`):
Transition scores for each vocabulary token at each generation step. Beam transition scores consisting
- of log probabilities of tokens conditioned on log softmax of previously generated tokens Tuple of
- `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token), with
- each tensor of shape `(batch_size*num_beams, config.vocab_size)`.
+ of log probabilities of tokens conditioned on log softmax of previously generated tokens in this beam.
+ Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token),
+ with each tensor of shape `(batch_size*num_beams, config.vocab_size)`.
beam_indices (`torch.LongTensor`, *optional*):
Beam indices of generated token id at each generation step. `torch.LongTensor` of shape
`(batch_size*num_return_sequences, sequence_length)`. Only required if a `num_beams>1` at
@@ -1251,12 +1251,12 @@ def generate(
inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
The sequence used as a prompt for the generation or as model inputs to the encoder. If `None` the
method initializes it with `bos_token_id` and a batch size of 1. For decoder-only models `inputs`
- should of in the format of `input_ids`. For encoder-decoder models *inputs* can represent any of
+ should be in the format of `input_ids`. For encoder-decoder models *inputs* can represent any of
`input_ids`, `input_values`, `input_features`, or `pixel_values`.
generation_config (`~generation.GenerationConfig`, *optional*):
The generation configuration to be used as base parametrization for the generation call. `**kwargs`
passed to generate matching the attributes of `generation_config` will override them. If
- `generation_config` is not provided, the default will be used, which had the following loading
+ `generation_config` is not provided, the default will be used, which has the following loading
priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model
configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s
default values, whose documentation should be checked to parameterize generation.
@@ -1265,7 +1265,7 @@ def generate(
generation config. If a logit processor is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users.
stopping_criteria (`StoppingCriteriaList`, *optional*):
- Custom stopping criteria that complement the default stopping criteria built from arguments and a
+ Custom stopping criteria that complements the default stopping criteria built from arguments and a
generation config. If a stopping criteria is passed that is already created with the arguments or a
generation config an error is thrown. If your stopping criteria depends on the `scores` input, make
sure you pass `return_dict_in_generate=True, output_scores=True` to `generate`. This feature is
@@ -1295,7 +1295,7 @@ def generate(
negative_prompt_attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Attention_mask for `negative_prompt_ids`.
kwargs (`Dict[str, Any]`, *optional*):
- Ad hoc parametrization of `generate_config` and/or additional model-specific kwargs that will be
+ Ad hoc parametrization of `generation_config` and/or additional model-specific kwargs that will be
forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder
specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder_*.
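For context on how the corrected shapes fit together, here is a minimal sketch (using `gpt2` purely as an illustrative checkpoint) that requests `output_scores` during greedy generation and then computes transition scores for the generated tokens:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(
    **inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True
)

# `outputs.scores` is a tuple with one `(batch_size*num_beams, vocab_size)` tensor per
# generated token, matching the corrected docstring above (greedy search: num_beams=1).
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
print(transition_scores.shape)  # (batch_size*num_return_sequences, max_new_tokens)
```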
From 83ab0115d1e93009eb52b66096e924bb44f928a1 Mon Sep 17 00:00:00 2001
From: Michael
Date: Wed, 28 Feb 2024 03:26:57 +0800
Subject: [PATCH 033/549] [i18n-zh] Translate fsdp.md into Chinese (#29305)
* [i18n-zh] Translate fsdp.md into Chinese
Signed-off-by: windsonsea
* apply suggestions from Fan-Lin
---------
Signed-off-by: windsonsea
---
docs/source/zh/_toctree.yml | 2 +
docs/source/zh/fsdp.md | 161 ++++++++++++++++++++++++++++++++++++
2 files changed, 163 insertions(+)
create mode 100644 docs/source/zh/fsdp.md
diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml
index 7149e4c2f147da..f81f264655ea0d 100644
--- a/docs/source/zh/_toctree.yml
+++ b/docs/source/zh/_toctree.yml
@@ -55,6 +55,8 @@
- local: performance
title: 综述
- sections:
+ - local: fsdp
+ title: 完全分片数据并行
- local: perf_hardware
title: 用于训练的定制硬件
- local: hpo_train
diff --git a/docs/source/zh/fsdp.md b/docs/source/zh/fsdp.md
new file mode 100644
index 00000000000000..a322ec81e52c35
--- /dev/null
+++ b/docs/source/zh/fsdp.md
@@ -0,0 +1,161 @@
+
+
+# Fully Sharded Data Parallel
+
+[Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data parallel method that shards a model's parameters,
+gradients and optimizer states across the available GPUs (also called workers or *ranks*).
+Unlike [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), which keeps a full copy of the model on every GPU,
+FSDP reduces memory usage. This improves GPU memory efficiency and
+allows you to train much larger models with fewer GPUs. FSDP is integrated into Accelerate,
+a library for easily managing training in distributed environments, which means it can be used from the [`Trainer`] class.
+
+Before you start, make sure Accelerate is installed and that you are using at least PyTorch 2.1.0.
+
+```bash
+pip install accelerate
+```
+
+## FSDP configuration
+
+To start, run the [`accelerate config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config)
+command to create a configuration file for your training environment. Accelerate uses this configuration file to
+automatically set up the correct training environment based on the training options you select in `accelerate config`.
+
+```bash
+accelerate config
+```
+
+When you run `accelerate config`, you are prompted with a series of options for configuring the training environment.
+This section covers some of the most important FSDP options. To learn more about the other available FSDP options,
+take a look at the [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameters.
+
+### Sharding strategy
+
+FSDP offers several sharding strategies to choose from:
+
+- `FULL_SHARD` - shard model parameters, gradients and optimizer states across workers; select `1` for this option
+- `SHARD_GRAD_OP` - shard gradients and optimizer states across workers; select `2` for this option
+- `NO_SHARD` - don't shard anything (this is equivalent to DDP); select `3` for this option
+- `HYBRID_SHARD` - shard model parameters, gradients and optimizer states within each worker, where each worker also keeps a full copy; select `4` for this option
+- `HYBRID_SHARD_ZERO2` - shard gradients and optimizer states within each worker, where each worker also keeps a full copy; select `5` for this option
+
+This is enabled by the `fsdp_sharding_strategy` flag.
+
+### CPU offload
+
+You can offload parameters and gradients to the CPU when they are not in use to save even more GPU memory. This helps in
+scenarios where a model is too large to fit even with FSDP. Enable it by setting `fsdp_offload_params: true` when running `accelerate config`.
+
+### Wrapping policy
+
+FSDP is applied by wrapping each layer in the network. The wrapping is usually applied in a nested way, where the full weights are
+discarded after each forward pass to free up memory for the next layer. The **auto wrapping** policy is the simplest way to achieve this,
+and you don't need to change any code. Select `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` to wrap a Transformer layer
+and `fsdp_transformer_layer_cls_to_wrap` to specify which layer to wrap (for example `BertLayer`).
+
+Otherwise, you can choose a size-based wrapping policy where FSDP is applied to a layer if it has more than a certain number of parameters.
+Enable it by setting `fsdp_wrap_policy: SIZE_BASED_WRAP` and `min_num_param`, setting the latter to the desired size threshold.
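+
+As a rough sketch of how the sharding, offloading, and wrapping choices above surface in the [`Trainer`] API (the exact `fsdp` and `fsdp_config` values here are illustrative and may differ slightly between versions):
+
+```py
+from transformers import TrainingArguments
+
+# Illustration only: FULL_SHARD sharding + CPU offload + transformer-based auto wrapping.
+training_args = TrainingArguments(
+    output_dir="output",
+    fsdp="full_shard auto_wrap offload",
+    fsdp_config={
+        "transformer_layer_cls_to_wrap": ["BertLayer"],
+        # alternatively, for a size-based policy:
+        # "min_num_params": 100_000_000,
+    },
+)
+```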
+
+### Checkpointing
+
+Intermediate checkpoints should be saved with `fsdp_state_dict_type: SHARDED_STATE_DICT`,
+because saving the full state dict on rank 0 takes a long time and often results in `NCCL Timeout` errors due to indefinite hanging during broadcasting.
+You can resume training from a sharded state dict with the [`~accelerate.Accelerator.load_state`] method.
+
+```py
+# directory containing the checkpoint
+accelerator.load_state("ckpt")
+```
+
+However, when training is finished you want to save the full state dict, because a sharded state dict is only compatible with FSDP.
+
+```py
+if trainer.is_fsdp_enabled:
+    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
+
+trainer.save_model(script_args.output_dir)
+```
+
+### TPU
+
+[PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html) supports FSDP training for TPUs, and it can be enabled by modifying the FSDP
+configuration file generated by `accelerate config`. In addition to the sharding strategies and wrapping options specified above,
+you can add the parameters below to the file.
+
+```yaml
+xla: True # must be set to True to enable PyTorch/XLA
+xla_fsdp_settings: # XLA-specific FSDP parameters
+xla_fsdp_grad_ckpt: True # use gradient checkpointing
+```
+
+The [`xla_fsdp_settings`](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128)
+allow you to configure additional XLA-specific parameters for FSDP.
+
+## Launch training
+
+An example FSDP configuration file may look like the following:
+
+```yaml
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: "no"
+fsdp_config:
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+ fsdp_backward_prefetch_policy: BACKWARD_PRE
+ fsdp_cpu_ram_efficient_loading: true
+ fsdp_forward_prefetch: false
+ fsdp_offload_params: true
+ fsdp_sharding_strategy: 1
+ fsdp_state_dict_type: SHARDED_STATE_DICT
+ fsdp_sync_module_states: true
+ fsdp_transformer_layer_cls_to_wrap: BertLayer
+ fsdp_use_orig_params: true
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+To launch training, run the [`accelerate launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch)
+command, and it automatically uses the configuration file you previously created with `accelerate config`.
+
+```bash
+accelerate launch my-trainer-script.py
+```
+
+You can also pass the FSDP options directly on the command line:
+
+```bash
+accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-trainer-script.py
+```
+
+## Next steps
+
+FSDP is a powerful tool for training very large models, whether you're using multiple GPUs or TPUs.
+By sharding the model parameters, optimizer and gradient states, and even offloading them to the CPU when they are inactive,
+FSDP reduces the high cost of large-scale training. If you'd like to learn more, the following may be helpful:
+
+- Follow along with the more in-depth Accelerate guide for [FSDP](https://huggingface.co/docs/accelerate/usage_guides/fsdp).
+- Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post.
+- Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post.
From 63caa370e6c618dbe7d3fd4cbf545cc32eca1a15 Mon Sep 17 00:00:00 2001
From: RaymondLi0
Date: Wed, 28 Feb 2024 01:24:34 +0100
Subject: [PATCH 034/549] Starcoder2 model - bis (#29215)
* Copy model
* changes
* misc
* fixes
* add embed and residual dropout (#30)
* misc
* remove rms norm and gated MLP
* remove copied mentions where its not a copy anymore
* remove unused _shape
* copied from mistral instead
* fix copies
* fix copies
* add not doctested
* fix
* fix copyright
* Update docs/source/en/model_doc/starcoder2.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/starcoder2/configuration_starcoder2.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/starcoder2/configuration_starcoder2.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix doc
* revert some changes
* add fa2 tests
* fix styling nit
* fix
* push dummy docs
---------
Co-authored-by: Joel Lamy-Poirier
Co-authored-by: younesbelkada
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
README.md | 1 +
README_es.md | 1 +
README_fr.md | 1 +
README_hd.md | 1 +
README_ja.md | 1 +
README_ko.md | 1 +
README_zh-hans.md | 1 +
README_zh-hant.md | 1 +
docs/source/en/_toctree.yml | 2 +
docs/source/en/index.md | 1 +
docs/source/en/model_doc/starcoder2.md | 43 +
docs/source/en/perf_infer_gpu_one.md | 2 +
docs/source/en/tasks/language_modeling.md | 3 +-
.../en/tasks/sequence_classification.md | 2 +-
src/transformers/__init__.py | 16 +
src/transformers/models/__init__.py | 1 +
.../models/auto/configuration_auto.py | 3 +
src/transformers/models/auto/modeling_auto.py | 3 +
.../models/auto/tokenization_auto.py | 1 +
.../models/starcoder2/__init__.py | 62 +
.../starcoder2/configuration_starcoder2.py | 147 ++
.../models/starcoder2/modeling_starcoder2.py | 1377 +++++++++++++++++
src/transformers/utils/dummy_pt_objects.py | 28 +
tests/models/starcoder2/__init__.py | 0
.../starcoder2/test_modeling_starcoder2.py | 549 +++++++
utils/not_doctested.txt | 1 +
26 files changed, 2247 insertions(+), 2 deletions(-)
create mode 100644 docs/source/en/model_doc/starcoder2.md
create mode 100644 src/transformers/models/starcoder2/__init__.py
create mode 100644 src/transformers/models/starcoder2/configuration_starcoder2.py
create mode 100644 src/transformers/models/starcoder2/modeling_starcoder2.py
create mode 100644 tests/models/starcoder2/__init__.py
create mode 100644 tests/models/starcoder2/test_modeling_starcoder2.py
diff --git a/README.md b/README.md
index 8d9dc398573c9c..54e228a1150266 100644
--- a/README.md
+++ b/README.md
@@ -493,6 +493,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/README_es.md b/README_es.md
index e8b85812f73eb4..b3c6845000d2b4 100644
--- a/README_es.md
+++ b/README_es.md
@@ -466,6 +466,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/README_fr.md b/README_fr.md
index 9ff23f6025b226..4b87eba5bbe1ba 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -487,6 +487,7 @@ Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=h
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (de l'Université de Tel Aviv), publié dans l'article [Réponse à quelques questions avec peu d'exemples par la pré-sélection des spans](https://arxiv.org/abs/2101.00438) par Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (de Berkeley) a été publié dans l'article [SqueezeBERT : Que l'apprentissage automatique peut-il apprendre au traitement du langage naturel sur les réseaux neuronaux efficaces ?](https://arxiv.org/abs/2006.11316) par Forrest N. Iandola, Albert E. Shaw, Ravi Krishna et Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (de MBZUAI) a été publié dans l'article [SwiftFormer : Attention additive efficace pour les applications de vision mobile en temps réel basées sur des transformateurs](https://arxiv.org/abs/2303.15446) par Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (de Microsoft) a été publié dans l'article [Swin Transformer : Transformateur hiérarchique de la vision utilisant des fenêtres décalées](https://arxiv.org/abs/2103.14030) par Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (de Microsoft) a été publié dans l'article [Swin Transformer V2 : Augmentation de la capacité et de la résolution](https://arxiv.org/abs/2111.09883) par Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/README_hd.md b/README_hd.md
index 081d2d3e206484..e68d9d39ba6242 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -440,6 +440,7 @@ conda install conda-forge::transformers
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (तेल अवीव यूनिवर्सिटी से) साथ में पेपर [स्पैन सिलेक्शन को प्री-ट्रेनिंग करके कुछ-शॉट क्वेश्चन आंसरिंग](https://arxiv.org/abs/2101.00438) ओरि राम, युवल कर्स्टन, जोनाथन बेरेंट, अमीर ग्लोबर्सन, ओमर लेवी द्वारा।
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (बर्कले से) कागज के साथ [SqueezeBERT: कुशल तंत्रिका नेटवर्क के बारे में NLP को कंप्यूटर विज़न क्या सिखा सकता है?](https://arxiv.org/abs/2006.11316) फॉरेस्ट एन. इनडोला, अल्बर्ट ई. शॉ, रवि कृष्णा, और कर्ट डब्ल्यू. केटज़र द्वारा।
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI से) Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. द्वाराअनुसंधान पत्र [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) के साथ जारी किया गया
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (माइक्रोसॉफ्ट से) साथ में कागज [स्वाइन ट्रांसफॉर्मर: शिफ्टेड विंडोज का उपयोग कर पदानुक्रमित विजन ट्रांसफॉर्मर](https://arxiv.org/abs/2103.14030) ज़ी लियू, युटोंग लिन, यू काओ, हान हू, यिक्सुआन वेई, झेंग झांग, स्टीफन लिन, बैनिंग गुओ द्वारा।
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft से) साथ वाला पेपर [Swin Transformer V2: स्केलिंग अप कैपेसिटी एंड रेजोल्यूशन](https://arxiv.org/abs/2111.09883) ज़ी लियू, हान हू, युटोंग लिन, ज़ुलिआंग याओ, ज़ेंडा ज़ी, यिक्सुआन वेई, जिया निंग, यू काओ, झेंग झांग, ली डोंग, फुरु वेई, बैनिंग गुओ द्वारा।
diff --git a/README_ja.md b/README_ja.md
index 69e8a05fe5d4bb..d314b07140f504 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -500,6 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (Tel Aviv University から), Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy から公開された研究論文: [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438)
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (Berkeley から) Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer から公開された研究論文: [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316)
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI から) Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. から公開された研究論文 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446)
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (Microsoft から) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo から公開された研究論文: [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft から) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo から公開された研究論文: [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883)
diff --git a/README_ko.md b/README_ko.md
index daa13f8635a907..f8679087ad1787 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -415,6 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (Tel Aviv University 에서) Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy 의 [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 논문과 함께 발표했습니다.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (Berkeley 에서) Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 의 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 논문과 함께 발표했습니다.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI 에서 제공)은 Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.의 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446)논문과 함께 발표했습니다.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (Microsoft 에서) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 의 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 논문과 함께 발표했습니다.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft 에서) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 의 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index 8cd63a9c91c14c..1832870d52ff24 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -439,6 +439,7 @@ conda install conda-forge::transformers
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (来自 Tel Aviv University) 伴随论文 [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 由 Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy 发布。
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (来自 Berkeley) 伴随论文 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 由 Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 发布。
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (来自 MBZUAI) 伴随论文 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) 由 Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan 发布。
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (来自 Microsoft) 伴随论文 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 由 Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 发布。
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (来自 Microsoft) 伴随论文 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 由 Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index ce345a702656b1..2bf31890f359d7 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -451,6 +451,7 @@ conda install conda-forge::transformers
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University) released with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index d1748d7d43c576..ff6e91dbcf25d6 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -484,6 +484,8 @@
title: SqueezeBERT
- local: model_doc/stablelm
title: StableLm
+ - local: model_doc/starcoder2
+ title: Starcoder2
- local: model_doc/switch_transformers
title: SwitchTransformers
- local: model_doc/t5
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index ae5e21d3b59a56..34995edec39c7d 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -261,6 +261,7 @@ Flax), PyTorch, and/or TensorFlow.
| [Splinter](model_doc/splinter) | ✅ | ❌ | ❌ |
| [SqueezeBERT](model_doc/squeezebert) | ✅ | ❌ | ❌ |
| [StableLm](model_doc/stablelm) | ✅ | ❌ | ❌ |
+| [Starcoder2](model_doc/starcoder2) | ✅ | ❌ | ❌ |
| [SwiftFormer](model_doc/swiftformer) | ✅ | ❌ | ❌ |
| [Swin Transformer](model_doc/swin) | ✅ | ✅ | ❌ |
| [Swin Transformer V2](model_doc/swinv2) | ✅ | ❌ | ❌ |
diff --git a/docs/source/en/model_doc/starcoder2.md b/docs/source/en/model_doc/starcoder2.md
new file mode 100644
index 00000000000000..42dac4e06a36e7
--- /dev/null
+++ b/docs/source/en/model_doc/starcoder2.md
@@ -0,0 +1,43 @@
+
+
+# Starcoder2
+
+## Overview
+
+Starcoder2 has been released with the paper [StarCoder-2](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view) by the BigCode team.
+
+The documentation page for the model is coming soon.
+
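+In the meantime, Starcoder2 can be loaded through the Auto classes like any other causal language model. Below is a minimal sketch; the checkpoint name is a placeholder until the official Starcoder2 checkpoints are announced.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+checkpoint = "bigcode/starcoder2-7b"  # placeholder checkpoint name
+
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint)
+
+inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=30)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```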
+
+## Starcoder2Config
+
+[[autodoc]] Starcoder2Config
+
+## Starcoder2Model
+
+[[autodoc]] Starcoder2Model
+ - forward
+
+## Starcoder2ForCausalLM
+
+[[autodoc]] Starcoder2ForCausalLM
+ - forward
+
+## Starcoder2ForSequenceClassification
+
+[[autodoc]] Starcoder2ForSequenceClassification
+ - forward
diff --git a/docs/source/en/perf_infer_gpu_one.md b/docs/source/en/perf_infer_gpu_one.md
index b03460a7a0d15c..06a94be8bb5c8e 100644
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@@ -54,6 +54,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [OPT](https://huggingface.co/docs/transformers/model_doc/opt#transformers.OPTModel)
* [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel)
* [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel)
+* [Starcoder2](https://huggingface.co/docs/transformers/model_doc/starcoder2#transformers.Starcoder2Model)
* [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model)
* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
@@ -180,6 +181,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel)
+* [Starcoder2](https://huggingface.co/docs/transformers/model_doc/starcoder2#transformers.Starcoder2Model)
* [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model)
diff --git a/docs/source/en/tasks/language_modeling.md b/docs/source/en/tasks/language_modeling.md
index 4808552deb2cae..bcd10341b7443e 100644
--- a/docs/source/en/tasks/language_modeling.md
+++ b/docs/source/en/tasks/language_modeling.md
@@ -37,7 +37,8 @@ You can finetune other architectures for causal language modeling following the
Choose one of the following architectures:
-[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [Gemma](../model_doc/gemma), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
+[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [Gemma](../model_doc/gemma), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Starcoder2](../model_doc/starcoder2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
+
diff --git a/docs/source/en/tasks/sequence_classification.md b/docs/source/en/tasks/sequence_classification.md
index 3c1ab03c2b4ed2..544d24a0bad6d5 100644
--- a/docs/source/en/tasks/sequence_classification.md
+++ b/docs/source/en/tasks/sequence_classification.md
@@ -33,7 +33,7 @@ The task illustrated in this tutorial is supported by the following model archit
-[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [CodeLlama](../model_doc/code_llama), [ConvBERT](../model_doc/convbert), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [Gemma](../model_doc/gemma), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [LLaMA](../model_doc/llama), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Perceiver](../model_doc/perceiver), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [StableLm](../model_doc/stablelm), [T5](../model_doc/t5), [TAPAS](../model_doc/tapas), [Transformer-XL](../model_doc/transfo-xl), [UMT5](../model_doc/umt5), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso)
+[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [CodeLlama](../model_doc/code_llama), [ConvBERT](../model_doc/convbert), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [Gemma](../model_doc/gemma), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [LLaMA](../model_doc/llama), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Perceiver](../model_doc/perceiver), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [StableLm](../model_doc/stablelm), [Starcoder2](../model_doc/starcoder2), [T5](../model_doc/t5), [TAPAS](../model_doc/tapas), [Transformer-XL](../model_doc/transfo-xl), [UMT5](../model_doc/umt5), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso)
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index bc1be5842d0260..027cf495466c50 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -809,6 +809,7 @@
"SqueezeBertTokenizer",
],
"models.stablelm": ["STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP", "StableLmConfig"],
+ "models.starcoder2": ["STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Starcoder2Config"],
"models.swiftformer": [
"SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"SwiftFormerConfig",
@@ -3282,6 +3283,14 @@
"StableLmPreTrainedModel",
]
)
+ _import_structure["models.starcoder2"].extend(
+ [
+ "Starcoder2ForCausalLM",
+ "Starcoder2ForSequenceClassification",
+ "Starcoder2Model",
+ "Starcoder2PreTrainedModel",
+ ]
+ )
_import_structure["models.swiftformer"].extend(
[
"SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -5584,6 +5593,7 @@
SqueezeBertTokenizer,
)
from .models.stablelm import STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP, StableLmConfig
+ from .models.starcoder2 import STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP, Starcoder2Config
from .models.swiftformer import (
SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
SwiftFormerConfig,
@@ -7717,6 +7727,12 @@
StableLmModel,
StableLmPreTrainedModel,
)
+ from .models.starcoder2 import (
+ Starcoder2ForCausalLM,
+ Starcoder2ForSequenceClassification,
+ Starcoder2Model,
+ Starcoder2PreTrainedModel,
+ )
from .models.swiftformer import (
SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
SwiftFormerForImageClassification,
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index df5496f09d01d7..ebb3db25fb96be 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -205,6 +205,7 @@
splinter,
squeezebert,
stablelm,
+ starcoder2,
swiftformer,
swin,
swin2sr,
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index ab24b8a332662f..7bc637f3e1060a 100755
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -214,6 +214,7 @@
("splinter", "SplinterConfig"),
("squeezebert", "SqueezeBertConfig"),
("stablelm", "StableLmConfig"),
+ ("starcoder2", "Starcoder2Config"),
("swiftformer", "SwiftFormerConfig"),
("swin", "SwinConfig"),
("swin2sr", "Swin2SRConfig"),
@@ -439,6 +440,7 @@
("splinter", "SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("squeezebert", "SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("stablelm", "STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+ ("starcoder2", "STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("swiftformer", "SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("swin", "SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("swin2sr", "SWIN2SR_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -694,6 +696,7 @@
("splinter", "Splinter"),
("squeezebert", "SqueezeBERT"),
("stablelm", "StableLm"),
+ ("starcoder2", "Starcoder2"),
("swiftformer", "SwiftFormer"),
("swin", "Swin Transformer"),
("swin2sr", "Swin2SR"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 9a2aaaca01dbc5..05b519d2bcd16b 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -203,6 +203,7 @@
("splinter", "SplinterModel"),
("squeezebert", "SqueezeBertModel"),
("stablelm", "StableLmModel"),
+ ("starcoder2", "Starcoder2Model"),
("swiftformer", "SwiftFormerModel"),
("swin", "SwinModel"),
("swin2sr", "Swin2SRModel"),
@@ -465,6 +466,7 @@
("rwkv", "RwkvForCausalLM"),
("speech_to_text_2", "Speech2Text2ForCausalLM"),
("stablelm", "StableLmForCausalLM"),
+ ("starcoder2", "Starcoder2ForCausalLM"),
("transfo-xl", "TransfoXLLMHeadModel"),
("trocr", "TrOCRForCausalLM"),
("whisper", "WhisperForCausalLM"),
@@ -865,6 +867,7 @@
("roformer", "RoFormerForSequenceClassification"),
("squeezebert", "SqueezeBertForSequenceClassification"),
("stablelm", "StableLmForSequenceClassification"),
+ ("starcoder2", "Starcoder2ForSequenceClassification"),
("t5", "T5ForSequenceClassification"),
("tapas", "TapasForSequenceClassification"),
("transfo-xl", "TransfoXLForSequenceClassification"),
diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py
index 373f4e141eb121..2c21f1cd529c74 100644
--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -399,6 +399,7 @@
("SqueezeBertTokenizer", "SqueezeBertTokenizerFast" if is_tokenizers_available() else None),
),
("stablelm", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
+ ("starcoder2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
(
"switch_transformers",
(
diff --git a/src/transformers/models/starcoder2/__init__.py b/src/transformers/models/starcoder2/__init__.py
new file mode 100644
index 00000000000000..a2b25f10090b36
--- /dev/null
+++ b/src/transformers/models/starcoder2/__init__.py
@@ -0,0 +1,62 @@
+# Copyright 2024 BigCode and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+ OptionalDependencyNotAvailable,
+ _LazyModule,
+ is_torch_available,
+)
+
+
+_import_structure = {
+ "configuration_starcoder2": ["STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Starcoder2Config"],
+}
+
+
+try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["modeling_starcoder2"] = [
+ "Starcoder2ForCausalLM",
+ "Starcoder2Model",
+ "Starcoder2PreTrainedModel",
+ "Starcoder2ForSequenceClassification",
+ ]
+
+
+if TYPE_CHECKING:
+ from .configuration_starcoder2 import STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP, Starcoder2Config
+
+ try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .modeling_starcoder2 import (
+ Starcoder2ForCausalLM,
+ Starcoder2ForSequenceClassification,
+ Starcoder2Model,
+ Starcoder2PreTrainedModel,
+ )
+
+
+else:
+ import sys
+
+ sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/starcoder2/configuration_starcoder2.py b/src/transformers/models/starcoder2/configuration_starcoder2.py
new file mode 100644
index 00000000000000..d569ebb4f7ce26
--- /dev/null
+++ b/src/transformers/models/starcoder2/configuration_starcoder2.py
@@ -0,0 +1,147 @@
+# coding=utf-8
+# Copyright 2024 BigCode and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Starcoder2 model configuration"""
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+
+
+class Starcoder2Config(PretrainedConfig):
+ r"""
+ This is the configuration class to store the configuration of a [`Starcoder2Model`]. It is used to instantiate a
+ Starcoder2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+ with the defaults will yield a similar configuration to that of the [bigcode/starcoder2-7b_16k](https://huggingface.co/bigcode/starcoder2-7b_16k) model.
+
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+
+ Args:
+ vocab_size (`int`, *optional*, defaults to 49152):
+            Vocabulary size of the Starcoder2 model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`Starcoder2Model`].
+ hidden_size (`int`, *optional*, defaults to 3072):
+ Dimension of the hidden representations.
+ intermediate_size (`int`, *optional*, defaults to 12288):
+ Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 30):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 24):
+            Number of attention heads for each attention layer in the Transformer decoder.
+ num_key_value_heads (`int`, *optional*, defaults to 2):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by mean-pooling all the original heads within that group. For more details, check out [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `2`.
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
+ The non-linear activation function (function or string) in the decoder.
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
+ The maximum sequence length that this model might ever be used with. Starcoder2's sliding window attention
+            allows sequences of up to 4096*32 tokens.
+        initializer_range (`float`, *optional*, defaults to 0.018042):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        norm_epsilon (`float`, *optional*, defaults to 1e-05):
+            Epsilon value for the layer norm.
+ use_cache (`bool`, *optional*, defaults to `True`):
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
+ relevant if `config.is_decoder=True`.
+ bos_token_id (`int`, *optional*, defaults to 50256):
+ The id of the "beginning-of-sequence" token.
+ eos_token_id (`int`, *optional*, defaults to 50256):
+ The id of the "end-of-sequence" token.
+ rope_theta (`float`, *optional*, defaults to 10000.0):
+ The base period of the RoPE embeddings.
+ sliding_window (`int`, *optional*):
+ Sliding window attention window size. If not specified, will default to `None` (no sliding window).
+ attention_dropout (`float`, *optional*, defaults to 0.0):
+ The dropout ratio for the attention probabilities.
+ residual_dropout (`float`, *optional*, defaults to 0.0):
+ Residual connection dropout value.
+ embedding_dropout (`float`, *optional*, defaults to 0.0):
+ Embedding dropout.
+ use_bias (`bool`, *optional*, defaults to `True`):
+            Whether to use a bias term in the linear layers of the model.
+
+
+ ```python
+ >>> from transformers import Starcoder2Model, Starcoder2Config
+
+ >>> # Initializing a Starcoder2 7B style configuration
+ >>> configuration = Starcoder2Config()
+
+ >>> # Initializing a model from the Starcoder2 7B style configuration
+ >>> model = Starcoder2Model(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```"""
+
+ model_type = "starcoder2"
+ keys_to_ignore_at_inference = ["past_key_values"]
+
+ def __init__(
+ self,
+ vocab_size=49152,
+ hidden_size=3072,
+ intermediate_size=12288,
+ num_hidden_layers=30,
+ num_attention_heads=24,
+ num_key_value_heads=2,
+ hidden_act="gelu_pytorch_tanh",
+ max_position_embeddings=4096,
+ initializer_range=0.018042,
+ norm_epsilon=1e-5,
+ use_cache=True,
+ bos_token_id=50256,
+ eos_token_id=50256,
+ rope_theta=10000.0,
+ sliding_window=None,
+ attention_dropout=0.0,
+ residual_dropout=0.0,
+ embedding_dropout=0.0,
+ use_bias=True,
+ **kwargs,
+ ):
+ self.vocab_size = vocab_size
+ self.max_position_embeddings = max_position_embeddings
+ self.hidden_size = hidden_size
+ self.intermediate_size = intermediate_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.sliding_window = sliding_window
+ self.use_bias = use_bias
+ self.num_key_value_heads = num_key_value_heads
+ self.hidden_act = hidden_act
+ self.initializer_range = initializer_range
+ self.norm_epsilon = norm_epsilon
+ self.use_cache = use_cache
+ self.rope_theta = rope_theta
+ self.attention_dropout = attention_dropout
+ self.residual_dropout = residual_dropout
+ self.embedding_dropout = embedding_dropout
+
+ super().__init__(
+ bos_token_id=bos_token_id,
+ eos_token_id=eos_token_id,
+ **kwargs,
+ )
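+
+
+# Reading aid (a sketch based on the defaults above, assuming they are left
+# untouched): with 24 attention heads, 2 key/value heads and a hidden size of
+# 3072, each key/value head is shared by 24 // 2 = 12 query heads and the
+# per-head dimension is 3072 // 24 = 128.
+#
+#     config = Starcoder2Config()
+#     assert config.num_attention_heads // config.num_key_value_heads == 12
+#     assert config.hidden_size // config.num_attention_heads == 128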
diff --git a/src/transformers/models/starcoder2/modeling_starcoder2.py b/src/transformers/models/starcoder2/modeling_starcoder2.py
new file mode 100644
index 00000000000000..ac0c8fac9c007c
--- /dev/null
+++ b/src/transformers/models/starcoder2/modeling_starcoder2.py
@@ -0,0 +1,1377 @@
+# coding=utf-8
+# Copyright 2024 BigCode and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch Starcoder2 model."""
+import inspect
+import math
+import warnings
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from ...activations import ACT2FN
+from ...cache_utils import Cache, DynamicCache
+from ...modeling_attn_mask_utils import _prepare_4d_causal_attention_mask, _prepare_4d_causal_attention_mask_for_sdpa
+from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+ add_start_docstrings,
+ add_start_docstrings_to_model_forward,
+ is_flash_attn_2_available,
+ is_flash_attn_greater_or_equal_2_10,
+ logging,
+ replace_return_docstrings,
+)
+from .configuration_starcoder2 import Starcoder2Config
+
+
+if is_flash_attn_2_available():
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
+
+ _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
+
+
+logger = logging.get_logger(__name__)
+
+_CONFIG_FOR_DOC = "Starcoder2Config"
+
+
+# Copied from transformers.models.llama.modeling_llama._get_unpad_data
+def _get_unpad_data(attention_mask):
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+ return (
+ indices,
+ cu_seqlens,
+ max_seqlen_in_batch,
+ )
+
+
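+# Worked example (sketch): for a padding mask [[1, 1, 0], [1, 1, 1]] the helper
+# above returns the flattened indices of the non-padding tokens, the cumulative
+# sequence lengths (with a leading zero, the layout the flash-attn varlen kernels
+# expect) and the longest unpadded sequence in the batch.
+#
+#     mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
+#     indices, cu_seqlens, max_seqlen = _get_unpad_data(mask)
+#     # indices    -> tensor([0, 1, 3, 4, 5])
+#     # cu_seqlens -> tensor([0, 2, 5], dtype=torch.int32)
+#     # max_seqlen -> 3
+
+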
+# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Starcoder2
+class Starcoder2RotaryEmbedding(nn.Module):
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+ super().__init__()
+
+ self.dim = dim
+ self.max_position_embeddings = max_position_embeddings
+ self.base = base
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+ # Build here to make `torch.jit.trace` work.
+ self._set_cos_sin_cache(
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+ )
+
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
+ self.max_seq_len_cached = seq_len
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+
+ freqs = torch.outer(t, self.inv_freq)
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
+ emb = torch.cat((freqs, freqs), dim=-1)
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+ def forward(self, x, seq_len=None):
+ # x: [bs, num_attention_heads, seq_len, head_size]
+ if seq_len > self.max_seq_len_cached:
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+
+ return (
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
+ )
+
+
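+# Shape note (sketch): `Starcoder2Attention` below builds this module with
+# dim = head_dim = hidden_size // num_attention_heads, so `cos_cached` and
+# `sin_cached` have shape (max_position_embeddings, head_dim) and `forward`
+# returns the first `seq_len` rows of each, ready to be indexed with
+# `position_ids` in `apply_rotary_pos_emb`.
+#
+#     rope = Starcoder2RotaryEmbedding(dim=128, max_position_embeddings=4096)
+#     cos, sin = rope(torch.randn(1, 24, 16, 128), seq_len=16)
+#     # cos.shape == sin.shape == (16, 128)
+
+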
+# Copied from transformers.models.llama.modeling_llama.rotate_half
+def rotate_half(x):
+ """Rotates half the hidden dims of the input."""
+ x1 = x[..., : x.shape[-1] // 2]
+ x2 = x[..., x.shape[-1] // 2 :]
+ return torch.cat((-x2, x1), dim=-1)
+
+
+# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
+ """Applies Rotary Position Embedding to the query and key tensors.
+
+ Args:
+ q (`torch.Tensor`): The query tensor.
+ k (`torch.Tensor`): The key tensor.
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
+ position_ids (`torch.Tensor`):
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
+ used to pass offsetted position ids when working with a KV-cache.
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+ Returns:
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+ """
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
+ q_embed = (q * cos) + (rotate_half(q) * sin)
+ k_embed = (k * cos) + (rotate_half(k) * sin)
+ return q_embed, k_embed
+
+
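+# Broadcasting note (sketch): with q and k of shape (batch, num_heads, seq_len,
+# head_dim) and position_ids of shape (batch, seq_len), `cos[position_ids]` has
+# shape (batch, seq_len, head_dim); `unsqueeze_dim=1` inserts the missing head
+# axis so the same rotation broadcasts over all heads.
+#
+#     q = k = torch.randn(2, 24, 16, 128)
+#     cos, sin = Starcoder2RotaryEmbedding(128, max_position_embeddings=4096)(q, seq_len=16)
+#     position_ids = torch.arange(16).unsqueeze(0).expand(2, -1)
+#     q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin, position_ids)
+#     # q_rot.shape == k_rot.shape == (2, 24, 16, 128)
+
+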
+class Starcoder2MLP(nn.Module):
+ def __init__(self, config: Starcoder2Config):
+ super().__init__()
+ embed_dim = config.hidden_size
+ self.c_fc = nn.Linear(embed_dim, config.intermediate_size, bias=config.use_bias)
+ self.c_proj = nn.Linear(config.intermediate_size, embed_dim, bias=config.use_bias)
+ self.act = ACT2FN[config.hidden_act]
+ self.residual_dropout = config.residual_dropout
+
+ def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor:
+ hidden_states = self.c_fc(hidden_states)
+ hidden_states = self.act(hidden_states)
+ hidden_states = self.c_proj(hidden_states)
+ hidden_states = nn.functional.dropout(hidden_states, p=self.residual_dropout, training=self.training)
+ return hidden_states
+
+
+# Copied from transformers.models.llama.modeling_llama.repeat_kv
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+ """
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+ """
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+ if n_rep == 1:
+ return hidden_states
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
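+# GQA note (sketch, using the `Starcoder2Config` defaults): with
+# num_key_value_heads=2 and num_attention_heads=24, n_rep = 24 // 2 = 12, so a
+# cached key/value tensor of shape (batch, 2, seq_len, head_dim) is expanded to
+# (batch, 24, seq_len, head_dim) before the attention product.
+#
+#     kv = torch.randn(1, 2, 16, 128)
+#     repeat_kv(kv, n_rep=12).shape  # torch.Size([1, 24, 16, 128])
+
+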
+class Starcoder2Attention(nn.Module):
+ """
+    Multi-headed attention from the 'Attention Is All You Need' paper. Modified to use sliding window attention, as in
+    Longformer and "Generating Long Sequences with Sparse Transformers".
+ """
+
+ def __init__(self, config: Starcoder2Config, layer_idx: Optional[int] = None):
+ super().__init__()
+ self.config = config
+ self.layer_idx = layer_idx
+ if layer_idx is None:
+ logger.warning_once(
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
+ "when creating this class."
+ )
+
+ self.hidden_size = config.hidden_size
+ self.num_heads = config.num_attention_heads
+ self.head_dim = self.hidden_size // self.num_heads
+ self.num_key_value_heads = config.num_key_value_heads
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+ self.max_position_embeddings = config.max_position_embeddings
+ self.rope_theta = config.rope_theta
+ self.use_bias = config.use_bias
+ self.is_causal = True
+ self.attention_dropout = config.attention_dropout
+ self.residual_dropout = config.residual_dropout
+
+ if (self.head_dim * self.num_heads) != self.hidden_size:
+ raise ValueError(
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+ f" and `num_heads`: {self.num_heads})."
+ )
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=self.use_bias)
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=self.use_bias)
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=self.use_bias)
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=self.use_bias)
+
+ self.rotary_emb = Starcoder2RotaryEmbedding(
+ self.head_dim,
+ max_position_embeddings=self.max_position_embeddings,
+ base=self.rope_theta,
+ )
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Cache] = None,
+ output_attentions: bool = False,
+ use_cache: bool = False,
+ **kwargs,
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+ if "padding_mask" in kwargs:
+ warnings.warn(
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+ )
+ bsz, q_len, _ = hidden_states.size()
+
+ query_states = self.q_proj(hidden_states)
+ key_states = self.k_proj(hidden_states)
+ value_states = self.v_proj(hidden_states)
+
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+ kv_seq_len = key_states.shape[-2]
+ if past_key_value is not None:
+ if self.layer_idx is None:
+ raise ValueError(
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+ "with a layer index."
+ )
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+ if past_key_value is not None:
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+ # repeat k/v heads if n_kv_heads < n_heads
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+ raise ValueError(
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+ f" {attn_weights.size()}"
+ )
+
+ if attention_mask is not None:
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+ raise ValueError(
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+ )
+
+ attn_weights = attn_weights + attention_mask
+
+ # upcast attention to fp32
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
+ attn_output = torch.matmul(attn_weights, value_states)
+
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+ raise ValueError(
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+ f" {attn_output.size()}"
+ )
+
+ attn_output = attn_output.transpose(1, 2).contiguous()
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+ attn_output = self.o_proj(attn_output)
+ attn_output = nn.functional.dropout(attn_output, p=self.residual_dropout, training=self.training)
+
+ if not output_attentions:
+ attn_weights = None
+
+ return attn_output, attn_weights, past_key_value
+
+
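+# Computation summary (sketch) for the eager attention path above:
+#
+#     scores = (Q @ K^T) / sqrt(head_dim) + causal_mask   # (bsz, num_heads, q_len, kv_len)
+#     probs  = dropout(softmax(scores, dim=-1))           # softmax is computed in float32
+#     out    = dropout(o_proj(probs @ V))                 # residual dropout on the projected output
+#
+# where K and V are the GQA-expanded tensors returned by `repeat_kv`, and Q and K
+# carry the rotary position embedding applied by `apply_rotary_pos_emb`.
+
+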
+# Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2 with Mistral->Starcoder2
+class Starcoder2FlashAttention2(Starcoder2Attention):
+ """
+    Starcoder2 flash attention module. This module inherits from `Starcoder2Attention` as the weights of the module stay
+    untouched. The only required change is in the forward pass, where it needs to correctly call the public API of
+    flash attention and deal with padding tokens in case the input contains any of them.
+ """
+
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
+        # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+ # Ignore copy
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Cache] = None,
+ output_attentions: bool = False,
+ use_cache: bool = False,
+ **kwargs,
+ ):
+ if "padding_mask" in kwargs:
+ warnings.warn(
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+ )
+
+ # overwrite attention_mask with padding_mask
+ attention_mask = kwargs.pop("padding_mask")
+ bsz, q_len, _ = hidden_states.size()
+
+ query_states = self.q_proj(hidden_states)
+ key_states = self.k_proj(hidden_states)
+ value_states = self.v_proj(hidden_states)
+
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+ kv_seq_len = key_states.shape[-2]
+ if past_key_value is not None:
+ if self.layer_idx is None:
+ raise ValueError(
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+ "with a layer index."
+ )
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+
+ # Because the input can be padded, the absolute sequence length depends on the max position id.
+ rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
+ cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
+
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+ use_sliding_windows = (
+ _flash_supports_window_size
+ and getattr(self.config, "sliding_window", None) is not None
+ and kv_seq_len > self.config.sliding_window
+ )
+
+ if not _flash_supports_window_size:
+ logger.warning_once(
+ "The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
+ " make sure to upgrade flash-attn library."
+ )
+
+ if past_key_value is not None:
+            # Only slice the cache if the config has a `sliding_window` attribute
+ cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
+ if (
+ getattr(self.config, "sliding_window", None) is not None
+ and kv_seq_len > self.config.sliding_window
+ and cache_has_contents
+ ):
+ slicing_tokens = 1 - self.config.sliding_window
+
+ past_key = past_key_value[self.layer_idx][0]
+ past_value = past_key_value[self.layer_idx][1]
+
+ past_key = past_key[:, :, slicing_tokens:, :].contiguous()
+ past_value = past_value[:, :, slicing_tokens:, :].contiguous()
+
+ if past_key.shape[-2] != self.config.sliding_window - 1:
+ raise ValueError(
+ f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
+ f" {past_key.shape}"
+ )
+
+ if attention_mask is not None:
+ attention_mask = attention_mask[:, slicing_tokens:]
+ attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
+
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+ # repeat k/v heads if n_kv_heads < n_heads
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
+ dropout_rate = 0.0 if not self.training else self.attention_dropout
+
+        # In PEFT, we usually cast the layer norms to float32 for training stability reasons;
+        # therefore the input hidden states get silently cast to float32. Hence, we need to
+        # cast them back to the expected dtype (e.g. float16) just to be sure everything works as expected.
+ input_dtype = query_states.dtype
+ if input_dtype == torch.float32:
+ if torch.is_autocast_enabled():
+ target_dtype = torch.get_autocast_gpu_dtype()
+ # Handle the case where the model is quantized
+ elif hasattr(self.config, "_pre_quantization_dtype"):
+ target_dtype = self.config._pre_quantization_dtype
+ else:
+ target_dtype = self.q_proj.weight.dtype
+
+ logger.warning_once(
+ f"The input hidden states seems to be silently casted in float32, this might be related to"
+ f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
+ f" {target_dtype}."
+ )
+
+ query_states = query_states.to(target_dtype)
+ key_states = key_states.to(target_dtype)
+ value_states = value_states.to(target_dtype)
+
+        # Reshape to the expected shape for Flash Attention
+ query_states = query_states.transpose(1, 2)
+ key_states = key_states.transpose(1, 2)
+ value_states = value_states.transpose(1, 2)
+
+ attn_output = self._flash_attention_forward(
+ query_states,
+ key_states,
+ value_states,
+ attention_mask,
+ q_len,
+ dropout=dropout_rate,
+ use_sliding_windows=use_sliding_windows,
+ )
+
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
+ attn_output = self.o_proj(attn_output)
+ attn_output = nn.functional.dropout(attn_output, p=self.residual_dropout, training=self.training)
+
+ if not output_attentions:
+ attn_weights = None
+
+ return attn_output, attn_weights, past_key_value
+
+ def _flash_attention_forward(
+ self,
+ query_states,
+ key_states,
+ value_states,
+ attention_mask,
+ query_length,
+ dropout=0.0,
+ softmax_scale=None,
+ use_sliding_windows=False,
+ ):
+ """
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
+ first unpad the input, then computes the attention scores and pad the final attention scores.
+
+ Args:
+ query_states (`torch.Tensor`):
+ Input query states to be passed to Flash Attention API
+ key_states (`torch.Tensor`):
+ Input key states to be passed to Flash Attention API
+ value_states (`torch.Tensor`):
+ Input value states to be passed to Flash Attention API
+ attention_mask (`torch.Tensor`):
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
+ position of padding tokens and 1 for the position of non-padding tokens.
+            dropout (`float`, *optional*):
+                Attention dropout probability.
+            softmax_scale (`float`, *optional*):
+                The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim).
+ use_sliding_windows (`bool`, *optional*):
+ Whether to activate sliding window attention.
+ """
+ if not self._flash_attn_uses_top_left_mask:
+ causal = self.is_causal
+ else:
+ # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
+ causal = self.is_causal and query_length != 1
+
+ # Contains at least one padding token in the sequence
+ if attention_mask is not None:
+ batch_size = query_states.shape[0]
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
+ query_states, key_states, value_states, attention_mask, query_length
+ )
+
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+
+ if not use_sliding_windows:
+ attn_output_unpad = flash_attn_varlen_func(
+ query_states,
+ key_states,
+ value_states,
+ cu_seqlens_q=cu_seqlens_q,
+ cu_seqlens_k=cu_seqlens_k,
+ max_seqlen_q=max_seqlen_in_batch_q,
+ max_seqlen_k=max_seqlen_in_batch_k,
+ dropout_p=dropout,
+ softmax_scale=softmax_scale,
+ causal=causal,
+ )
+ else:
+ attn_output_unpad = flash_attn_varlen_func(
+ query_states,
+ key_states,
+ value_states,
+ cu_seqlens_q=cu_seqlens_q,
+ cu_seqlens_k=cu_seqlens_k,
+ max_seqlen_q=max_seqlen_in_batch_q,
+ max_seqlen_k=max_seqlen_in_batch_k,
+ dropout_p=dropout,
+ softmax_scale=softmax_scale,
+ causal=causal,
+ window_size=(self.config.sliding_window, self.config.sliding_window),
+ )
+
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+ else:
+ if not use_sliding_windows:
+ attn_output = flash_attn_func(
+ query_states,
+ key_states,
+ value_states,
+ dropout,
+ softmax_scale=softmax_scale,
+ causal=causal,
+ )
+ else:
+ attn_output = flash_attn_func(
+ query_states,
+ key_states,
+ value_states,
+ dropout,
+ softmax_scale=softmax_scale,
+ causal=causal,
+ window_size=(self.config.sliding_window, self.config.sliding_window),
+ )
+
+ return attn_output
+
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
+ batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
+
+ # On the first iteration we need to properly re-create the padding mask
+        # by slicing it at the proper place
+ if kv_seq_len != attention_mask.shape[-1]:
+ attention_mask_num_tokens = attention_mask.shape[-1]
+ attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :]
+
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
+
+ key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
+ value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
+
+ if query_length == kv_seq_len:
+ query_layer = index_first_axis(
+ query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
+ )
+ cu_seqlens_q = cu_seqlens_k
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
+ indices_q = indices_k
+ elif query_length == 1:
+ max_seqlen_in_batch_q = 1
+ cu_seqlens_q = torch.arange(
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
+ ) # There is a memcpy here, that is very bad.
+ indices_q = cu_seqlens_q[:-1]
+ query_layer = query_layer.squeeze(1)
+ else:
+ # The -q_len: slice assumes left padding.
+ attention_mask = attention_mask[:, -query_length:]
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
+
+ return (
+ query_layer,
+ key_layer,
+ value_layer,
+ indices_q,
+ (cu_seqlens_q, cu_seqlens_k),
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
+ )
+
+
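+# Sliding-window cache note (sketch): when `config.sliding_window` is set and the
+# cache has grown past it, the flash-attention path above keeps only the most
+# recent `sliding_window - 1` cached key/value positions (`slicing_tokens =
+# 1 - sliding_window`), e.g. with sliding_window=4096 the cache is trimmed to the
+# last 4095 positions before the new tokens are appended. `_upad_input` then strips
+# padding so that the varlen flash-attn kernels only see real tokens.
+
+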
+# Copied from transformers.models.mistral.modeling_mistral.MistralSdpaAttention with Mistral->Starcoder2
+class Starcoder2SdpaAttention(Starcoder2Attention):
+ """
+ Starcoder2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+    `Starcoder2Attention` as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to
+    the SDPA API.
+ """
+
+ # Ignore copy
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Cache] = None,
+ output_attentions: bool = False,
+ use_cache: bool = False,
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+ if output_attentions:
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
+ logger.warning_once(
+ "Starcoder2Model is using Starcoder2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+ )
+ return super().forward(
+ hidden_states=hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_value,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ )
+
+ bsz, q_len, _ = hidden_states.size()
+
+ query_states = self.q_proj(hidden_states)
+ key_states = self.k_proj(hidden_states)
+ value_states = self.v_proj(hidden_states)
+
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+ kv_seq_len = key_states.shape[-2]
+ if past_key_value is not None:
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+ if past_key_value is not None:
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+ if attention_mask is not None:
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+ raise ValueError(
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+ )
+
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
+ if query_states.device.type == "cuda" and attention_mask is not None:
+ query_states = query_states.contiguous()
+ key_states = key_states.contiguous()
+ value_states = value_states.contiguous()
+
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
+ query_states,
+ key_states,
+ value_states,
+ attn_mask=attention_mask,
+ dropout_p=self.attention_dropout if self.training else 0.0,
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
+ )
+
+ attn_output = attn_output.transpose(1, 2).contiguous()
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+ attn_output = self.o_proj(attn_output)
+        # The difference with Mistral is that Starcoder2 applies residual dropout to the attention output
+ attn_output = nn.functional.dropout(attn_output, p=self.residual_dropout, training=self.training)
+
+ return attn_output, None, past_key_value
+
+
+STARCODER2_ATTENTION_CLASSES = {
+ "eager": Starcoder2Attention,
+ "flash_attention_2": Starcoder2FlashAttention2,
+ "sdpa": Starcoder2SdpaAttention,
+}
+
+
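+# Dispatch note (sketch): `Starcoder2DecoderLayer` below selects its attention
+# implementation from this mapping via `config._attn_implementation`, which is
+# typically resolved from the `attn_implementation` argument passed to
+# `from_pretrained` ("eager", "sdpa" or "flash_attention_2").
+#
+#     attn_cls = STARCODER2_ATTENTION_CLASSES["sdpa"]  # -> Starcoder2SdpaAttention
+#     self_attn = attn_cls(config, layer_idx=0)
+
+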
+class Starcoder2DecoderLayer(nn.Module):
+ def __init__(self, config: Starcoder2Config, layer_idx: int):
+ super().__init__()
+ self.hidden_size = config.hidden_size
+
+ self.self_attn = STARCODER2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
+
+ self.mlp = Starcoder2MLP(config)
+
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
+
+ # Copied from transformers.models.mistral.modeling_mistral.MistralDecoderLayer.forward
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
+ output_attentions: Optional[bool] = False,
+ use_cache: Optional[bool] = False,
+ **kwargs,
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+ if "padding_mask" in kwargs:
+ warnings.warn(
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+ )
+ """
+ Args:
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+ `(batch, sequence_length)` where padding elements are indicated by 0.
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+ returned tensors for more detail.
+ use_cache (`bool`, *optional*):
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+ (see `past_key_values`).
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+ """
+
+ residual = hidden_states
+
+ hidden_states = self.input_layernorm(hidden_states)
+
+ # Self Attention
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
+ hidden_states=hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_value,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ )
+ hidden_states = residual + hidden_states
+
+ # Fully Connected
+ residual = hidden_states
+ hidden_states = self.post_attention_layernorm(hidden_states)
+ hidden_states = self.mlp(hidden_states)
+ hidden_states = residual + hidden_states
+
+ outputs = (hidden_states,)
+
+ if output_attentions:
+ outputs += (self_attn_weights,)
+
+ if use_cache:
+ outputs += (present_key_value,)
+
+ return outputs
+
+
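+# Layer structure (sketch): the decoder layer above is a standard pre-norm block,
+#
+#     h = h + SelfAttention(LayerNorm(h))
+#     h = h + MLP(LayerNorm(h))
+#
+# using plain `nn.LayerNorm` (no RMSNorm) with the residual dropout already applied
+# inside the attention and MLP modules.
+
+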
+STARCODER2_START_DOCSTRING = r"""
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
+    etc.)
+
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+ and behavior.
+
+ Parameters:
+ config ([`Starcoder2Config`]):
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
+ load the weights associated with the model, only the configuration. Check out the
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+@add_start_docstrings(
+ "The bare Starcoder2 Model outputting raw hidden-states without any specific head on top.",
+ STARCODER2_START_DOCSTRING,
+)
+# Copied from transformers.models.mistral.modeling_mistral.MistralPreTrainedModel with Mistral->Starcoder2
+class Starcoder2PreTrainedModel(PreTrainedModel):
+ config_class = Starcoder2Config
+ base_model_prefix = "model"
+ supports_gradient_checkpointing = True
+ _no_split_modules = ["Starcoder2DecoderLayer"]
+ _skip_keys_device_placement = "past_key_values"
+ _supports_flash_attn_2 = True
+ _supports_sdpa = True
+ _supports_cache_class = True
+
+ def _init_weights(self, module):
+ std = self.config.initializer_range
+ if isinstance(module, nn.Linear):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.Embedding):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.padding_idx is not None:
+ module.weight.data[module.padding_idx].zero_()
+
+
+STARCODER2_INPUTS_DOCSTRING = r"""
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+ it.
+
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+ [`PreTrainedTokenizer.__call__`] for details.
+
+ [What are input IDs?](../glossary#input-ids)
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+ - 1 for tokens that are **not masked**,
+ - 0 for tokens that are **masked**.
+
+ [What are attention masks?](../glossary#attention-mask)
+
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+ [`PreTrainedTokenizer.__call__`] for details.
+
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
+ `past_key_values`).
+
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+            and modify it to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+ information on the default strategy.
+
+ - 1 indicates the head is **not masked**,
+ - 0 indicates the head is **masked**.
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+ config.n_positions - 1]`.
+
+ [What are position IDs?](../glossary#position-ids)
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+            blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+
+ Two formats are allowed:
+ - a [`~cache_utils.Cache`] instance;
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
+ cache format.
+
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
+ legacy cache format will be returned.
+
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+ of shape `(batch_size, sequence_length)`.
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+ model's internal embedding lookup matrix.
+ use_cache (`bool`, *optional*):
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+ `past_key_values`).
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+ tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+ "The bare Starcoder2 Model outputting raw hidden-states without any specific head on top.",
+ STARCODER2_START_DOCSTRING,
+)
+class Starcoder2Model(Starcoder2PreTrainedModel):
+ """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Starcoder2DecoderLayer`].
+
+ Args:
+ config: Starcoder2Config
+ """
+
+ def __init__(self, config: Starcoder2Config):
+ super().__init__(config)
+ self.padding_idx = config.pad_token_id
+ self.vocab_size = config.vocab_size
+
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+ self.embedding_dropout = config.embedding_dropout
+ self.layers = nn.ModuleList(
+ [Starcoder2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+ )
+ self._attn_implementation = config._attn_implementation
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
+ self.gradient_checkpointing = False
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.embed_tokens = value
+
+ @add_start_docstrings_to_model_forward(STARCODER2_INPUTS_DOCSTRING)
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ # retrieve input_ids and inputs_embeds
+ if input_ids is not None and inputs_embeds is not None:
+ raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
+ elif input_ids is not None:
+ batch_size, seq_length = input_ids.shape
+ elif inputs_embeds is not None:
+ batch_size, seq_length, _ = inputs_embeds.shape
+ else:
+ raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
+
+ if self.gradient_checkpointing and self.training:
+ if use_cache:
+ logger.warning_once(
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+ )
+ use_cache = False
+
+ past_key_values_length = 0
+
+ if use_cache:
+ use_legacy_cache = not isinstance(past_key_values, Cache)
+ if use_legacy_cache:
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
+
+ if position_ids is None:
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
+ position_ids = torch.arange(
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
+ )
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
+ else:
+ position_ids = position_ids.view(-1, seq_length).long()
+
+ if inputs_embeds is None:
+ inputs_embeds = self.embed_tokens(input_ids)
+
+ if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
+ is_padding_right = attention_mask[:, -1].sum().item() != batch_size
+ if is_padding_right:
+ raise ValueError(
+ "You are attempting to perform batched generation with padding_side='right'"
+ " this may lead to unexpected behaviour for Flash Attention version of Starcoder2. Make sure to "
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
+ )
+
+ if self._attn_implementation == "flash_attention_2":
+ # 2d mask is passed through the layers
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
+ elif self._attn_implementation == "sdpa" and not output_attentions:
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
+ # the manual implementation that requires a 4D causal mask in all cases.
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
+ attention_mask,
+ (batch_size, seq_length),
+ inputs_embeds,
+ past_key_values_length,
+ )
+ else:
+ # 4d mask is passed through the layers
+ attention_mask = _prepare_4d_causal_attention_mask(
+ attention_mask,
+ (batch_size, seq_length),
+ inputs_embeds,
+ past_key_values_length,
+ sliding_window=self.config.sliding_window,
+ )
+
+ hidden_states = inputs_embeds
+ hidden_states = nn.functional.dropout(hidden_states, p=self.embedding_dropout, training=self.training)
+
+ # decoder layers
+ all_hidden_states = () if output_hidden_states else None
+ all_self_attns = () if output_attentions else None
+ next_decoder_cache = None
+
+ for decoder_layer in self.layers:
+ if output_hidden_states:
+ all_hidden_states += (hidden_states,)
+
+ if self.gradient_checkpointing and self.training:
+ layer_outputs = self._gradient_checkpointing_func(
+ decoder_layer.__call__,
+ hidden_states,
+ attention_mask,
+ position_ids,
+ past_key_values,
+ output_attentions,
+ use_cache,
+ )
+ else:
+ layer_outputs = decoder_layer(
+ hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_values,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ )
+
+ hidden_states = layer_outputs[0]
+
+ if use_cache:
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+
+ if output_attentions:
+ all_self_attns += (layer_outputs[1],)
+
+ hidden_states = self.norm(hidden_states)
+
+ # add hidden states from the last decoder layer
+ if output_hidden_states:
+ all_hidden_states += (hidden_states,)
+
+ next_cache = None
+ if use_cache:
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
+
+ if not return_dict:
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+ return BaseModelOutputWithPast(
+ last_hidden_state=hidden_states,
+ past_key_values=next_cache,
+ hidden_states=all_hidden_states,
+ attentions=all_self_attns,
+ )
+
+
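+# Usage sketch for the bare model (assumes a deliberately tiny, randomly
+# initialized configuration so the snippet is cheap to run; real checkpoints are
+# loaded with `from_pretrained` instead):
+#
+#     config = Starcoder2Config(
+#         hidden_size=64, intermediate_size=256, num_hidden_layers=2,
+#         num_attention_heads=4, num_key_value_heads=2
+#     )
+#     model = Starcoder2Model(config)
+#     input_ids = torch.randint(0, config.vocab_size, (1, 8))
+#     outputs = model(input_ids)
+#     # outputs.last_hidden_state.shape == (1, 8, 64)
+
+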
+# Copied from transformers.models.mistral.modeling_mistral.MistralForCausalLM with MISTRAL->STARCODER2,Mistral-7B-v0.1->starcoder2-7b_16k,Mistral->Starcoder2,mistralai->bigcode
+class Starcoder2ForCausalLM(Starcoder2PreTrainedModel):
+ _tied_weights_keys = ["lm_head.weight"]
+
+ def __init__(self, config):
+ super().__init__(config)
+ self.model = Starcoder2Model(config)
+ self.vocab_size = config.vocab_size
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.model.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.model.embed_tokens = value
+
+ def get_output_embeddings(self):
+ return self.lm_head
+
+ def set_output_embeddings(self, new_embeddings):
+ self.lm_head = new_embeddings
+
+ def set_decoder(self, decoder):
+ self.model = decoder
+
+ def get_decoder(self):
+ return self.model
+
+ @add_start_docstrings_to_model_forward(STARCODER2_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
+ r"""
+ Args:
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+ Returns:
+
+ Example:
+
+ ```python
+ >>> from transformers import AutoTokenizer, Starcoder2ForCausalLM
+
+ >>> model = Starcoder2ForCausalLM.from_pretrained("bigcode/starcoder2-7b_16k")
+ >>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b_16k")
+
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+ >>> # Generate
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+ ```"""
+
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+ outputs = self.model(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ hidden_states = outputs[0]
+ logits = self.lm_head(hidden_states)
+ logits = logits.float()
+
+ loss = None
+ if labels is not None:
+ # Shift so that tokens < n predict n
+ shift_logits = logits[..., :-1, :].contiguous()
+ shift_labels = labels[..., 1:].contiguous()
+ # Flatten the tokens
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
+ shift_labels = shift_labels.view(-1)
+ # Ensure tensors are on the same device
+ shift_labels = shift_labels.to(shift_logits.device)
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(shift_logits, shift_labels)
+
+ if not return_dict:
+ output = (logits,) + outputs[1:]
+ return (loss,) + output if loss is not None else output
+
+ return CausalLMOutputWithPast(
+ loss=loss,
+ logits=logits,
+ past_key_values=outputs.past_key_values,
+ hidden_states=outputs.hidden_states,
+ attentions=outputs.attentions,
+ )
+
+ def prepare_inputs_for_generation(
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+ ):
+ # Omit tokens covered by past_key_values
+ if past_key_values is not None:
+ if isinstance(past_key_values, Cache):
+ cache_length = past_key_values.get_seq_length()
+ past_length = past_key_values.seen_tokens
+ max_cache_length = past_key_values.get_max_length()
+ else:
+ cache_length = past_length = past_key_values[0][0].shape[2]
+ max_cache_length = None
+
+ # Keep only the unprocessed tokens:
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
+ # input)
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
+ # input_ids based on the past_length.
+ elif past_length < input_ids.shape[1]:
+ input_ids = input_ids[:, past_length:]
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
+
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
+ if (
+ max_cache_length is not None
+ and attention_mask is not None
+ and cache_length + input_ids.shape[1] > max_cache_length
+ ):
+ attention_mask = attention_mask[:, -max_cache_length:]
+
+ position_ids = kwargs.get("position_ids", None)
+ if attention_mask is not None and position_ids is None:
+ # create position_ids on the fly for batch generation
+ position_ids = attention_mask.long().cumsum(-1) - 1
+ position_ids.masked_fill_(attention_mask == 0, 1)
+ if past_key_values:
+ position_ids = position_ids[:, -input_ids.shape[1] :]
+
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+ if inputs_embeds is not None and past_key_values is None:
+ model_inputs = {"inputs_embeds": inputs_embeds}
+ else:
+ model_inputs = {"input_ids": input_ids}
+
+ model_inputs.update(
+ {
+ "position_ids": position_ids,
+ "past_key_values": past_key_values,
+ "use_cache": kwargs.get("use_cache"),
+ "attention_mask": attention_mask,
+ }
+ )
+ return model_inputs
+
+ @staticmethod
+ def _reorder_cache(past_key_values, beam_idx):
+ reordered_past = ()
+ for layer_past in past_key_values:
+ reordered_past += (
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+ )
+ return reordered_past
+
+
+@add_start_docstrings(
+ """
+ The Starcoder2 Model transformer with a sequence classification head on top (linear layer).
+
+ [`Starcoder2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+ (e.g. GPT-2) do.
+
+    Since it does classification on the last token, it needs to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (takes the last value in
+    each row of the batch).
+ """,
+ STARCODER2_START_DOCSTRING,
+)
+# Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Starcoder2, LLAMA->STARCODER2
+class Starcoder2ForSequenceClassification(Starcoder2PreTrainedModel):
+ def __init__(self, config):
+ super().__init__(config)
+ self.num_labels = config.num_labels
+ self.model = Starcoder2Model(config)
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.model.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.model.embed_tokens = value
+
+ @add_start_docstrings_to_model_forward(STARCODER2_INPUTS_DOCSTRING)
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+ r"""
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+ """
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ transformer_outputs = self.model(
+ input_ids,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+ hidden_states = transformer_outputs[0]
+ logits = self.score(hidden_states)
+
+ if input_ids is not None:
+ batch_size = input_ids.shape[0]
+ else:
+ batch_size = inputs_embeds.shape[0]
+
+ if self.config.pad_token_id is None and batch_size != 1:
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+ if self.config.pad_token_id is None:
+ sequence_lengths = -1
+ else:
+ if input_ids is not None:
+ # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
+ sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
+ sequence_lengths = sequence_lengths % input_ids.shape[-1]
+ sequence_lengths = sequence_lengths.to(logits.device)
+ else:
+ sequence_lengths = -1
+
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+ loss = None
+ if labels is not None:
+ labels = labels.to(logits.device)
+ if self.config.problem_type is None:
+ if self.num_labels == 1:
+ self.config.problem_type = "regression"
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+ self.config.problem_type = "single_label_classification"
+ else:
+ self.config.problem_type = "multi_label_classification"
+
+ if self.config.problem_type == "regression":
+ loss_fct = MSELoss()
+ if self.num_labels == 1:
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+ else:
+ loss = loss_fct(pooled_logits, labels)
+ elif self.config.problem_type == "single_label_classification":
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
+ elif self.config.problem_type == "multi_label_classification":
+ loss_fct = BCEWithLogitsLoss()
+ loss = loss_fct(pooled_logits, labels)
+ if not return_dict:
+ output = (pooled_logits,) + transformer_outputs[1:]
+ return ((loss,) + output) if loss is not None else output
+
+ return SequenceClassifierOutputWithPast(
+ loss=loss,
+ logits=pooled_logits,
+ past_key_values=transformer_outputs.past_key_values,
+ hidden_states=transformer_outputs.hidden_states,
+ attentions=transformer_outputs.attentions,
+ )
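The pooling rule spelled out in the `Starcoder2ForSequenceClassification` docstring above (take the last token, or the last non-padding token when `pad_token_id` is set) can be exercised in isolation. The following is a minimal sketch with toy tensors and an assumed `pad_token_id` of 0; it is not part of the patch itself.

```python
import torch

# Toy setup: pad_token_id and the tensors below are made up for illustration.
pad_token_id = 0
input_ids = torch.tensor([[5, 7, 9, 0, 0],   # padded row
                          [3, 4, 6, 8, 2]])  # full row, no padding
batch_size, seq_len = input_ids.shape
logits = torch.randn(batch_size, seq_len, 3)  # (batch, seq, num_labels)

# Index of the first pad token minus one; the modulo keeps rows without any
# padding pointing at the final position (same ONNX-friendly trick as above).
sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % seq_len

pooled_logits = logits[torch.arange(batch_size), sequence_lengths]
print(pooled_logits.shape)  # torch.Size([2, 3])
```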
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index 3ba08016855cb3..5c635cf7af2c1c 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -7895,6 +7895,34 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
+class Starcoder2ForCausalLM(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class Starcoder2ForSequenceClassification(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class Starcoder2Model(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class Starcoder2PreTrainedModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
diff --git a/tests/models/starcoder2/__init__.py b/tests/models/starcoder2/__init__.py
new file mode 100644
index 00000000000000..e69de29bb2d1d6
diff --git a/tests/models/starcoder2/test_modeling_starcoder2.py b/tests/models/starcoder2/test_modeling_starcoder2.py
new file mode 100644
index 00000000000000..dfedb2ed788a47
--- /dev/null
+++ b/tests/models/starcoder2/test_modeling_starcoder2.py
@@ -0,0 +1,549 @@
+# coding=utf-8
+# Copyright 2024 BigCode and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch Starcoder2 model. """
+
+
+import tempfile
+import unittest
+
+import pytest
+
+from transformers import Starcoder2Config, is_torch_available
+from transformers.testing_utils import (
+ require_bitsandbytes,
+ require_flash_attn,
+ require_torch,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ...generation.test_utils import GenerationTesterMixin
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, ids_tensor
+from ...test_pipeline_mixin import PipelineTesterMixin
+
+
+if is_torch_available():
+ import torch
+
+ from transformers import (
+ AutoTokenizer,
+ Starcoder2ForCausalLM,
+ Starcoder2ForSequenceClassification,
+ Starcoder2Model,
+ )
+
+
+# Copied from transformers.tests.models.mistral.test_modeling_mistral.MistralModelTester with Mistral->Starcoder2
+class Starcoder2ModelTester:
+ def __init__(
+ self,
+ parent,
+ batch_size=13,
+ seq_length=7,
+ is_training=True,
+ use_input_mask=True,
+ use_token_type_ids=False,
+ use_labels=True,
+ vocab_size=99,
+ hidden_size=32,
+ num_hidden_layers=2,
+ num_attention_heads=4,
+ num_key_value_heads=2,
+ intermediate_size=37,
+ hidden_act="gelu",
+ hidden_dropout_prob=0.1,
+ attention_probs_dropout_prob=0.1,
+ max_position_embeddings=512,
+ type_vocab_size=16,
+ type_sequence_label_size=2,
+ initializer_range=0.02,
+ num_labels=3,
+ num_choices=4,
+ pad_token_id=0,
+ scope=None,
+ ):
+ self.parent = parent
+ self.batch_size = batch_size
+ self.seq_length = seq_length
+ self.is_training = is_training
+ self.use_input_mask = use_input_mask
+ self.use_token_type_ids = use_token_type_ids
+ self.use_labels = use_labels
+ self.vocab_size = vocab_size
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.num_key_value_heads = num_key_value_heads
+ self.intermediate_size = intermediate_size
+ self.hidden_act = hidden_act
+ self.hidden_dropout_prob = hidden_dropout_prob
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
+ self.max_position_embeddings = max_position_embeddings
+ self.type_vocab_size = type_vocab_size
+ self.type_sequence_label_size = type_sequence_label_size
+ self.initializer_range = initializer_range
+ self.num_labels = num_labels
+ self.num_choices = num_choices
+ self.pad_token_id = pad_token_id
+ self.scope = scope
+
+ # Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.prepare_config_and_inputs
+ def prepare_config_and_inputs(self):
+ input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+ input_mask = None
+ if self.use_input_mask:
+ input_mask = torch.tril(torch.ones(self.batch_size, self.seq_length)).to(torch_device)
+
+ token_type_ids = None
+ if self.use_token_type_ids:
+ token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
+
+ sequence_labels = None
+ token_labels = None
+ choice_labels = None
+ if self.use_labels:
+ sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+ token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+ choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+ config = self.get_config()
+
+ return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+
+ # Ignore copy
+ def get_config(self):
+ return Starcoder2Config(
+ vocab_size=self.vocab_size,
+ hidden_size=self.hidden_size,
+ num_hidden_layers=self.num_hidden_layers,
+ num_attention_heads=self.num_attention_heads,
+ num_key_value_heads=self.num_key_value_heads,
+ intermediate_size=self.intermediate_size,
+ hidden_act=self.hidden_act,
+ hidden_dropout_prob=self.hidden_dropout_prob,
+ attention_probs_dropout_prob=self.attention_probs_dropout_prob,
+ max_position_embeddings=self.max_position_embeddings,
+ type_vocab_size=self.type_vocab_size,
+ is_decoder=False,
+ initializer_range=self.initializer_range,
+ pad_token_id=self.pad_token_id,
+ eos_token_id=self.pad_token_id,
+ bos_token_id=self.pad_token_id,
+ )
+
+ # Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.create_and_check_model with Llama->Starcoder2
+ def create_and_check_model(
+ self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+ ):
+ model = Starcoder2Model(config=config)
+ model.to(torch_device)
+ model.eval()
+ result = model(input_ids, attention_mask=input_mask)
+ result = model(input_ids)
+ self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+
+ # Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.create_and_check_model_as_decoder with Llama->Starcoder2
+ def create_and_check_model_as_decoder(
+ self,
+ config,
+ input_ids,
+ token_type_ids,
+ input_mask,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ encoder_hidden_states,
+ encoder_attention_mask,
+ ):
+ config.add_cross_attention = True
+ model = Starcoder2Model(config)
+ model.to(torch_device)
+ model.eval()
+ result = model(
+ input_ids,
+ attention_mask=input_mask,
+ encoder_hidden_states=encoder_hidden_states,
+ encoder_attention_mask=encoder_attention_mask,
+ )
+ result = model(
+ input_ids,
+ attention_mask=input_mask,
+ encoder_hidden_states=encoder_hidden_states,
+ )
+ result = model(input_ids, attention_mask=input_mask)
+ self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+
+ # Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.create_and_check_for_causal_lm with Llama->Starcoder2
+ def create_and_check_for_causal_lm(
+ self,
+ config,
+ input_ids,
+ token_type_ids,
+ input_mask,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ encoder_hidden_states,
+ encoder_attention_mask,
+ ):
+ model = Starcoder2ForCausalLM(config=config)
+ model.to(torch_device)
+ model.eval()
+ result = model(input_ids, attention_mask=input_mask, labels=token_labels)
+ self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
+
+ # Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.create_and_check_decoder_model_past_large_inputs with Llama->Starcoder2
+ def create_and_check_decoder_model_past_large_inputs(
+ self,
+ config,
+ input_ids,
+ token_type_ids,
+ input_mask,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ encoder_hidden_states,
+ encoder_attention_mask,
+ ):
+ config.is_decoder = True
+ config.add_cross_attention = True
+ model = Starcoder2ForCausalLM(config=config)
+ model.to(torch_device)
+ model.eval()
+
+ # first forward pass
+ outputs = model(
+ input_ids,
+ attention_mask=input_mask,
+ encoder_hidden_states=encoder_hidden_states,
+ encoder_attention_mask=encoder_attention_mask,
+ use_cache=True,
+ )
+ past_key_values = outputs.past_key_values
+
+        # create hypothetical multiple next tokens and extend next_input_ids accordingly
+ next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
+ next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)
+
+        # append the new tokens to input_ids and the attention mask
+ next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
+ next_attention_mask = torch.cat([input_mask, next_mask], dim=-1)
+
+ output_from_no_past = model(
+ next_input_ids,
+ attention_mask=next_attention_mask,
+ encoder_hidden_states=encoder_hidden_states,
+ encoder_attention_mask=encoder_attention_mask,
+ output_hidden_states=True,
+ )["hidden_states"][0]
+ output_from_past = model(
+ next_tokens,
+ attention_mask=next_attention_mask,
+ encoder_hidden_states=encoder_hidden_states,
+ encoder_attention_mask=encoder_attention_mask,
+ past_key_values=past_key_values,
+ output_hidden_states=True,
+ )["hidden_states"][0]
+
+ # select random slice
+ random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+ output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+ output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+
+ self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
+
+ # test that outputs are equal for slice
+ self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+ # Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.prepare_config_and_inputs_for_common
+ def prepare_config_and_inputs_for_common(self):
+ config_and_inputs = self.prepare_config_and_inputs()
+ (
+ config,
+ input_ids,
+ token_type_ids,
+ input_mask,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ ) = config_and_inputs
+ inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
+ return config, inputs_dict
+
+
+@require_torch
+# Copied from transformers.tests.models.mistral.test_modeling_mistral.MistralModelTest with Mistral->Starcoder2
+class Starcoder2ModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ all_model_classes = (
+ (Starcoder2Model, Starcoder2ForCausalLM, Starcoder2ForSequenceClassification) if is_torch_available() else ()
+ )
+ all_generative_model_classes = (Starcoder2ForCausalLM,) if is_torch_available() else ()
+ pipeline_model_mapping = (
+ {
+ "feature-extraction": Starcoder2Model,
+ "text-classification": Starcoder2ForSequenceClassification,
+ "text-generation": Starcoder2ForCausalLM,
+ "zero-shot": Starcoder2ForSequenceClassification,
+ }
+ if is_torch_available()
+ else {}
+ )
+ test_headmasking = False
+ test_pruning = False
+
+ # TODO (ydshieh): Check this. See https://app.circleci.com/pipelines/github/huggingface/transformers/79245/workflows/9490ef58-79c2-410d-8f51-e3495156cf9c/jobs/1012146
+ def is_pipeline_test_to_skip(
+ self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
+ ):
+ return True
+
+ def setUp(self):
+ self.model_tester = Starcoder2ModelTester(self)
+ self.config_tester = ConfigTester(self, config_class=Starcoder2Config, hidden_size=37)
+
+ def test_config(self):
+ self.config_tester.run_common_tests()
+
+ def test_model(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_model(*config_and_inputs)
+
+ def test_model_various_embeddings(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ for type in ["absolute", "relative_key", "relative_key_query"]:
+ config_and_inputs[0].position_embedding_type = type
+ self.model_tester.create_and_check_model(*config_and_inputs)
+
+ def test_Starcoder2_sequence_classification_model(self):
+ config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
+ config.num_labels = 3
+ input_ids = input_dict["input_ids"]
+ attention_mask = input_ids.ne(1).to(torch_device)
+ sequence_labels = ids_tensor([self.model_tester.batch_size], self.model_tester.type_sequence_label_size)
+ model = Starcoder2ForSequenceClassification(config)
+ model.to(torch_device)
+ model.eval()
+ result = model(input_ids, attention_mask=attention_mask, labels=sequence_labels)
+ self.assertEqual(result.logits.shape, (self.model_tester.batch_size, self.model_tester.num_labels))
+
+ def test_Starcoder2_sequence_classification_model_for_single_label(self):
+ config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
+ config.num_labels = 3
+ config.problem_type = "single_label_classification"
+ input_ids = input_dict["input_ids"]
+ attention_mask = input_ids.ne(1).to(torch_device)
+ sequence_labels = ids_tensor([self.model_tester.batch_size], self.model_tester.type_sequence_label_size)
+ model = Starcoder2ForSequenceClassification(config)
+ model.to(torch_device)
+ model.eval()
+ result = model(input_ids, attention_mask=attention_mask, labels=sequence_labels)
+ self.assertEqual(result.logits.shape, (self.model_tester.batch_size, self.model_tester.num_labels))
+
+ def test_Starcoder2_sequence_classification_model_for_multi_label(self):
+ config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
+ config.num_labels = 3
+ config.problem_type = "multi_label_classification"
+ input_ids = input_dict["input_ids"]
+ attention_mask = input_ids.ne(1).to(torch_device)
+ sequence_labels = ids_tensor(
+ [self.model_tester.batch_size, config.num_labels], self.model_tester.type_sequence_label_size
+ ).to(torch.float)
+ model = Starcoder2ForSequenceClassification(config)
+ model.to(torch_device)
+ model.eval()
+ result = model(input_ids, attention_mask=attention_mask, labels=sequence_labels)
+ self.assertEqual(result.logits.shape, (self.model_tester.batch_size, self.model_tester.num_labels))
+
+ @unittest.skip("Starcoder2 buffers include complex numbers, which breaks this test")
+ def test_save_load_fast_init_from_base(self):
+ pass
+
+    @unittest.skip("Starcoder2 uses GQA on all models so the KV cache is a non-standard format")
+ def test_past_key_values_format(self):
+ pass
+
+ @require_flash_attn
+ @require_torch_gpu
+ @pytest.mark.flash_attn_test
+ @slow
+ def test_flash_attn_2_generate_padding_right(self):
+ import torch
+
+ for model_class in self.all_generative_model_classes:
+ config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+ model = model_class(config)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ model.save_pretrained(tmpdirname)
+ model = model_class.from_pretrained(tmpdirname, torch_dtype=torch.float16, low_cpu_mem_usage=True).to(
+ torch_device
+ )
+
+ dummy_input = torch.LongTensor([[0, 2, 3, 4], [0, 2, 3, 4]]).to(torch_device)
+ dummy_attention_mask = torch.LongTensor([[1, 1, 1, 1], [1, 1, 1, 0]]).to(torch_device)
+
+ model.generate(dummy_input, attention_mask=dummy_attention_mask, max_new_tokens=1, do_sample=False)
+
+ model = model_class.from_pretrained(
+ tmpdirname,
+ torch_dtype=torch.float16,
+ attn_implementation="flash_attention_2",
+ low_cpu_mem_usage=True,
+ ).to(torch_device)
+
+ with self.assertRaises(ValueError):
+ _ = model.generate(
+ dummy_input, attention_mask=dummy_attention_mask, max_new_tokens=1, do_sample=False
+ )
+
+ @require_flash_attn
+ @require_torch_gpu
+ @pytest.mark.flash_attn_test
+ @slow
+ def test_flash_attn_2_generate_use_cache(self):
+ import torch
+
+ max_new_tokens = 30
+
+ for model_class in self.all_generative_model_classes:
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ dummy_input = inputs_dict[model_class.main_input_name]
+ if dummy_input.dtype in [torch.float32, torch.bfloat16]:
+ dummy_input = dummy_input.to(torch.float16)
+
+ # make sure that all models have enough positions for generation
+ if hasattr(config, "max_position_embeddings"):
+ config.max_position_embeddings = max_new_tokens + dummy_input.shape[1] + 1
+
+ model = model_class(config)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ model.save_pretrained(tmpdirname)
+
+ dummy_attention_mask = inputs_dict.get("attention_mask", torch.ones_like(dummy_input))
+ # NOTE: Starcoder2 apparently does not support right padding + use_cache with FA2.
+ dummy_attention_mask[:, -1] = 1
+
+ model = model_class.from_pretrained(
+ tmpdirname,
+ torch_dtype=torch.float16,
+ attn_implementation="flash_attention_2",
+ low_cpu_mem_usage=True,
+ ).to(torch_device)
+
+ # Just test that a large cache works as expected
+ _ = model.generate(
+ dummy_input,
+ attention_mask=dummy_attention_mask,
+ max_new_tokens=max_new_tokens,
+ do_sample=False,
+ use_cache=True,
+ )
+
+ @require_flash_attn
+ @require_torch_gpu
+ @pytest.mark.flash_attn_test
+ @slow
+ def test_flash_attn_2_inference_padding_right(self):
+ self.skipTest("Starcoder2 flash attention does not support right padding")
+
+
+@slow
+@require_torch_gpu
+class Starcoder2IntegrationTest(unittest.TestCase):
+ def test_starcoder2_batched_generation_sdpa(self):
+ EXPECTED_TEXT = [
+ "Hello my name is Younes and I am a student at the University of Liverpool. I am currently studying for my MSc in Computer Science. I am interested in the field of Machine Learning and I am currently working on",
+ "def hello_world():\n\treturn 'Hello World!'\n\n@app.route('/hello/')\ndef hello_name(name):\n\treturn 'Hello %s!' % name\n\n@app",
+ ]
+ model_id = "bigcode/starcoder2-7b_16k"
+
+ model = Starcoder2ForCausalLM.from_pretrained(
+ model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.pad_token = tokenizer.eos_token
+
+ text = ["Hello my name is Younes and", "def hello_world():"]
+ inputs = tokenizer(text, return_tensors="pt", padding=True).to(torch_device)
+
+ output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
+ output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
+ self.assertEqual(EXPECTED_TEXT, output_text)
+
+ def test_starcoder2_batched_generation_eager(self):
+ EXPECTED_TEXT = [
+ "Hello my name is Younes and I am a student at the University of Liverpool. I am currently studying for my MSc in Computer Science. I am interested in the field of Machine Learning and I am currently working on",
+ "def hello_world():\n\treturn 'Hello World!'\n\n@app.route('/hello/')\ndef hello_name(name):\n\treturn 'Hello %s!' % name\n\n@app",
+ ]
+ model_id = "bigcode/starcoder2-7b_16k"
+
+ model = Starcoder2ForCausalLM.from_pretrained(
+ model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="eager"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.pad_token = tokenizer.eos_token
+
+ text = ["Hello my name is Younes and", "def hello_world():"]
+ inputs = tokenizer(text, return_tensors="pt", padding=True).to(torch_device)
+
+ output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
+ output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
+ self.assertEqual(EXPECTED_TEXT, output_text)
+
+ @require_flash_attn
+ def test_starcoder2_batched_generation_fa2(self):
+ EXPECTED_TEXT = [
+ "Hello my name is Younes and I am a student at the University of Liverpool. I am currently studying for my MSc in Computer Science. I am interested in the field of Machine Learning and I am currently working on",
+ "def hello_world():\n\treturn 'Hello World!'\n\n@app.route('/hello/')\ndef hello_name(name):\n\treturn 'Hello %s!' % name\n\n@app",
+ ]
+ model_id = "bigcode/starcoder2-7b_16k"
+
+ model = Starcoder2ForCausalLM.from_pretrained(
+ model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="flash_attention_2"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.pad_token = tokenizer.eos_token
+
+ text = ["Hello my name is Younes and", "def hello_world():"]
+ inputs = tokenizer(text, return_tensors="pt", padding=True).to(torch_device)
+
+ output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
+ output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
+ self.assertEqual(EXPECTED_TEXT, output_text)
+
+ @require_bitsandbytes
+ def test_starcoder2_batched_generation_4bit(self):
+ EXPECTED_TEXT = [
+ 'Hello my name is Younes and I am a student at the University of Maryland. I am currently working on a project that is related to the topic of "How to make a game". I am currently working on a project',
+ 'def hello_world():\n\treturn "Hello World"\n\n@app.route(\'/hello/\')\ndef hello_name(name):\n\treturn "Hello " + name\n\n@app.route',
+ ]
+ model_id = "bigcode/starcoder2-7b_16k"
+
+ model = Starcoder2ForCausalLM.from_pretrained(model_id, load_in_4bit=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.pad_token = tokenizer.eos_token
+
+ text = ["Hello my name is Younes and", "def hello_world():"]
+ inputs = tokenizer(text, return_tensors="pt", padding=True).to(torch_device)
+
+ output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
+ output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
+ self.assertEqual(EXPECTED_TEXT, output_text)
diff --git a/utils/not_doctested.txt b/utils/not_doctested.txt
index b2f644ccb7a347..daf47b1cb1caec 100644
--- a/utils/not_doctested.txt
+++ b/utils/not_doctested.txt
@@ -809,6 +809,7 @@ src/transformers/models/splinter/configuration_splinter.py
src/transformers/models/splinter/modeling_splinter.py
src/transformers/models/squeezebert/modeling_squeezebert.py
src/transformers/models/stablelm/modeling_stablelm.py
+src/transformers/models/starcoder2/modeling_starcoder2.py
src/transformers/models/swiftformer/configuration_swiftformer.py
src/transformers/models/swiftformer/convert_swiftformer_original_to_hf.py
src/transformers/models/swiftformer/modeling_swiftformer.py
From bd5b9863060c31f60d66b6aec88b9743d3dcd8f4 Mon Sep 17 00:00:00 2001
From: Jared Van Bortel
Date: Tue, 27 Feb 2024 21:10:36 -0500
Subject: [PATCH 035/549] simplify get_class_in_module and fix for paths
containing a dot (#29262)
---
src/transformers/dynamic_module_utils.py | 25 +++++-------------------
1 file changed, 5 insertions(+), 20 deletions(-)
diff --git a/src/transformers/dynamic_module_utils.py b/src/transformers/dynamic_module_utils.py
index 34486bb74632d6..72fd0367a0f401 100644
--- a/src/transformers/dynamic_module_utils.py
+++ b/src/transformers/dynamic_module_utils.py
@@ -185,35 +185,20 @@ def check_imports(filename: Union[str, os.PathLike]) -> List[str]:
return get_relative_imports(filename)
-def get_class_in_module(repo_id: str, class_name: str, module_path: Union[str, os.PathLike]) -> typing.Type:
+def get_class_in_module(class_name: str, module_path: Union[str, os.PathLike]) -> typing.Type:
"""
Import a module on the cache directory for modules and extract a class from it.
Args:
- repo_id (`str`): The repo containing the module. Used for path manipulation.
class_name (`str`): The name of the class to import.
module_path (`str` or `os.PathLike`): The path to the module to import.
-
Returns:
`typing.Type`: The class looked for.
"""
- module_path = module_path.replace(os.path.sep, ".")
- try:
- module = importlib.import_module(module_path)
- except ModuleNotFoundError as e:
- # This can happen when the repo id contains ".", which Python's import machinery interprets as a directory
- # separator. We do a bit of monkey patching to detect and fix this case.
- if not (
- "." in repo_id
- and module_path.startswith("transformers_modules")
- and repo_id.replace("/", ".") in module_path
- ):
- raise e # We can't figure this one out, just reraise the original error
- corrected_path = os.path.join(HF_MODULES_CACHE, module_path.replace(".", "/")) + ".py"
- corrected_path = corrected_path.replace(repo_id.replace(".", "/"), repo_id)
- module = importlib.machinery.SourceFileLoader(module_path, corrected_path).load_module()
-
+ name = os.path.normpath(module_path).replace(".py", "").replace(os.path.sep, ".")
+ module_path = str(Path(HF_MODULES_CACHE) / module_path)
+ module = importlib.machinery.SourceFileLoader(name, module_path).load_module()
return getattr(module, class_name)
@@ -513,7 +498,7 @@ def get_class_from_dynamic_module(
local_files_only=local_files_only,
repo_type=repo_type,
)
- return get_class_in_module(repo_id, class_name, final_module.replace(".py", ""))
+ return get_class_in_module(class_name, final_module)
def custom_object_save(obj: Any, folder: Union[str, os.PathLike], config: Optional[Dict] = None) -> List[str]:
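The simplified `get_class_in_module` above derives a dotted module name from the file path and loads the file directly with `SourceFileLoader`, so a repo id containing `.` no longer confuses Python's import machinery. A standalone sketch of that normalization, with a made-up cache location and module path:

```python
import os
from pathlib import Path

# Made-up cache location and module path, for illustration only.
HF_MODULES_CACHE = "/tmp/hf_modules"
module_path = "transformers_modules/some-user/repo.v1.0/modeling_custom.py"

# Same normalization as the patched get_class_in_module: the dotted name is
# only a label, the file is loaded from its explicit path, so the dots in
# "repo.v1.0" are never interpreted as package separators.
name = os.path.normpath(module_path).replace(".py", "").replace(os.path.sep, ".")
full_path = str(Path(HF_MODULES_CACHE) / module_path)

print(name)       # transformers_modules.some-user.repo.v1.0.modeling_custom
print(full_path)  # /tmp/hf_modules/transformers_modules/some-user/repo.v1.0/modeling_custom.py
```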
From ad00c482c7fe9437c93bbc6be5a4a428c3219b5c Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Wed, 28 Feb 2024 06:25:23 +0100
Subject: [PATCH 036/549] FIX [`Gemma` / `CI`] Make sure our runners have
access to the model (#29242)
* put hf token in gemma tests
* update suggestion
* add to flax
* revert
* fix
* fixup
* forward contrib credits from discussion
---------
Co-authored-by: ArthurZucker
---
src/transformers/testing_utils.py | 16 ++++++++++++++++
tests/models/gemma/test_modeling_flax_gemma.py | 5 ++---
tests/models/gemma/test_modeling_gemma.py | 3 ++-
3 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py
index ca4b0db8b8cc0b..e1415a4cc620ac 100644
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -31,12 +31,14 @@
import unittest
from collections import defaultdict
from collections.abc import Mapping
+from functools import wraps
from io import StringIO
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Union
from unittest import mock
from unittest.mock import patch
+import huggingface_hub
import urllib3
from transformers import logging as transformers_logging
@@ -460,6 +462,20 @@ def require_torch_sdpa(test_case):
return unittest.skipUnless(is_torch_sdpa_available(), "test requires PyTorch SDPA")(test_case)
+def require_read_token(fn):
+ """
+ A decorator that loads the HF token for tests that require to load gated models.
+ """
+ token = os.getenv("HF_HUB_READ_TOKEN", None)
+
+ @wraps(fn)
+ def _inner(*args, **kwargs):
+ with patch(huggingface_hub.utils._headers, "get_token", return_value=token):
+ return fn(*args, **kwargs)
+
+ return _inner
+
+
def require_peft(test_case):
"""
Decorator marking a test that requires PEFT.
diff --git a/tests/models/gemma/test_modeling_flax_gemma.py b/tests/models/gemma/test_modeling_flax_gemma.py
index 515ec1837dbbf4..0f3c5df4f13622 100644
--- a/tests/models/gemma/test_modeling_flax_gemma.py
+++ b/tests/models/gemma/test_modeling_flax_gemma.py
@@ -11,14 +11,12 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-
-
import unittest
import numpy as np
from transformers import AutoTokenizer, GemmaConfig, is_flax_available
-from transformers.testing_utils import require_flax, slow
+from transformers.testing_utils import require_flax, require_read_token, slow
from ...generation.test_flax_utils import FlaxGenerationTesterMixin
from ...test_modeling_flax_common import FlaxModelTesterMixin, ids_tensor
@@ -205,6 +203,7 @@ def test_model_from_pretrained(self):
@slow
@require_flax
+@require_read_token
class FlaxGemmaIntegrationTest(unittest.TestCase):
input_text = ["The capital of France is", "To play the perfect cover drive"]
model_id = "google/gemma-2b"
diff --git a/tests/models/gemma/test_modeling_gemma.py b/tests/models/gemma/test_modeling_gemma.py
index 670519d2a17f7b..6385e4cbf5a809 100644
--- a/tests/models/gemma/test_modeling_gemma.py
+++ b/tests/models/gemma/test_modeling_gemma.py
@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch Gemma model. """
-
import tempfile
import unittest
@@ -24,6 +23,7 @@
from transformers.testing_utils import (
require_bitsandbytes,
require_flash_attn,
+ require_read_token,
require_torch,
require_torch_gpu,
require_torch_sdpa,
@@ -529,6 +529,7 @@ def test_flash_attn_2_equivalence(self):
@require_torch_gpu
@slow
+@require_read_token
class GemmaIntegrationTest(unittest.TestCase):
input_text = ["Hello I am doing", "Hi today"]
From e715c78c66a1089f66f98a412205f54d6dd4cb53 Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Wed, 28 Feb 2024 09:38:44 +0100
Subject: [PATCH 037/549] Remove numpy usage from owlvit (#29326)
* remove numpy usage from owlvit
* fix init owlv2
* style
---
src/transformers/models/owlv2/modeling_owlv2.py | 17 +++++++++--------
.../models/owlvit/modeling_owlvit.py | 17 +++++++++--------
2 files changed, 18 insertions(+), 16 deletions(-)
diff --git a/src/transformers/models/owlv2/modeling_owlv2.py b/src/transformers/models/owlv2/modeling_owlv2.py
index 5146fbb89dcee6..e538d2b4d4081f 100644
--- a/src/transformers/models/owlv2/modeling_owlv2.py
+++ b/src/transformers/models/owlv2/modeling_owlv2.py
@@ -1311,6 +1311,8 @@ def __init__(self, config: Owlv2Config):
self.layer_norm = nn.LayerNorm(config.vision_config.hidden_size, eps=config.vision_config.layer_norm_eps)
self.sigmoid = nn.Sigmoid()
+ self.sqrt_num_patches = config.vision_config.image_size // config.vision_config.patch_size
+
# Copied from transformers.models.owlvit.modeling_owlvit.OwlViTForObjectDetection.normalize_grid_corner_coordinates
def normalize_grid_corner_coordinates(self, feature_map: torch.FloatTensor):
# Computes normalized xy corner coordinates from feature_map.
@@ -1320,6 +1322,7 @@ def normalize_grid_corner_coordinates(self, feature_map: torch.FloatTensor):
device = feature_map.device
num_patches = feature_map.shape[1]
+ # TODO: Remove numpy usage.
box_coordinates = np.stack(
np.meshgrid(np.arange(1, num_patches + 1), np.arange(1, num_patches + 1)), axis=-1
).astype(np.float32)
@@ -1432,8 +1435,7 @@ def image_text_embedder(
image_embeds = self.owlv2.vision_model.post_layernorm(last_hidden_state)
# Resize class token
- new_size = tuple(np.array(image_embeds.shape) - np.array((0, 1, 0)))
- class_token_out = torch.broadcast_to(image_embeds[:, :1, :], new_size)
+ class_token_out = torch.broadcast_to(image_embeds[:, :1, :], image_embeds[:, :-1].shape)
# Merge image embedding with class tokens
image_embeds = image_embeds[:, 1:, :] * class_token_out
@@ -1442,8 +1444,8 @@ def image_text_embedder(
# Resize to [batch_size, num_patches, num_patches, hidden_size]
new_size = (
image_embeds.shape[0],
- int(np.sqrt(image_embeds.shape[1])),
- int(np.sqrt(image_embeds.shape[1])),
+ self.sqrt_num_patches,
+ self.sqrt_num_patches,
image_embeds.shape[-1],
)
image_embeds = image_embeds.reshape(new_size)
@@ -1466,8 +1468,7 @@ def image_embedder(
image_embeds = self.owlv2.vision_model.post_layernorm(last_hidden_state)
# Resize class token
- new_size = tuple(np.array(image_embeds.shape) - np.array((0, 1, 0)))
- class_token_out = torch.broadcast_to(image_embeds[:, :1, :], new_size)
+ class_token_out = torch.broadcast_to(image_embeds[:, :1, :], image_embeds[:, :-1].shape)
# Merge image embedding with class tokens
image_embeds = image_embeds[:, 1:, :] * class_token_out
@@ -1476,8 +1477,8 @@ def image_embedder(
# Resize to [batch_size, num_patches, num_patches, hidden_size]
new_size = (
image_embeds.shape[0],
- int(np.sqrt(image_embeds.shape[1])),
- int(np.sqrt(image_embeds.shape[1])),
+ self.sqrt_num_patches,
+ self.sqrt_num_patches,
image_embeds.shape[-1],
)
image_embeds = image_embeds.reshape(new_size)
diff --git a/src/transformers/models/owlvit/modeling_owlvit.py b/src/transformers/models/owlvit/modeling_owlvit.py
index b8e8a36fec777c..a06610a643bb36 100644
--- a/src/transformers/models/owlvit/modeling_owlvit.py
+++ b/src/transformers/models/owlvit/modeling_owlvit.py
@@ -1292,6 +1292,8 @@ def __init__(self, config: OwlViTConfig):
self.layer_norm = nn.LayerNorm(config.vision_config.hidden_size, eps=config.vision_config.layer_norm_eps)
self.sigmoid = nn.Sigmoid()
+ self.sqrt_num_patches = config.vision_config.image_size // config.vision_config.patch_size
+
def normalize_grid_corner_coordinates(self, feature_map: torch.FloatTensor):
# Computes normalized xy corner coordinates from feature_map.
if not feature_map.ndim == 4:
@@ -1300,6 +1302,7 @@ def normalize_grid_corner_coordinates(self, feature_map: torch.FloatTensor):
device = feature_map.device
num_patches = feature_map.shape[1]
+ # TODO: Remove numpy usage.
box_coordinates = np.stack(
np.meshgrid(np.arange(1, num_patches + 1), np.arange(1, num_patches + 1)), axis=-1
).astype(np.float32)
@@ -1394,8 +1397,7 @@ def image_text_embedder(
image_embeds = self.owlvit.vision_model.post_layernorm(last_hidden_state)
# Resize class token
- new_size = tuple(np.array(image_embeds.shape) - np.array((0, 1, 0)))
- class_token_out = torch.broadcast_to(image_embeds[:, :1, :], new_size)
+ class_token_out = torch.broadcast_to(image_embeds[:, :1, :], image_embeds[:, :-1].shape)
# Merge image embedding with class tokens
image_embeds = image_embeds[:, 1:, :] * class_token_out
@@ -1404,8 +1406,8 @@ def image_text_embedder(
# Resize to [batch_size, num_patches, num_patches, hidden_size]
new_size = (
image_embeds.shape[0],
- int(np.sqrt(image_embeds.shape[1])),
- int(np.sqrt(image_embeds.shape[1])),
+ self.sqrt_num_patches,
+ self.sqrt_num_patches,
image_embeds.shape[-1],
)
image_embeds = image_embeds.reshape(new_size)
@@ -1427,8 +1429,7 @@ def image_embedder(
image_embeds = self.owlvit.vision_model.post_layernorm(last_hidden_state)
# Resize class token
- new_size = tuple(np.array(image_embeds.shape) - np.array((0, 1, 0)))
- class_token_out = torch.broadcast_to(image_embeds[:, :1, :], new_size)
+ class_token_out = torch.broadcast_to(image_embeds[:, :1, :], image_embeds[:, :-1].shape)
# Merge image embedding with class tokens
image_embeds = image_embeds[:, 1:, :] * class_token_out
@@ -1437,8 +1438,8 @@ def image_embedder(
# Resize to [batch_size, num_patches, num_patches, hidden_size]
new_size = (
image_embeds.shape[0],
- int(np.sqrt(image_embeds.shape[1])),
- int(np.sqrt(image_embeds.shape[1])),
+ self.sqrt_num_patches,
+ self.sqrt_num_patches,
image_embeds.shape[-1],
)
image_embeds = image_embeds.reshape(new_size)
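Both object-detection heads now reshape to a grid side length computed once from the config (`image_size // patch_size`) instead of taking `np.sqrt` of the runtime patch count; the two are equivalent because the number of patch tokens is always a perfect square. A quick numeric check with toy sizes, not the real config values:

```python
import math

# Toy sizes standing in for config.vision_config values.
image_size, patch_size = 768, 16
sqrt_num_patches = image_size // patch_size   # computed once in __init__
num_patches = sqrt_num_patches**2             # patch tokens (class token excluded)

# The old code recovered the grid side from the runtime shape with np.sqrt;
# both routes give the same side length because num_patches is a perfect square.
assert math.isqrt(num_patches) == sqrt_num_patches

# Reshape target for the (batch, num_patches, hidden) feature map, toy dims:
new_size = (1, sqrt_num_patches, sqrt_num_patches, 1024)
print(new_size)  # (1, 48, 48, 1024)
```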
From a52888524d488ddd8fb022cdf3b9ce5ca03ee08e Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Wed, 28 Feb 2024 10:13:57 +0100
Subject: [PATCH 038/549] [`require_read_token`] fix typo (#29345)
fix wrapper
---
src/transformers/testing_utils.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py
index e1415a4cc620ac..0efaf77a98a73c 100644
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -470,7 +470,7 @@ def require_read_token(fn):
@wraps(fn)
def _inner(*args, **kwargs):
- with patch(huggingface_hub.utils._headers, "get_token", return_value=token):
+ with patch.object(huggingface_hub.utils._headers, "get_token", return_value=token):
return fn(*args, **kwargs)
return _inner
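With the corrected `patch.object` call, the behaviour of `require_read_token` can be sketched standalone. Here `hub_utils` is only a stand-in for `huggingface_hub.utils._headers`, so the example stays self-contained:

```python
import os
from functools import wraps
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for huggingface_hub.utils._headers, so the sketch is self-contained.
hub_utils = SimpleNamespace(get_token=lambda: None)


def require_read_token(fn):
    """Run `fn` with the HF_HUB_READ_TOKEN environment variable injected as the hub token."""
    token = os.getenv("HF_HUB_READ_TOKEN", None)

    @wraps(fn)
    def _inner(*args, **kwargs):
        # patch.object swaps hub_utils.get_token only for the duration of the call
        with patch.object(hub_utils, "get_token", return_value=token):
            return fn(*args, **kwargs)

    return _inner


@require_read_token
def dummy_gated_test():
    return hub_utils.get_token()


print(dummy_gated_test())  # the env token inside the test, None outside CI
```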
From 7c87f3577eb799e01a94b5ff0e1aee935d77cc95 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Wed, 28 Feb 2024 10:41:58 +0100
Subject: [PATCH 039/549] [`T5 and Llama Tokenizer`] remove warning (#29346)
* remove warning
* add co-author
* update
---------
Co-authored-by: hiaoxui
---
src/transformers/models/llama/tokenization_llama.py | 4 ++--
.../models/seamless_m4t/tokenization_seamless_m4t.py | 4 ++--
src/transformers/models/t5/tokenization_t5.py | 4 ++--
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/src/transformers/models/llama/tokenization_llama.py b/src/transformers/models/llama/tokenization_llama.py
index 14c6a3dcd536e4..2f8997274ce758 100644
--- a/src/transformers/models/llama/tokenization_llama.py
+++ b/src/transformers/models/llama/tokenization_llama.py
@@ -243,7 +243,7 @@ def get_vocab(self):
return vocab
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
- def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> List[str]:
+ def tokenize(self, text: "TextInput", **kwargs) -> List[str]:
"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
@@ -255,7 +255,7 @@ def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> Lis
if self.add_prefix_space:
text = SPIECE_UNDERLINE + text
- tokens = super().tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
+ tokens = super().tokenize(text, **kwargs)
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
tokens = tokens[1:]
diff --git a/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py b/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
index afefd6feba117d..99dd1f0955063c 100644
--- a/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
+++ b/src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
@@ -447,7 +447,7 @@ def get_spm_processor(self, from_slow=False):
return tokenizer
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
- def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> List[str]:
+ def tokenize(self, text: "TextInput", **kwargs) -> List[str]:
"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
@@ -459,7 +459,7 @@ def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> Lis
if self.add_prefix_space:
text = SPIECE_UNDERLINE + text
- tokens = super().tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
+ tokens = super().tokenize(text, **kwargs)
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
tokens = tokens[1:]
diff --git a/src/transformers/models/t5/tokenization_t5.py b/src/transformers/models/t5/tokenization_t5.py
index 8d32029857a631..fba83ae9203eeb 100644
--- a/src/transformers/models/t5/tokenization_t5.py
+++ b/src/transformers/models/t5/tokenization_t5.py
@@ -377,7 +377,7 @@ def __setstate__(self, d):
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(self.vocab_file)
- def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> List[str]:
+ def tokenize(self, text: "TextInput", **kwargs) -> List[str]:
"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
@@ -389,7 +389,7 @@ def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> Lis
if self.add_prefix_space:
text = SPIECE_UNDERLINE + text
- tokens = super().tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
+ tokens = super().tokenize(text, **kwargs)
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
tokens = tokens[1:]
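The part of `tokenize` that survives in all three tokenizers only post-processes the piece list: the sentencepiece underline that `add_prefix_space` prepends is stripped again when it lands directly in front of a special token. A standalone illustration on a hand-made token list (no sentencepiece model needed):

```python
SPIECE_UNDERLINE = "▁"

# Hand-made pieces: pretend the underlying tokenizer returned these and that
# "<s>" is registered as a special token.
all_special_tokens = ["<s>", "</s>"]
tokens = [SPIECE_UNDERLINE, "<s>", SPIECE_UNDERLINE + "Hello", SPIECE_UNDERLINE + "world"]

# Same guard as in tokenize(): a bare underline sitting right before a special
# token is an artifact of add_prefix_space and gets dropped.
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in all_special_tokens:
    tokens = tokens[1:]

print(tokens)  # ['<s>', '▁Hello', '▁world']
```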
From 8a8a0a4ae09572681d6429588d93da4982656d06 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Wed, 28 Feb 2024 10:45:53 +0100
Subject: [PATCH 040/549] [`Llama ROPE`] Fix torch export but also slow downs
in forward (#29198)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* remove control flow
* update gptneox
* update ....
* nits
* Actually let's just break. Otherwise we are silently failing which imo is not optimal
* version BC
* fix tests
* fix eager causal
* nit
* add a test
* style
* nits
* nits
* more nits for the test
* update and fix
* make sure cuda graphs are not skipped
* read token is needed for meta llama
* update!
* fixup
* compile test should be slow
* fix the fix-copies
* style 🫠
---
.../models/gpt_neox/modeling_gpt_neox.py | 6 ++-
.../models/llama/modeling_llama.py | 38 +++++++------
tests/models/llama/test_modeling_llama.py | 54 ++++++++++++++++++-
3 files changed, 75 insertions(+), 23 deletions(-)
diff --git a/src/transformers/models/gpt_neox/modeling_gpt_neox.py b/src/transformers/models/gpt_neox/modeling_gpt_neox.py
index 8dd1cde35c7b89..882b4fc9ecc322 100755
--- a/src/transformers/models/gpt_neox/modeling_gpt_neox.py
+++ b/src/transformers/models/gpt_neox/modeling_gpt_neox.py
@@ -563,10 +563,11 @@ def forward(self, x, seq_len=None):
)
+# copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding.__init__
+# TODO @gante bring compatibility back
class GPTNeoXLinearScalingRotaryEmbedding(GPTNeoXRotaryEmbedding):
"""GPTNeoXRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
- # Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding.__init__
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
@@ -586,7 +587,8 @@ def _set_cos_sin_cache(self, seq_len, device, dtype):
class GPTNeoXDynamicNTKScalingRotaryEmbedding(GPTNeoXRotaryEmbedding):
"""GPTNeoXRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
- # Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding.__init__
+ # copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding.__init__
+ # TODO @gante no longer copied from
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 399cfec4ffc7de..1f9ee6bb1a566c 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -92,54 +92,55 @@ def forward(self, hidden_states):
class LlamaRotaryEmbedding(nn.Module):
- def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
super().__init__()
+ self.scaling_factor = scaling_factor
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
+ # For BC we register cos and sin cached
+ self.max_seq_len_cached = max_position_embeddings
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+ t = t / self.scaling_factor
+ freqs = torch.outer(t, self.inv_freq)
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
+ emb = torch.cat((freqs, freqs), dim=-1)
+ self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
+ self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)
@property
def sin_cached(self):
logger.warning_once(
- "The sin_cached attribute will be removed in 4.40. Bear in mind that its contents changed in v4.38. Use "
- "the forward method of RoPE from now on instead."
+ "The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
+ "the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class"
)
return self._sin_cached
@property
def cos_cached(self):
logger.warning_once(
- "The cos_cached attribute will be removed in 4.40. Bear in mind that its contents changed in v4.38. Use "
- "the forward method of RoPE from now on instead."
+ "The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
+ "the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class"
)
return self._cos_cached
def forward(self, x, position_ids, seq_len=None):
if seq_len is not None:
- logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.40.")
+ logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
# x: [bs, num_attention_heads, seq_len, head_size]
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
- cos = emb.cos().to(dtype=x.dtype)
- sin = emb.sin().to(dtype=x.dtype)
- # backwards compatibility
- self._cos_cached = cos
- self._sin_cached = sin
- return cos, sin
+ return emb.cos().to(dtype=x.dtype), emb.sin().to(dtype=x.dtype)
class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
- def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
- self.scaling_factor = scaling_factor
- super().__init__(dim, max_position_embeddings, base, device)
-
def forward(self, x, position_ids, seq_len=None):
# difference to the original RoPE: a scaling factor is applied to the position ids
position_ids = position_ids.float() / self.scaling_factor
@@ -150,10 +151,6 @@ def forward(self, x, position_ids, seq_len=None):
class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
- def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
- self.scaling_factor = scaling_factor
- super().__init__(dim, max_position_embeddings, base, device)
-
def forward(self, x, position_ids, seq_len=None):
# difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
seq_len = torch.max(position_ids) + 1
@@ -367,6 +364,7 @@ def forward(
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
if attention_mask is not None: # no matter the length, we just slice it
+ causal_mask = attention_mask
if cache_position is not None:
causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
diff --git a/tests/models/llama/test_modeling_llama.py b/tests/models/llama/test_modeling_llama.py
index a393950232f306..308e5d91195215 100644
--- a/tests/models/llama/test_modeling_llama.py
+++ b/tests/models/llama/test_modeling_llama.py
@@ -20,10 +20,12 @@
import pytest
from parameterized import parameterized
-from transformers import LlamaConfig, is_torch_available, set_seed
+from transformers import LlamaConfig, StaticCache, is_torch_available, logging, set_seed
from transformers.testing_utils import (
+ CaptureLogger,
require_bitsandbytes,
require_flash_attn,
+ require_read_token,
require_torch,
require_torch_accelerator,
require_torch_gpu,
@@ -595,6 +597,56 @@ def test_model_13b_greedy_generation(self):
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, text)
+ @slow
+ @require_torch_gpu
+ @require_read_token
+ def test_compile_static_cache(self):
+ NUM_TOKENS_TO_GENERATE = 40
+ EXPECTED_TEXT_COMPLETION = [
+ "Simply put, the theory of relativity states that 1) the speed of light is constant, 2) the speed of light is the same for all observers, and 3) the laws of physics are the same for all observers.",
+ "My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p",
+ ]
+ prompts = [
+ "Simply put, the theory of relativity states that ",
+ "My favorite all time favorite condiment is ketchup.",
+ ]
+        tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token="</s>", padding_side="right")
+ model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential")
+ inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
+
+ def decode_one_tokens(model, cur_token, input_pos, cache_position):
+ logits = model(
+ cur_token, position_ids=input_pos, cache_position=cache_position, return_dict=False, use_cache=True
+ )[0]
+ new_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
+ return new_token
+
+ batch_size, seq_length = inputs["input_ids"].shape
+ with torch.no_grad():
+ model._setup_cache(StaticCache, 2, max_cache_len=4096)
+ cache_position = torch.arange(seq_length, device=torch_device)
+ generated_ids = torch.zeros(
+ batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device
+ )
+ generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int)
+
+ logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
+ next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
+ generated_ids[:, seq_length] = next_token[:, 0]
+
+ decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
+ cache_position = torch.tensor([seq_length + 1], device=torch_device)
+ for _ in range(1, NUM_TOKENS_TO_GENERATE):
+ with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
+ with CaptureLogger(logging.get_logger(__name__)) as cl:
+ next_token = decode_one_tokens(model, next_token.clone(), None, cache_position)
+ self.assertNotIn("skipping cudagraphs due to", cl.out)
+ generated_ids[:, cache_position] = next_token.int()
+ cache_position += 1
+
+ text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+ self.assertEqual(EXPECTED_TEXT_COMPLETION, text)
+
@require_torch
class CodeLlamaIntegrationTest(unittest.TestCase):
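The new `test_compile_static_cache` above exercises token-by-token decoding with a `StaticCache` and a `torch.compile`d decode step. Below is a minimal sketch of the same pattern outside the test suite; it relies on the private `_setup_cache` helper exactly as the test does, the checkpoint name and generation length are illustrative, and the API may change in later releases.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Simply put, the theory of relativity states that ", return_tensors="pt").to(model.device)
batch_size, seq_length = inputs["input_ids"].shape


def decode_one_token(model, cur_token, cache_position):
    # Single-token forward pass; with a static cache the shapes are fixed,
    # so torch.compile can capture the step with fullgraph=True.
    logits = model(cur_token, cache_position=cache_position, return_dict=False, use_cache=True)[0]
    return torch.argmax(logits[:, -1], dim=-1)[:, None]


with torch.no_grad():
    model._setup_cache(StaticCache, batch_size, max_cache_len=4096)  # private helper, as in the test above

    # Prefill the prompt, then pick the first new token greedily.
    cache_position = torch.arange(seq_length, device=model.device)
    logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
    next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]

    # Compile the single-token step and decode, advancing cache_position by one each iteration.
    decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
    cache_position = torch.tensor([seq_length], device=model.device)
    generated = [next_token]
    for _ in range(39):
        next_token = decode_one_token(model, next_token.clone(), cache_position)
        generated.append(next_token)
        cache_position += 1

print(tokenizer.decode(torch.cat([inputs["input_ids"], *generated], dim=-1)[0], skip_special_tokens=True))
```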
From 2ce56d35f6054cd844980ed4265ca3289bb56e0d Mon Sep 17 00:00:00 2001
From: Leonardo Emili
Date: Wed, 28 Feb 2024 11:16:15 +0100
Subject: [PATCH 041/549] Disable Mixtral `output_router_logits` during
inference (#29249)
* Set output_router_logits=False in prepare_inputs_for_generation for mixtral
* Add output_router_logits=False to prepare_inputs_for_generation for mixtral
* Fix style
---
src/transformers/models/mixtral/modeling_mixtral.py | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/transformers/models/mixtral/modeling_mixtral.py b/src/transformers/models/mixtral/modeling_mixtral.py
index 674ace5f236039..01ea7282d780b7 100644
--- a/src/transformers/models/mixtral/modeling_mixtral.py
+++ b/src/transformers/models/mixtral/modeling_mixtral.py
@@ -1415,7 +1415,13 @@ def forward(
)
def prepare_inputs_for_generation(
- self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+ self,
+ input_ids,
+ past_key_values=None,
+ attention_mask=None,
+ inputs_embeds=None,
+ output_router_logits=False,
+ **kwargs,
):
# Omit tokens covered by past_key_values
if past_key_values is not None:
@@ -1467,6 +1473,7 @@ def prepare_inputs_for_generation(
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
+ "output_router_logits": output_router_logits,
}
)
return model_inputs
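The router logits are only needed to compute Mixtral's auxiliary load-balancing loss during training; returning them from every MoE layer at each decoding step just accumulates memory. A hedged sketch of the resulting behaviour, with an illustrative checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("[INST] Explain mixture-of-experts routing briefly. [/INST]", return_tensors="pt").to(model.device)

# Even if the config enables router logits for training, prepare_inputs_for_generation
# now passes output_router_logits=False, so generate() no longer carries them around.
model.config.output_router_logits = True
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```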
From 7628b3a0f40212c0f264233fc6da0d9c9cf88853 Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Wed, 28 Feb 2024 11:34:54 +0000
Subject: [PATCH 042/549] Idefics: generate fix (#29320)
---
.../models/idefics/modeling_idefics.py | 54 ++++++++-----------
1 file changed, 21 insertions(+), 33 deletions(-)
diff --git a/src/transformers/models/idefics/modeling_idefics.py b/src/transformers/models/idefics/modeling_idefics.py
index bdd915c1bd8d59..eed75b3522b0a9 100644
--- a/src/transformers/models/idefics/modeling_idefics.py
+++ b/src/transformers/models/idefics/modeling_idefics.py
@@ -19,7 +19,7 @@
# limitations under the License.
""" PyTorch Idefics model."""
from dataclasses import dataclass
-from typing import List, Optional, Tuple, Union
+from typing import Any, Dict, List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
@@ -187,35 +187,6 @@ def expand_inputs_for_generation(
return input_ids, model_kwargs
-def update_model_kwargs_for_generation(outputs, model_kwargs):
- # must have this key set to at least None
- if "past_key_values" in outputs:
- model_kwargs["past_key_values"] = outputs.past_key_values
- else:
- model_kwargs["past_key_values"] = None
-
- # update token_type_ids with last value
- if "token_type_ids" in model_kwargs:
- token_type_ids = model_kwargs["token_type_ids"]
- model_kwargs["token_type_ids"] = torch.cat([token_type_ids, token_type_ids[:, -1].unsqueeze(-1)], dim=-1)
-
- # update attention masks
- if "attention_mask" in model_kwargs:
- attention_mask = model_kwargs["attention_mask"]
- model_kwargs["attention_mask"] = torch.cat(
- [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
- )
- if "image_attention_mask" in model_kwargs:
- image_attention_mask = model_kwargs["image_attention_mask"]
- last_mask = image_attention_mask[:, -1, :].unsqueeze(1)
- model_kwargs["image_attention_mask"] = last_mask
-
- # Get the precomputed image_hidden_states
- model_kwargs["image_hidden_states"] = outputs.image_hidden_states
-
- return model_kwargs
-
-
def prepare_inputs_for_generation(input_ids, past_key_values=None, **kwargs):
token_type_ids = kwargs.get("token_type_ids", None)
# only last token for inputs_ids if past is defined in kwargs
@@ -1580,9 +1551,26 @@ def _expand_inputs_for_generation(
):
return expand_inputs_for_generation(*args, **model_kwargs)
- @staticmethod
- def _update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder):
- return update_model_kwargs_for_generation(outputs, model_kwargs)
+ def _update_model_kwargs_for_generation(
+ self,
+ outputs: ModelOutput,
+ model_kwargs: Dict[str, Any],
+ is_encoder_decoder: bool = False,
+ standardize_cache_format: bool = False,
+ model_inputs: Optional[Dict[str, Any]] = None,
+ ) -> Dict[str, Any]:
+ model_kwargs = super()._update_model_kwargs_for_generation(
+ outputs, model_kwargs, is_encoder_decoder, standardize_cache_format, model_inputs
+ )
+
+ if "image_attention_mask" in model_kwargs:
+ image_attention_mask = model_kwargs["image_attention_mask"]
+ last_mask = image_attention_mask[:, -1, :].unsqueeze(1)
+ model_kwargs["image_attention_mask"] = last_mask
+
+ # Get the precomputed image_hidden_states
+ model_kwargs["image_hidden_states"] = outputs.image_hidden_states
+ return model_kwargs
@staticmethod
def _reorder_cache(past, beam_idx):
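The fix replaces Idefics' standalone `update_model_kwargs_for_generation` with an override that defers to `GenerationMixin` and only patches the model-specific inputs afterwards. A hedged sketch of that pattern, using a hypothetical multimodal class name:

```python
from typing import Any, Dict, Optional

from transformers.generation import GenerationMixin
from transformers.utils import ModelOutput


class MyMultimodalForCausalLM(GenerationMixin):  # hypothetical class, for illustration only
    def _update_model_kwargs_for_generation(
        self,
        outputs: ModelOutput,
        model_kwargs: Dict[str, Any],
        is_encoder_decoder: bool = False,
        standardize_cache_format: bool = False,
        model_inputs: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        # Let the base class handle past_key_values, attention_mask, token_type_ids, ...
        # so the model stays in sync with future changes to generate().
        model_kwargs = super()._update_model_kwargs_for_generation(
            outputs, model_kwargs, is_encoder_decoder, standardize_cache_format, model_inputs
        )
        # Only the model-specific kwargs are updated by hand afterwards.
        if "image_attention_mask" in model_kwargs:
            model_kwargs["image_attention_mask"] = model_kwargs["image_attention_mask"][:, -1:, :]
        return model_kwargs
```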
From d3a4b4754440041b20247365f33fb8e44c6d9caf Mon Sep 17 00:00:00 2001
From: Daniel Han
Date: Thu, 29 Feb 2024 01:16:53 +1100
Subject: [PATCH 043/549] RoPE loses precision for Llama / Gemma + Gemma
logits.float() (#29285)
* Update modeling_llama.py
Llama - Force float32 since bfloat16 loses precision on long contexts
* Update modeling_llama.py
* Update modeling_gemma.py
Fix RoPE and logits.float()
* @torch.no_grad()
* @torch.no_grad()
* Cos, Sin to float32
* cos, sin to float32
* Update src/transformers/models/gemma/modeling_gemma.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Resolve PR conflicts
* Fix RoPE for llama
* Revert "Fix RoPE for llama"
This reverts commit b860a22dab9bb01cd15cb9a3220abeaefad3e458.
* Fix RoPE for llama
* RoPE device
* Autocast device type
* RoPE
* RoPE isinstance
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
src/transformers/models/gemma/modeling_gemma.py | 17 ++++++++++++-----
src/transformers/models/llama/modeling_llama.py | 14 +++++++++++---
2 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 4e6e7cd8ab6d35..72e07ea82467af 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -101,18 +101,25 @@ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
self.base = base
self.register_buffer("inv_freq", None, persistent=False)
+ @torch.no_grad()
def forward(self, x, position_ids, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if self.inv_freq is None:
self.inv_freq = 1.0 / (
self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
)
-
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
- freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
- emb = torch.cat((freqs, freqs), dim=-1)
- return emb.cos().to(dtype=x.dtype), emb.sin().to(dtype=x.dtype)
+ # Force float32 since bfloat16 loses precision on long contexts
+ # See https://github.com/huggingface/transformers/pull/29285
+ device_type = x.device.type
+ device_type = device_type if isinstance(device_type, str) else "cpu"
+ with torch.autocast(device_type=device_type, enabled=False):
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+ emb = torch.cat((freqs, freqs), dim=-1)
+ cos = emb.cos()
+ sin = emb.sin()
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
# Copied from transformers.models.llama.modeling_llama.rotate_half
@@ -1082,7 +1089,7 @@ def forward(
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
-
+ logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 1f9ee6bb1a566c..0179f370ca0cf9 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -126,6 +126,7 @@ def cos_cached(self):
)
return self._cos_cached
+ @torch.no_grad()
def forward(self, x, position_ids, seq_len=None):
if seq_len is not None:
logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
@@ -133,9 +134,16 @@ def forward(self, x, position_ids, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
- freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
- emb = torch.cat((freqs, freqs), dim=-1)
- return emb.cos().to(dtype=x.dtype), emb.sin().to(dtype=x.dtype)
+ # Force float32 since bfloat16 loses precision on long contexts
+ # See https://github.com/huggingface/transformers/pull/29285
+ device_type = x.device.type
+ device_type = device_type if isinstance(device_type, str) else "cpu"
+ with torch.autocast(device_type=device_type, enabled=False):
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+ emb = torch.cat((freqs, freqs), dim=-1)
+ cos = emb.cos()
+ sin = emb.sin()
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
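The underlying issue is easy to reproduce in isolation: bfloat16 keeps only about 8 bits of mantissa, so consecutive position ids collide once the context grows, and the RoPE angles computed from them become wrong. A small demonstration (assuming CPU and any recent PyTorch):

```python
import torch

# Around position 8000 the spacing between representable bfloat16 values is 32,
# so sixteen consecutive positions collapse onto the same value.
positions = torch.arange(8000, 8016, dtype=torch.float32)
print(positions.to(torch.bfloat16).to(torch.float32))  # -> tensor([8000., 8000., ..., 8000.])

# The patched forward therefore disables autocast and keeps the matmul in float32,
# only casting cos/sin back to the activation dtype at the very end.
inv_freq = 1.0 / (10000 ** (torch.arange(0, 128, 2, dtype=torch.float32) / 128))
with torch.autocast(device_type="cpu", enabled=False):
    freqs = torch.outer(positions, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    cos, sin = emb.cos(), emb.sin()
print(cos.dtype)  # float32; the model casts to x.dtype when returning
```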
From 554e7ada89a1effba70004babf522b20ca99c739 Mon Sep 17 00:00:00 2001
From: jiqing-feng <107918818+jiqing-feng@users.noreply.github.com>
Date: Wed, 28 Feb 2024 22:56:25 +0800
Subject: [PATCH 044/549] check if position_ids exists before using it (#29306)
Co-authored-by: Joao Gante
---
src/transformers/models/gemma/modeling_gemma.py | 6 ++++--
src/transformers/models/llama/modeling_llama.py | 6 ++++--
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 72e07ea82467af..13265be8f3e1e9 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -1168,7 +1168,9 @@ def prepare_inputs_for_generation(
# TODO @gante we should only keep a `cache_position` in generate, and do +=1.
# same goes for position ids. Could also help with continued generation.
- cache_position = torch.arange(past_length, past_length + position_ids.shape[-1], device=position_ids.device)
+ input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
+ cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
+ position_ids = position_ids.contiguous() if position_ids is not None else None
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
@@ -1181,7 +1183,7 @@ def prepare_inputs_for_generation(
model_inputs.update(
{
- "position_ids": position_ids.contiguous(),
+ "position_ids": position_ids,
"cache_position": cache_position,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 0179f370ca0cf9..254310d2653977 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -1284,7 +1284,9 @@ def prepare_inputs_for_generation(
# TODO @gante we should only keep a `cache_position` in generate, and do +=1.
# same goes for position ids. Could also help with continued generation.
- cache_position = torch.arange(past_length, past_length + position_ids.shape[-1], device=position_ids.device)
+ input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
+ cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
+ position_ids = position_ids.contiguous() if position_ids is not None else None
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
@@ -1297,7 +1299,7 @@ def prepare_inputs_for_generation(
model_inputs.update(
{
- "position_ids": position_ids.contiguous(),
+ "position_ids": position_ids,
"cache_position": cache_position,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
From f54d82cace511b48a2ad5d32ac83abef5a49ee13 Mon Sep 17 00:00:00 2001
From: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Date: Wed, 28 Feb 2024 10:09:25 -0500
Subject: [PATCH 045/549] [CI] Quantization workflow (#29046)
* [CI] Quantization workflow
* build dockerfile
* fix dockerfile
* update self-scheduled.yml
* test build dockerfile on push
* fix torch install
* update to python 3.10
* update aqlm version
* uncomment build dockerfile
* tests if the scheduler works
* fix docker
* do not trigger on push again
* add additional runs
* test again
* all good
* style
* Update .github/workflows/self-scheduled.yml
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* test build dockerfile with torch 2.2.0
* fix extra
* clean
* revert changes
* Revert "revert changes"
This reverts commit 4cb52b8822da9d1786a821a33e867e4fcc00d8fd.
* revert correct change
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
.github/workflows/build-docker-images.yml | 28 ++++++++++
.github/workflows/self-scheduled.yml | 54 ++++++++++++++++++-
.../Dockerfile | 50 +++++++++++++++++
docs/source/en/hf_quantizer.md | 2 +-
utils/notification_service.py | 1 +
5 files changed, 133 insertions(+), 2 deletions(-)
create mode 100644 docker/transformers-quantization-latest-gpu/Dockerfile
diff --git a/.github/workflows/build-docker-images.yml b/.github/workflows/build-docker-images.yml
index 2b198bd4af56c5..6144f8036f96c9 100644
--- a/.github/workflows/build-docker-images.yml
+++ b/.github/workflows/build-docker-images.yml
@@ -297,3 +297,31 @@ jobs:
# REF=main
# push: true
# tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci
+
+ latest-quantization-torch-docker:
+ name: "Latest Pytorch + Quantization [dev]"
+ # Push CI doesn't need this image
+ if: inputs.image_postfix != '-push-ci'
+ runs-on: [intel-cpu, 8-cpu, ci]
+ steps:
+ -
+ name: Set up Docker Buildx
+ uses: docker/setup-buildx-action@v3
+ -
+ name: Check out code
+ uses: actions/checkout@v3
+ -
+ name: Login to DockerHub
+ uses: docker/login-action@v3
+ with:
+ username: ${{ secrets.DOCKERHUB_USERNAME }}
+ password: ${{ secrets.DOCKERHUB_PASSWORD }}
+ -
+ name: Build and push
+ uses: docker/build-push-action@v5
+ with:
+ context: ./docker/transformers-quantization-latest-gpu
+ build-args: |
+ REF=main
+ push: true
+ tags: huggingface/transformers-quantization-latest-gpu${{ inputs.image_postfix }}
\ No newline at end of file
diff --git a/.github/workflows/self-scheduled.yml b/.github/workflows/self-scheduled.yml
index c3c77925bbe734..465c00dd13bbcd 100644
--- a/.github/workflows/self-scheduled.yml
+++ b/.github/workflows/self-scheduled.yml
@@ -297,6 +297,56 @@ jobs:
name: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports
path: /workspace/transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu
+ run_tests_quantization_torch_gpu:
+ name: Quantization tests
+ strategy:
+ fail-fast: false
+ matrix:
+ machine_type: [single-gpu, multi-gpu]
+ runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, daily-ci]
+ container:
+ image: huggingface/transformers-quantization-latest-gpu
+ options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+ needs: setup
+ steps:
+ - name: Update clone
+ working-directory: /transformers
+ run: git fetch && git checkout ${{ github.sha }}
+
+ - name: Reinstall transformers in edit mode (remove the one installed during docker image build)
+ working-directory: /transformers
+ run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
+
+ - name: NVIDIA-SMI
+ run: |
+ nvidia-smi
+
+ - name: Environment
+ working-directory: /transformers
+ run: |
+ python3 utils/print_env.py
+
+ - name: Show installed libraries and their versions
+ working-directory: /transformers
+ run: pip freeze
+
+ - name: Run quantization tests on GPU
+ working-directory: /transformers
+ run: |
+ python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_quantization_torch_gpu tests/quantization
+
+ - name: Failure short reports
+ if: ${{ failure() }}
+ continue-on-error: true
+ run: cat /transformers/reports/${{ matrix.machine_type }}_tests_quantization_torch_gpu/failures_short.txt
+
+ - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_quantization_torch_gpu"
+ if: ${{ always() }}
+ uses: actions/upload-artifact@v3
+ with:
+ name: ${{ matrix.machine_type }}_run_tests_quantization_torch_gpu
+ path: /transformers/reports/${{ matrix.machine_type }}_tests_quantization_torch_gpu
+
run_extract_warnings:
name: Extract warnings in CI artifacts
runs-on: ubuntu-22.04
@@ -307,7 +357,8 @@ jobs:
run_examples_gpu,
run_pipelines_tf_gpu,
run_pipelines_torch_gpu,
- run_all_tests_torch_cuda_extensions_gpu
+ run_all_tests_torch_cuda_extensions_gpu,
+ run_tests_quantization_torch_gpu,
]
steps:
- name: Checkout transformers
@@ -355,6 +406,7 @@ jobs:
run_pipelines_tf_gpu,
run_pipelines_torch_gpu,
run_all_tests_torch_cuda_extensions_gpu,
+ run_tests_quantization_torch_gpu,
run_extract_warnings
]
steps:
diff --git a/docker/transformers-quantization-latest-gpu/Dockerfile b/docker/transformers-quantization-latest-gpu/Dockerfile
new file mode 100644
index 00000000000000..66bdcc42bae9fd
--- /dev/null
+++ b/docker/transformers-quantization-latest-gpu/Dockerfile
@@ -0,0 +1,50 @@
+FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
+LABEL maintainer="Hugging Face"
+
+ARG DEBIAN_FRONTEND=noninteractive
+
+# Use login shell to read variables from `~/.profile` (to pass dynamically created variables between RUN commands)
+SHELL ["sh", "-lc"]
+
+# The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
+# to be used as arguments for docker build (so far).
+
+ARG PYTORCH='2.2.0'
+# Example: `cu102`, `cu113`, etc.
+ARG CUDA='cu118'
+
+RUN apt update
+RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python python3-pip ffmpeg
+RUN python3 -m pip install --no-cache-dir --upgrade pip
+
+ARG REF=main
+RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
+
+RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
+RUN echo torch=$VERSION
+# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
+# Currently, let's just use their latest releases (when `torch` is installed with a release version)
+RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
+
+RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch]
+
+RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
+
+# Add bitsandbytes for mixed int8 testing
+RUN python3 -m pip install --no-cache-dir bitsandbytes
+
+# Add auto-gptq for gptq quantization testing
+RUN python3 -m pip install --no-cache-dir auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
+
+# Add optimum for gptq quantization testing
+RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
+
+# Add aqlm for quantization testing
+RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
+
+# Add autoawq for quantization testing
+RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl
+
+# When installing in editable mode, `transformers` is not recognized as a package.
+# This line must be added in order for Python to be aware of transformers.
+RUN cd transformers && python3 setup.py develop
\ No newline at end of file
diff --git a/docs/source/en/hf_quantizer.md b/docs/source/en/hf_quantizer.md
index 154cfb54b9ebc8..8261a6bc4585e1 100644
--- a/docs/source/en/hf_quantizer.md
+++ b/docs/source/en/hf_quantizer.md
@@ -66,4 +66,4 @@ For some quantization methods, they may require "pre-quantizing" the models thro
7. Document everything! Make sure your quantization method is documented in the [`docs/source/en/quantization.md`](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/docs/source/en/quantization.md) file.
-8. Add tests! You should add tests by first adding the package in our nightly Dockerfile inside `docker/transformers-all-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is implemented for other quantization methods.
+8. Add tests! You should add tests by first adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is implemented for other quantization methods.
diff --git a/utils/notification_service.py b/utils/notification_service.py
index 39a0fb840cf5ad..d29e6994a232b2 100644
--- a/utils/notification_service.py
+++ b/utils/notification_service.py
@@ -1043,6 +1043,7 @@ def prepare_reports(title, header, reports, to_truncate=True):
"PyTorch pipelines": "run_tests_torch_pipeline_gpu",
"TensorFlow pipelines": "run_tests_tf_pipeline_gpu",
"Torch CUDA extension tests": "run_tests_torch_cuda_extensions_gpu_test_reports",
+ "Quantization tests": "run_tests_quantization_torch_gpu",
}
if ci_event in ["push", "Nightly CI"] or ci_event.startswith("Past CI"):
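For reference, the kind of workload the new `run_tests_quantization_torch_gpu` job exercises looks roughly like the snippet below: loading a checkpoint with bitsandbytes 4-bit quantization on a CUDA GPU. The checkpoint is illustrative, and the same image also ships auto-gptq, optimum, aqlm and autoawq for the other backends covered by `tests/quantization`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-350m"  # illustrative small checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("Quantization keeps the weights in 4-bit while", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```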
From 49204c1d37b807def930fe45f5f84abc370a7200 Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Wed, 28 Feb 2024 16:36:47 +0100
Subject: [PATCH 046/549] Better SDPA unmasking implementation (#29318)
* better unmask implementation
* comment
* typo
* bug report pytorch
* cleanup
* fix import
* add back example
* retrigger ci
* come on
---
src/transformers/modeling_attn_mask_utils.py | 69 ++++---------------
.../models/falcon/modeling_falcon.py | 16 ++---
.../models/gemma/modeling_gemma.py | 11 ++-
.../gpt_bigcode/modeling_gpt_bigcode.py | 39 +++++------
.../models/llama/modeling_llama.py | 11 ++-
5 files changed, 54 insertions(+), 92 deletions(-)
diff --git a/src/transformers/modeling_attn_mask_utils.py b/src/transformers/modeling_attn_mask_utils.py
index 1a2c0db7bb140c..faae0d763f4e59 100755
--- a/src/transformers/modeling_attn_mask_utils.py
+++ b/src/transformers/modeling_attn_mask_utils.py
@@ -187,7 +187,8 @@ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int]
@staticmethod
def _unmask_unattended(
- expanded_mask: torch.Tensor, attention_mask: torch.Tensor, unmasked_value: Union[bool, float]
+ expanded_mask: torch.FloatTensor,
+ min_dtype: float,
):
# fmt: off
"""
@@ -200,13 +201,7 @@ def _unmask_unattended(
The dimension num_masks of `expanded_mask` is most often 1, but it can also be the number of heads in the case of alibi attention bias.
- For example, if `attention_mask` is
- ```
- [[0, 0, 1],
- [1, 1, 1],
- [0, 1, 1]]
- ```
- and `expanded_mask` is (e.g. here left-padding case)
+ For example, if `expanded_mask` is (e.g. here left-padding case)
```
[[[[0, 0, 0],
[0, 0, 0],
@@ -232,47 +227,12 @@ def _unmask_unattended(
```
"""
# fmt: on
+ if expanded_mask.dtype == torch.bool:
+ raise ValueError(
+ "AttentionMaskConverter._unmask_unattended expects a float `expanded_mask`, got a BoolTensor."
+ )
- # Get the index of the first non-zero value for every sample in the batch.
- # In the above example, indices = [[2], [0], [1]]]
- tmp = torch.arange(attention_mask.shape[1], 0, -1)
- indices = torch.argmax(attention_mask.cpu() * tmp, 1, keepdim=True)
-
- # Find the batch indexes that have unattended tokens on the leftmost side (e.g. [0, 0, 1, 1, 1]), for which the first rows of the
- # expanded mask will be completely unattended.
- left_masked_rows = torch.where(indices > 0)[0]
-
- if left_masked_rows.shape[0] == 0:
- return expanded_mask
- indices = indices[left_masked_rows]
-
- max_len = torch.max(indices)
- range_tensor = torch.arange(max_len).unsqueeze(0)
- range_tensor = range_tensor.repeat(indices.size(0), 1)
-
- # Avoid unmasking tokens at relevant target positions (on the row axis), by rather unmasking possibly several times the first row that should always be unmasked as we filtered out the batch above.
- range_tensor[range_tensor >= indices] = 0
-
- # TODO: we may drop support for 3D attention mask as the refactor from Patrick maybe dropped this case
- if expanded_mask.dim() == 4:
- num_masks = expanded_mask.shape[1]
- if num_masks == 1:
- # Broadcast [left_masked_rows, 1], [left_masked_rows, max_len]
- mask_slice = (left_masked_rows[:, None], 0, range_tensor)
- else:
- # Broadcast [left_masked_rows, 1, 1], [1, num_masks, 1], [left_masked_rows, 1, max_len]
- mask_slice = (
- left_masked_rows[:, None, None],
- torch.arange(num_masks)[None, :, None],
- range_tensor[:, None, :],
- )
- else:
- # Broadcast [left_masked_rows, 1], [left_masked_rows, max_len]
- mask_slice = (left_masked_rows[:, None], range_tensor)
-
- expanded_mask[mask_slice] = unmasked_value
-
- return expanded_mask
+ return expanded_mask.mul(~torch.all(expanded_mask == min_dtype, dim=-1, keepdim=True))
def _prepare_4d_causal_attention_mask(
@@ -406,15 +366,12 @@ def _prepare_4d_causal_attention_mask_for_sdpa(
key_value_length=key_value_length,
)
- # From PyTorch 2.1 onwards, F.scaled_dot_product_attention with the memory-efficient attention backend
- # produces nans if sequences are completely unattended in the attention mask. Details: https://github.com/pytorch/pytorch/issues/110213
- #
- # This fix is not applied in case we are tracing with torch.jit.trace or symbolic_trace, as _unmask_unattended has a data-dependent
- # controlflow that can not be captured properly.
- # TODO: _unmask_unattended does not work either with torch.compile when using fullgraph=True. We should find a way to detect this case.
- if query_length > 1 and not is_tracing:
+ # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+ # Details: https://github.com/pytorch/pytorch/issues/110213
+ if not is_tracing and expanded_4d_mask.device.type == "cuda":
expanded_4d_mask = AttentionMaskConverter._unmask_unattended(
- expanded_4d_mask, attention_mask, unmasked_value=0.0
+ expanded_4d_mask, min_dtype=torch.finfo(inputs_embeds.dtype).min
)
return expanded_4d_mask
diff --git a/src/transformers/models/falcon/modeling_falcon.py b/src/transformers/models/falcon/modeling_falcon.py
index 7ef857748ca813..2dde8d1cac67f6 100644
--- a/src/transformers/models/falcon/modeling_falcon.py
+++ b/src/transformers/models/falcon/modeling_falcon.py
@@ -438,9 +438,9 @@ def forward(
else:
present = None
- # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
- # Reference: https://github.com/pytorch/pytorch/issues/112577.
- if query_layer.device.type == "cuda" and attention_mask is not None:
+ if self._use_sdpa and query_layer.device.type == "cuda" and attention_mask is not None:
+ # For torch<=2.1.2, SDPA with memory-efficient backend is bugged with non-contiguous inputs with custom attn_mask,
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
query_layer = query_layer.contiguous()
key_layer = key_layer.contiguous()
value_layer = value_layer.contiguous()
@@ -456,6 +456,7 @@ def forward(
# The query_length > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case query_length == 1.
is_causal=self.is_causal and attention_mask is None and query_length > 1,
)
+
attention_scores = None
else:
attention_scores = query_layer @ key_layer.transpose(-1, -2)
@@ -1112,18 +1113,17 @@ def forward(
if attention_mask_2d is None:
attention_mask = alibi / math.sqrt(self.config.hidden_size // self.num_heads)
else:
+ min_dtype = torch.finfo(alibi.dtype).min
attention_mask = torch.masked_fill(
alibi / math.sqrt(self.config.hidden_size // self.num_heads),
attention_mask < -1,
- torch.finfo(alibi.dtype).min,
+ min_dtype,
)
# From PyTorch 2.1 onwards, F.scaled_dot_product_attention with the memory-efficient attention backend
# produces nans if sequences are completely unattended in the attention mask. Details: https://github.com/pytorch/pytorch/issues/110213
- if seq_length > 1:
- attention_mask = AttentionMaskConverter._unmask_unattended(
- attention_mask, attention_mask_2d, unmasked_value=0.0
- )
+ if seq_length > 1 and attention_mask.device.type == "cuda":
+ attention_mask = AttentionMaskConverter._unmask_unattended(attention_mask, min_dtype=min_dtype)
else:
# PyTorch SDPA does not support head_mask, we fall back on the eager implementation in this case.
attention_mask = _prepare_4d_causal_attention_mask(
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 13265be8f3e1e9..e78ff54be865ea 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -27,6 +27,7 @@
from ...activations import ACT2FN
from ...cache_utils import Cache, DynamicCache, StaticCache
from ...modeling_attn_mask_utils import (
+ AttentionMaskConverter,
_prepare_4d_causal_attention_mask,
)
from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
@@ -978,7 +979,11 @@ def _update_causal_mask(self, attention_mask, input_tensor):
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
- if self.config._attn_implementation == "sdpa" and attention_mask is not None:
+ if (
+ self.config._attn_implementation == "sdpa"
+ and attention_mask is not None
+ and attention_mask.device.type == "cuda"
+ ):
# TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
is_tracing = (
torch.jit.is_tracing()
@@ -986,10 +991,10 @@ def _update_causal_mask(self, attention_mask, input_tensor):
or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
)
if not is_tracing and torch.any(attention_mask != 1):
- # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
- causal_mask = causal_mask.mul(~torch.all(causal_mask == min_dtype, dim=-1, keepdim=True)).to(dtype)
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
return causal_mask
diff --git a/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py b/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
index 0b8a1bbb485517..2ef46eaa9f7322 100644
--- a/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
+++ b/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
@@ -30,6 +30,7 @@
TokenClassifierOutput,
)
from ...modeling_utils import PreTrainedModel
+from ...pytorch_utils import is_torch_greater_or_equal_than_2_2
from ...utils import (
add_code_sample_docstrings,
add_start_docstrings,
@@ -534,21 +535,16 @@ def _attn(self, query, key, value, attention_mask=None, head_mask=None):
key = key.unsqueeze(1)
value = value.unsqueeze(1)
- # Although these expand are not numerically useful, PyTorch 2.1 can not dispatch to memory-efficient backend
+ # Although these expand are not numerically useful, PyTorch can not dispatch to memory-efficient backend
# and flash attention backend (No available kernel. Aborting execution.) from the shapes
# query = [batch_size, num_heads, query_length, head_dim]
# key = [batch_size, 1, past_length, head_dim]
# value = [batch_size, 1, past_length, head_dim]
#
- # so we could do:
- #
- # key = key.expand(-1, self.num_heads, -1, -1)
- # value = value.expand(-1, self.num_heads, -1, -1)
- #
- # However SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
- # so we always dispatch to the math path: https://github.com/pytorch/pytorch/issues/112577.
- # Arguably we could still do expand + contiguous when `query.device.type == "cuda"` in order to dispatch on memory-efficient
- # backend, but it feels very hacky.
+ # torch==2.1.2 is bugged with non-contiguous inputs with custom attn_mask (https://github.com/pytorch/pytorch/issues/112577), hence the check.
+ if is_torch_greater_or_equal_than_2_2:
+ key = key.expand(-1, self.num_heads, -1, -1)
+ value = value.expand(-1, self.num_heads, -1, -1)
else:
query_length = query_shape[-1]
@@ -1020,6 +1016,15 @@ def forward(
self_attention_mask = self_attention_mask.unsqueeze(2 if self.multi_query else 1)
if self._use_sdpa and head_mask is None and not output_attentions:
+ # SDPA with a custom mask is much faster in fp16/fp32 dtype rather than bool. Cast here to floating point instead of at every layer.
+ dtype = self.wte.weight.dtype
+ min_dtype = torch.finfo(dtype).min
+ self_attention_mask = torch.where(
+ self_attention_mask,
+ torch.full([], 0.0, dtype=dtype, device=self_attention_mask.device),
+ torch.full([], min_dtype, dtype=dtype, device=self_attention_mask.device),
+ )
+
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
if self.multi_query:
@@ -1027,23 +1032,13 @@ def forward(
# [batch_size, target_length, 1, source_length], not compatible with SDPA, hence this transpose.
self_attention_mask = self_attention_mask.transpose(1, 2)
- if query_length > 1 and attention_mask is not None:
+ if query_length > 1 and attention_mask is not None and attention_mask.device.type == "cuda":
# From PyTorch 2.1 onwards, F.scaled_dot_product_attention with the memory-efficient attention backend
# produces nans if sequences are completely unattended in the attention mask. Details: https://github.com/pytorch/pytorch/issues/110213
self_attention_mask = AttentionMaskConverter._unmask_unattended(
- self_attention_mask, attention_mask, unmasked_value=True
+ self_attention_mask, min_dtype=min_dtype
)
- # SDPA with a custom mask is much faster in fp16/fp32 dtype rather than bool. Cast here to floating point instead of at every layer.
- dtype = self.wte.weight.dtype
- self_attention_mask = torch.where(
- self_attention_mask,
- torch.full([], 0.0, dtype=dtype, device=self_attention_mask.device),
- torch.full(
- [], torch.finfo(self.wte.weight.dtype).min, dtype=dtype, device=self_attention_mask.device
- ),
- )
-
attention_mask = self_attention_mask
# If a 2D or 3D attention mask is provided for the cross-attention
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 254310d2653977..4ea8a208a92315 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -30,6 +30,7 @@
from ...activations import ACT2FN
from ...cache_utils import Cache, DynamicCache, StaticCache
+from ...modeling_attn_mask_utils import AttentionMaskConverter
from ...modeling_outputs import (
BaseModelOutputWithPast,
CausalLMOutputWithPast,
@@ -1090,7 +1091,11 @@ def _update_causal_mask(self, attention_mask, input_tensor):
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
- if self.config._attn_implementation == "sdpa" and attention_mask is not None:
+ if (
+ self.config._attn_implementation == "sdpa"
+ and attention_mask is not None
+ and attention_mask.device.type == "cuda"
+ ):
# TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
is_tracing = (
torch.jit.is_tracing()
@@ -1098,10 +1103,10 @@ def _update_causal_mask(self, attention_mask, input_tensor):
or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
)
if not is_tracing and torch.any(attention_mask != 1):
- # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
- causal_mask = causal_mask.mul(~torch.all(causal_mask == min_dtype, dim=-1, keepdim=True)).to(dtype)
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
return causal_mask
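The new `_unmask_unattended` no longer needs the 2D attention mask: any row of the float mask that is entirely `min_dtype` (a fully masked query row, as produced by left padding) is simply zeroed out so that SDPA's memory-efficient kernel does not return NaNs for it. A toy reproduction of that one-liner:

```python
import torch

min_dtype = torch.finfo(torch.float32).min

# 4D float mask [batch, num_masks, query_len, kv_len] for one left-padded sample:
# the first query row is the padding token and attends to nothing.
mask = torch.tensor([[[[min_dtype, min_dtype, min_dtype],
                       [min_dtype, 0.0,       min_dtype],
                       [min_dtype, 0.0,       0.0      ]]]])

# Rows that are fully masked are multiplied by 0, i.e. they attend everywhere;
# the other rows keep their causal/padding pattern untouched.
unmasked = mask.mul(~torch.all(mask == min_dtype, dim=-1, keepdim=True))
print(unmasked[0, 0])
```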
From 2209b7afa04b3a6366350065f541e9248d6663c2 Mon Sep 17 00:00:00 2001
From: Michael
Date: Thu, 29 Feb 2024 01:41:18 +0800
Subject: [PATCH 047/549] [i18n-zh] Sync source/zh/index.md (#29331)
* [i18n-zh] Sync source/zh/index.md
* apply review comments
---
docs/source/zh/index.md | 613 ++++++++++++++++++----------------------
1 file changed, 268 insertions(+), 345 deletions(-)
diff --git a/docs/source/zh/index.md b/docs/source/zh/index.md
index 549d6e6371f54b..3750e506b0ea04 100644
--- a/docs/source/zh/index.md
+++ b/docs/source/zh/index.md
@@ -37,7 +37,7 @@ rendered properly in your Markdown viewer.
## 目录
-这篇文档被组织为以下5个章节:
+这篇文档由以下 5 个章节组成:
- **开始使用** 包含了库的快速上手和安装说明,便于配置和运行。
- **教程** 是一个初学者开始的好地方。本章节将帮助你获得你会用到的使用这个库的基本技能。
@@ -45,354 +45,277 @@ rendered properly in your Markdown viewer.
- **概念指南** 对 🤗 Transformers 的模型,任务和设计理念背后的基本概念和思想做了更多的讨论和解释。
- **API 介绍** 描述了所有的类和函数:
- - **MAIN CLASSES** 详述了配置(configuration)、模型(model)、分词器(tokenizer)和流水线(pipeline)这几个最重要的类。
- - **MODELS** 详述了在这个库中和每个模型实现有关的类和函数。
- - **INTERNAL HELPERS** 详述了内部使用的工具类和函数。
+ - **主要类别** 详述了配置(configuration)、模型(model)、分词器(tokenizer)和流水线(pipeline)这几个最重要的类。
+ - **模型** 详述了在这个库中和每个模型实现有关的类和函数。
+ - **内部帮助** 详述了内部使用的工具类和函数。
-### 支持的模型
+### 支持的模型和框架
-
-
-1. **[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
-1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
-1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
-1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
-1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
-1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
-1. **[BEiT](model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei.
-1. **[BERT](model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
-1. **[BERT For Sequence Generation](model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
-1. **[BERTweet](model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen.
-1. **[BigBird-Pegasus](model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
-1. **[BigBird-RoBERTa](model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
-1. **[BioGpt](model_doc/biogpt)** (from Microsoft Research AI4Science) released with the paper [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.
-1. **[BiT](model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby.
-1. **[Blenderbot](model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
-1. **[BlenderbotSmall](model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
-1. **[BLIP](model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
-1. **[BLOOM](model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/).
-1. **[BORT](model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
-1. **[ByT5](model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
-1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
-1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
-1. **[Chinese-CLIP](model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
-1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
-1. **[CLIPSeg](model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
-1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
-1. **[Conditional DETR](model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
-1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
-1. **[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
-1. **[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie.
-1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
-1. **[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
-1. **[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.
-1. **[Data2Vec](model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
-1. **[DeBERTa](model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
-1. **[DeBERTa-v2](model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
-1. **[Decision Transformer](model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
-1. **[Deformable DETR](model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai.
-1. **[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
-1. **[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
-1. **[DialoGPT](model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
-1. **[DiNAT](model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi.
-1. **[DistilBERT](model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
-1. **[DiT](model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
-1. **[Donut](model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
-1. **[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
-1. **[DPT](master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
-1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
-1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
-1. **[ERNIE](model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu.
-1. **[ESM](model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives.
-1. **[FLAN-T5](model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
-1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
-1. **[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
-1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
-1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
-1. **[GIT](model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
-1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
-1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
-1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
-1. **[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
-1. **[GPT NeoX Japanese](model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
-1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
-1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
-1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
-1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
-1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
-1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
-1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
-1. **[Jukebox](model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
-1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
-1. **[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
-1. **[LayoutLMv3](model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
-1. **[LayoutXLM](model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
-1. **[LED](model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-1. **[LeViT](model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze.
-1. **[LiLT](model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding.
-1. **[Longformer](model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-1. **[LongT5](model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
-1. **[LUKE](model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
-1. **[LXMERT](model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
-1. **[M-CTC-T](model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
-1. **[M2M100](model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
-1. **[MarianMT](model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
-1. **[MarkupLM](model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.
-1. **[Mask2Former](model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
-1. **[MaskFormer](model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
-1. **[mBART](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-1. **[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
-1. **[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
-1. **[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
-1. **[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
-1. **[MobileBERT](model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
-1. **[MobileNetV1](model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
-1. **[MobileNetV2](model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.
-1. **[MobileViT](model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
-1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
-1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
-1. **[MVP](model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
-1. **[NAT](model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
-1. **[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
-1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
-1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
-1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
-1. **[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
-1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
-1. **[PEGASUS-X](model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu.
-1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
-1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
-1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
-1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
-1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
-1. **[RAG](model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
-1. **[REALM](model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
-1. **[Reformer](model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-1. **[RegNet](model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Space](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár.
-1. **[RemBERT](model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
-1. **[ResNet](model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
-1. **[RoBERTa](model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-1. **[RoBERTa-PreLayerNorm](model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli.
-1. **[RoCBert](model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
-1. **[RoFormer](model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
-1. **[SegFormer](model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
-1. **[SEW](model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
-1. **[SEW-D](model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
-1. **[SpeechToTextTransformer](model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
-1. **[SpeechToTextTransformer2](model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
-1. **[Splinter](model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
-1. **[SqueezeBERT](model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
-1. **[Swin Transformer](model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
-1. **[Swin Transformer V2](model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
-1. **[Swin2SR](model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
-1. **[SwitchTransformers](model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer.
-1. **[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
-1. **[T5v1.1](model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
-1. **[Table Transformer](model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
-1. **[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
-1. **[TAPEX](model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
-1. **[Time Series Transformer](model_doc/time_series_transformer)** (from HuggingFace).
-1. **[TimeSformer](model_doc/timesformer)** (from Facebook) released with the paper [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani.
-1. **[Trajectory Transformer](model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine
-1. **[Transformer-XL](model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
-1. **[UL2](model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
-1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
-1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
-1. **[UPerNet](model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
-1. **[VAN](model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
-1. **[VideoMAE](model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
-1. **[ViLT](model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
-1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
-1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
-1. **[ViT Hybrid](model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
-1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
-1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
-1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
-1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
-1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
-1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
-1. **[Whisper](model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
-1. **[X-CLIP](model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
-1. **[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
-1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
-1. **[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-1. **[XLM-RoBERTa](model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
-1. **[XLM-RoBERTa-XL](model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
-1. **[XLNet](model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
-1. **[XLS-R](model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
-1. **[XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
-1. **[YOLOS](model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
-1. **[YOSO](model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.
-
-
-### Supported frameworks
-
-The table below shows the level of support each model has in the library, such as whether it has a Python ("slow") tokenizer (the "Tokenizer slow" column), a fast tokenizer backed by the 🤗 Tokenizers library (the "Tokenizer fast" column), and whether it supports Jax (via
-Flax), PyTorch and TensorFlow.
+The table below shows whether each model in the library has PyTorch, TensorFlow, and Jax (via Flax) support.
-| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
-|:-----------------------------:|:--------------:|:--------------:|:---------------:|:------------------:|:------------:|
-| ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
-| AltCLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Audio Spectrogram Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
-| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
-| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Bert Generation | ✅ | ❌ | ✅ | ❌ | ❌ |
-| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ |
-| BigBird-Pegasus | ❌ | ❌ | ✅ | ❌ | ❌ |
-| BioGpt | ✅ | ❌ | ✅ | ❌ | ❌ |
-| BiT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Blenderbot | ✅ | ✅ | ✅ | ✅ | ✅ |
-| BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ✅ |
-| BLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
-| BLOOM | ❌ | ✅ | ✅ | ❌ | ❌ |
-| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
-| CANINE | ✅ | ❌ | ✅ | ❌ | ❌ |
-| Chinese-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
-| CLIP | ✅ | ✅ | ✅ | ✅ | ✅ |
-| CLIPSeg | ❌ | ❌ | ✅ | ❌ | ❌ |
-| CodeGen | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Conditional DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
-| ConvNeXT | ❌ | ❌ | ✅ | ✅ | ❌ |
-| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
-| CvT | ❌ | ❌ | ✅ | ✅ | ❌ |
-| Data2VecAudio | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Data2VecText | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Data2VecVision | ❌ | ❌ | ✅ | ✅ | ❌ |
-| DeBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
-| DeBERTa-v2 | ✅ | ✅ | ✅ | ✅ | ❌ |
-| Decision Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Deformable DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
-| DeiT | ❌ | ❌ | ✅ | ✅ | ❌ |
-| DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
-| DiNAT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| DistilBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
-| DonutSwin | ❌ | ❌ | ✅ | ❌ | ❌ |
-| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
-| DPT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ELECTRA | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
-| ERNIE | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ESM | ✅ | ❌ | ✅ | ✅ | ❌ |
-| FairSeq Machine-Translation | ✅ | ❌ | ✅ | ❌ | ❌ |
-| FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ |
-| FLAVA | ❌ | ❌ | ✅ | ❌ | ❌ |
-| FNet | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
-| GIT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| GLPN | ❌ | ❌ | ✅ | ❌ | ❌ |
-| GPT Neo | ❌ | ❌ | ✅ | ❌ | ✅ |
-| GPT NeoX | ❌ | ✅ | ✅ | ❌ | ❌ |
-| GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ |
-| GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ |
-| GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ |
-| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
-| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
-| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ImageGPT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Jukebox | ✅ | ❌ | ✅ | ❌ | ❌ |
-| LayoutLM | ✅ | ✅ | ✅ | ✅ | ❌ |
-| LayoutLMv2 | ✅ | ✅ | ✅ | ❌ | ❌ |
-| LayoutLMv3 | ✅ | ✅ | ✅ | ✅ | ❌ |
-| LED | ✅ | ✅ | ✅ | ✅ | ❌ |
-| LeViT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| LiLT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Longformer | ✅ | ✅ | ✅ | ✅ | ❌ |
-| LongT5 | ❌ | ❌ | ✅ | ❌ | ✅ |
-| LUKE | ✅ | ❌ | ✅ | ❌ | ❌ |
-| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
-| M-CTC-T | ❌ | ❌ | ✅ | ❌ | ❌ |
-| M2M100 | ✅ | ❌ | ✅ | ❌ | ❌ |
-| Marian | ✅ | ❌ | ✅ | ✅ | ✅ |
-| MarkupLM | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Mask2Former | ❌ | ❌ | ✅ | ❌ | ❌ |
-| MaskFormer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| MaskFormerSwin | ❌ | ❌ | ❌ | ❌ | ❌ |
-| mBART | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Megatron-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
-| MobileNetV1 | ❌ | ❌ | ✅ | ❌ | ❌ |
-| MobileNetV2 | ❌ | ❌ | ✅ | ❌ | ❌ |
-| MobileViT | ❌ | ❌ | ✅ | ✅ | ❌ |
-| MPNet | ✅ | ✅ | ✅ | ✅ | ❌ |
-| MT5 | ✅ | ✅ | ✅ | ✅ | ✅ |
-| MVP | ✅ | ✅ | ✅ | ❌ | ❌ |
-| NAT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Nezha | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Nyströmformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
-| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
-| OPT | ❌ | ❌ | ✅ | ✅ | ✅ |
-| OWL-ViT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
-| PEGASUS-X | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
-| PLBart | ✅ | ❌ | ✅ | ❌ | ❌ |
-| PoolFormer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
-| QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ |
-| RAG | ✅ | ❌ | ✅ | ✅ | ❌ |
-| REALM | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Reformer | ✅ | ✅ | ✅ | ❌ | ❌ |
-| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ |
-| RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
-| ResNet | ❌ | ❌ | ✅ | ✅ | ❌ |
-| RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
-| RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ |
-| RoBERTa-PreLayerNorm | ❌ | ❌ | ✅ | ✅ | ✅ |
-| RoCBert | ✅ | ❌ | ✅ | ❌ | ❌ |
-| RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ |
-| SegFormer | ❌ | ❌ | ✅ | ✅ | ❌ |
-| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
-| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ✅ |
-| Speech2Text | ✅ | ❌ | ✅ | ✅ | ❌ |
-| Speech2Text2 | ✅ | ❌ | ❌ | ❌ | ❌ |
-| Splinter | ✅ | ✅ | ✅ | ❌ | ❌ |
-| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
-| Swin Transformer | ❌ | ❌ | ✅ | ✅ | ❌ |
-| Swin Transformer V2 | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Swin2SR | ❌ | ❌ | ✅ | ❌ | ❌ |
-| SwitchTransformers | ❌ | ❌ | ✅ | ❌ | ❌ |
-| T5 | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Table Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| TAPAS | ✅ | ❌ | ✅ | ✅ | ❌ |
-| Time Series Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| TimeSformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Trajectory Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
-| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
-| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
-| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
-| UPerNet | ❌ | ❌ | ✅ | ❌ | ❌ |
-| VAN | ❌ | ❌ | ✅ | ❌ | ❌ |
-| VideoMAE | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ViLT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Vision Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
-| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
-| VisualBERT | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ViT | ❌ | ❌ | ✅ | ✅ | ✅ |
-| ViT Hybrid | ❌ | ❌ | ✅ | ❌ | ❌ |
-| ViTMAE | ❌ | ❌ | ✅ | ✅ | ❌ |
-| ViTMSN | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
-| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
-| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
-| Whisper | ✅ | ❌ | ✅ | ✅ | ❌ |
-| X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
-| XGLM | ✅ | ✅ | ✅ | ✅ | ✅ |
-| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
-| XLM-ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
-| XLM-RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ |
-| XLM-RoBERTa-XL | ❌ | ❌ | ✅ | ❌ | ❌ |
-| XLNet | ✅ | ✅ | ✅ | ✅ | ❌ |
-| YOLOS | ❌ | ❌ | ✅ | ❌ | ❌ |
-| YOSO | ❌ | ❌ | ✅ | ❌ | ❌ |
+| Model | PyTorch support | TensorFlow support | Flax support |
+|:------------------------------------------------------------------------:|:---------------:|:------------------:|:------------:|
+| [ALBERT](../en/model_doc/albert) | ✅ | ✅ | ✅ |
+| [ALIGN](../en/model_doc/align) | ✅ | ❌ | ❌ |
+| [AltCLIP](../en/model_doc/altclip) | ✅ | ❌ | ❌ |
+| [Audio Spectrogram Transformer](../en/model_doc/audio-spectrogram-transformer) | ✅ | ❌ | ❌ |
+| [Autoformer](../en/model_doc/autoformer) | ✅ | ❌ | ❌ |
+| [Bark](../en/model_doc/bark) | ✅ | ❌ | ❌ |
+| [BART](../en/model_doc/bart) | ✅ | ✅ | ✅ |
+| [BARThez](../en/model_doc/barthez) | ✅ | ✅ | ✅ |
+| [BARTpho](../en/model_doc/bartpho) | ✅ | ✅ | ✅ |
+| [BEiT](../en/model_doc/beit) | ✅ | ❌ | ✅ |
+| [BERT](../en/model_doc/bert) | ✅ | ✅ | ✅ |
+| [Bert Generation](../en/model_doc/bert-generation) | ✅ | ❌ | ❌ |
+| [BertJapanese](../en/model_doc/bert-japanese) | ✅ | ✅ | ✅ |
+| [BERTweet](../en/model_doc/bertweet) | ✅ | ✅ | ✅ |
+| [BigBird](../en/model_doc/big_bird) | ✅ | ❌ | ✅ |
+| [BigBird-Pegasus](../en/model_doc/bigbird_pegasus) | ✅ | ❌ | ❌ |
+| [BioGpt](../en/model_doc/biogpt) | ✅ | ❌ | ❌ |
+| [BiT](../en/model_doc/bit) | ✅ | ❌ | ❌ |
+| [Blenderbot](../en/model_doc/blenderbot) | ✅ | ✅ | ✅ |
+| [BlenderbotSmall](../en/model_doc/blenderbot-small) | ✅ | ✅ | ✅ |
+| [BLIP](../en/model_doc/blip) | ✅ | ✅ | ❌ |
+| [BLIP-2](../en/model_doc/blip-2) | ✅ | ❌ | ❌ |
+| [BLOOM](../en/model_doc/bloom) | ✅ | ❌ | ✅ |
+| [BORT](../en/model_doc/bort) | ✅ | ✅ | ✅ |
+| [BridgeTower](../en/model_doc/bridgetower) | ✅ | ❌ | ❌ |
+| [BROS](../en/model_doc/bros) | ✅ | ❌ | ❌ |
+| [ByT5](../en/model_doc/byt5) | ✅ | ✅ | ✅ |
+| [CamemBERT](../en/model_doc/camembert) | ✅ | ✅ | ❌ |
+| [CANINE](../en/model_doc/canine) | ✅ | ❌ | ❌ |
+| [Chinese-CLIP](../en/model_doc/chinese_clip) | ✅ | ❌ | ❌ |
+| [CLAP](../en/model_doc/clap) | ✅ | ❌ | ❌ |
+| [CLIP](../en/model_doc/clip) | ✅ | ✅ | ✅ |
+| [CLIPSeg](../en/model_doc/clipseg) | ✅ | ❌ | ❌ |
+| [CLVP](../en/model_doc/clvp) | ✅ | ❌ | ❌ |
+| [CodeGen](../en/model_doc/codegen) | ✅ | ❌ | ❌ |
+| [CodeLlama](../en/model_doc/code_llama) | ✅ | ❌ | ✅ |
+| [Conditional DETR](../en/model_doc/conditional_detr) | ✅ | ❌ | ❌ |
+| [ConvBERT](../en/model_doc/convbert) | ✅ | ✅ | ❌ |
+| [ConvNeXT](../en/model_doc/convnext) | ✅ | ✅ | ❌ |
+| [ConvNeXTV2](../en/model_doc/convnextv2) | ✅ | ✅ | ❌ |
+| [CPM](../en/model_doc/cpm) | ✅ | ✅ | ✅ |
+| [CPM-Ant](../en/model_doc/cpmant) | ✅ | ❌ | ❌ |
+| [CTRL](../en/model_doc/ctrl) | ✅ | ✅ | ❌ |
+| [CvT](../en/model_doc/cvt) | ✅ | ✅ | ❌ |
+| [Data2VecAudio](../en/model_doc/data2vec) | ✅ | ❌ | ❌ |
+| [Data2VecText](../en/model_doc/data2vec) | ✅ | ❌ | ❌ |
+| [Data2VecVision](../en/model_doc/data2vec) | ✅ | ✅ | ❌ |
+| [DeBERTa](../en/model_doc/deberta) | ✅ | ✅ | ❌ |
+| [DeBERTa-v2](../en/model_doc/deberta-v2) | ✅ | ✅ | ❌ |
+| [Decision Transformer](../en/model_doc/decision_transformer) | ✅ | ❌ | ❌ |
+| [Deformable DETR](../en/model_doc/deformable_detr) | ✅ | ❌ | ❌ |
+| [DeiT](../en/model_doc/deit) | ✅ | ✅ | ❌ |
+| [DePlot](../en/model_doc/deplot) | ✅ | ❌ | ❌ |
+| [Depth Anything](../en/model_doc/depth_anything) | ✅ | ❌ | ❌ |
+| [DETA](../en/model_doc/deta) | ✅ | ❌ | ❌ |
+| [DETR](../en/model_doc/detr) | ✅ | ❌ | ❌ |
+| [DialoGPT](../en/model_doc/dialogpt) | ✅ | ✅ | ✅ |
+| [DiNAT](../en/model_doc/dinat) | ✅ | ❌ | ❌ |
+| [DINOv2](../en/model_doc/dinov2) | ✅ | ❌ | ❌ |
+| [DistilBERT](../en/model_doc/distilbert) | ✅ | ✅ | ✅ |
+| [DiT](../en/model_doc/dit) | ✅ | ❌ | ✅ |
+| [DonutSwin](../en/model_doc/donut) | ✅ | ❌ | ❌ |
+| [DPR](../en/model_doc/dpr) | ✅ | ✅ | ❌ |
+| [DPT](../en/model_doc/dpt) | ✅ | ❌ | ❌ |
+| [EfficientFormer](../en/model_doc/efficientformer) | ✅ | ✅ | ❌ |
+| [EfficientNet](../en/model_doc/efficientnet) | ✅ | ❌ | ❌ |
+| [ELECTRA](../en/model_doc/electra) | ✅ | ✅ | ✅ |
+| [EnCodec](../en/model_doc/encodec) | ✅ | ❌ | ❌ |
+| [Encoder decoder](../en/model_doc/encoder-decoder) | ✅ | ✅ | ✅ |
+| [ERNIE](../en/model_doc/ernie) | ✅ | ❌ | ❌ |
+| [ErnieM](../en/model_doc/ernie_m) | ✅ | ❌ | ❌ |
+| [ESM](../en/model_doc/esm) | ✅ | ✅ | ❌ |
+| [FairSeq Machine-Translation](../en/model_doc/fsmt) | ✅ | ❌ | ❌ |
+| [Falcon](../en/model_doc/falcon) | ✅ | ❌ | ❌ |
+| [FastSpeech2Conformer](../en/model_doc/fastspeech2_conformer) | ✅ | ❌ | ❌ |
+| [FLAN-T5](../en/model_doc/flan-t5) | ✅ | ✅ | ✅ |
+| [FLAN-UL2](../en/model_doc/flan-ul2) | ✅ | ✅ | ✅ |
+| [FlauBERT](../en/model_doc/flaubert) | ✅ | ✅ | ❌ |
+| [FLAVA](../en/model_doc/flava) | ✅ | ❌ | ❌ |
+| [FNet](../en/model_doc/fnet) | ✅ | ❌ | ❌ |
+| [FocalNet](../en/model_doc/focalnet) | ✅ | ❌ | ❌ |
+| [Funnel Transformer](../en/model_doc/funnel) | ✅ | ✅ | ❌ |
+| [Fuyu](../en/model_doc/fuyu) | ✅ | ❌ | ❌ |
+| [Gemma](../en/model_doc/gemma) | ✅ | ❌ | ✅ |
+| [GIT](../en/model_doc/git) | ✅ | ❌ | ❌ |
+| [GLPN](../en/model_doc/glpn) | ✅ | ❌ | ❌ |
+| [GPT Neo](../en/model_doc/gpt_neo) | ✅ | ❌ | ✅ |
+| [GPT NeoX](../en/model_doc/gpt_neox) | ✅ | ❌ | ❌ |
+| [GPT NeoX Japanese](../en/model_doc/gpt_neox_japanese) | ✅ | ❌ | ❌ |
+| [GPT-J](../en/model_doc/gptj) | ✅ | ✅ | ✅ |
+| [GPT-Sw3](../en/model_doc/gpt-sw3) | ✅ | ✅ | ✅ |
+| [GPTBigCode](../en/model_doc/gpt_bigcode) | ✅ | ❌ | ❌ |
+| [GPTSAN-japanese](../en/model_doc/gptsan-japanese) | ✅ | ❌ | ❌ |
+| [Graphormer](../en/model_doc/graphormer) | ✅ | ❌ | ❌ |
+| [GroupViT](../en/model_doc/groupvit) | ✅ | ✅ | ❌ |
+| [HerBERT](../en/model_doc/herbert) | ✅ | ✅ | ✅ |
+| [Hubert](../en/model_doc/hubert) | ✅ | ✅ | ❌ |
+| [I-BERT](../en/model_doc/ibert) | ✅ | ❌ | ❌ |
+| [IDEFICS](../en/model_doc/idefics) | ✅ | ❌ | ❌ |
+| [ImageGPT](../en/model_doc/imagegpt) | ✅ | ❌ | ❌ |
+| [Informer](../en/model_doc/informer) | ✅ | ❌ | ❌ |
+| [InstructBLIP](../en/model_doc/instructblip) | ✅ | ❌ | ❌ |
+| [Jukebox](../en/model_doc/jukebox) | ✅ | ❌ | ❌ |
+| [KOSMOS-2](../en/model_doc/kosmos-2) | ✅ | ❌ | ❌ |
+| [LayoutLM](../en/model_doc/layoutlm) | ✅ | ✅ | ❌ |
+| [LayoutLMv2](../en/model_doc/layoutlmv2) | ✅ | ❌ | ❌ |
+| [LayoutLMv3](../en/model_doc/layoutlmv3) | ✅ | ✅ | ❌ |
+| [LayoutXLM](../en/model_doc/layoutxlm) | ✅ | ❌ | ❌ |
+| [LED](../en/model_doc/led) | ✅ | ✅ | ❌ |
+| [LeViT](../en/model_doc/levit) | ✅ | ❌ | ❌ |
+| [LiLT](../en/model_doc/lilt) | ✅ | ❌ | ❌ |
+| [LLaMA](../en/model_doc/llama) | ✅ | ❌ | ✅ |
+| [Llama2](../en/model_doc/llama2) | ✅ | ❌ | ✅ |
+| [LLaVa](../en/model_doc/llava) | ✅ | ❌ | ❌ |
+| [Longformer](../en/model_doc/longformer) | ✅ | ✅ | ❌ |
+| [LongT5](../en/model_doc/longt5) | ✅ | ❌ | ✅ |
+| [LUKE](../en/model_doc/luke) | ✅ | ❌ | ❌ |
+| [LXMERT](../en/model_doc/lxmert) | ✅ | ✅ | ❌ |
+| [M-CTC-T](../en/model_doc/mctct) | ✅ | ❌ | ❌ |
+| [M2M100](../en/model_doc/m2m_100) | ✅ | ❌ | ❌ |
+| [MADLAD-400](../en/model_doc/madlad-400) | ✅ | ✅ | ✅ |
+| [Marian](../en/model_doc/marian) | ✅ | ✅ | ✅ |
+| [MarkupLM](../en/model_doc/markuplm) | ✅ | ❌ | ❌ |
+| [Mask2Former](../en/model_doc/mask2former) | ✅ | ❌ | ❌ |
+| [MaskFormer](../en/model_doc/maskformer) | ✅ | ❌ | ❌ |
+| [MatCha](../en/model_doc/matcha) | ✅ | ❌ | ❌ |
+| [mBART](../en/model_doc/mbart) | ✅ | ✅ | ✅ |
+| [mBART-50](../en/model_doc/mbart50) | ✅ | ✅ | ✅ |
+| [MEGA](../en/model_doc/mega) | ✅ | ❌ | ❌ |
+| [Megatron-BERT](../en/model_doc/megatron-bert) | ✅ | ❌ | ❌ |
+| [Megatron-GPT2](../en/model_doc/megatron_gpt2) | ✅ | ✅ | ✅ |
+| [MGP-STR](../en/model_doc/mgp-str) | ✅ | ❌ | ❌ |
+| [Mistral](../en/model_doc/mistral) | ✅ | ❌ | ✅ |
+| [Mixtral](../en/model_doc/mixtral) | ✅ | ❌ | ❌ |
+| [mLUKE](../en/model_doc/mluke) | ✅ | ❌ | ❌ |
+| [MMS](../en/model_doc/mms) | ✅ | ✅ | ✅ |
+| [MobileBERT](../en/model_doc/mobilebert) | ✅ | ✅ | ❌ |
+| [MobileNetV1](../en/model_doc/mobilenet_v1) | ✅ | ❌ | ❌ |
+| [MobileNetV2](../en/model_doc/mobilenet_v2) | ✅ | ❌ | ❌ |
+| [MobileViT](../en/model_doc/mobilevit) | ✅ | ✅ | ❌ |
+| [MobileViTV2](../en/model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
+| [MPNet](../en/model_doc/mpnet) | ✅ | ✅ | ❌ |
+| [MPT](../en/model_doc/mpt) | ✅ | ❌ | ❌ |
+| [MRA](../en/model_doc/mra) | ✅ | ❌ | ❌ |
+| [MT5](../en/model_doc/mt5) | ✅ | ✅ | ✅ |
+| [MusicGen](../en/model_doc/musicgen) | ✅ | ❌ | ❌ |
+| [MVP](../en/model_doc/mvp) | ✅ | ❌ | ❌ |
+| [NAT](../en/model_doc/nat) | ✅ | ❌ | ❌ |
+| [Nezha](../en/model_doc/nezha) | ✅ | ❌ | ❌ |
+| [NLLB](../en/model_doc/nllb) | ✅ | ❌ | ❌ |
+| [NLLB-MOE](../en/model_doc/nllb-moe) | ✅ | ❌ | ❌ |
+| [Nougat](../en/model_doc/nougat) | ✅ | ✅ | ✅ |
+| [Nyströmformer](../en/model_doc/nystromformer) | ✅ | ❌ | ❌ |
+| [OneFormer](../en/model_doc/oneformer) | ✅ | ❌ | ❌ |
+| [OpenAI GPT](../en/model_doc/openai-gpt) | ✅ | ✅ | ❌ |
+| [OpenAI GPT-2](../en/model_doc/gpt2) | ✅ | ✅ | ✅ |
+| [OpenLlama](../en/model_doc/open-llama) | ✅ | ❌ | ❌ |
+| [OPT](../en/model_doc/opt) | ✅ | ✅ | ✅ |
+| [OWL-ViT](../en/model_doc/owlvit) | ✅ | ❌ | ❌ |
+| [OWLv2](../en/model_doc/owlv2) | ✅ | ❌ | ❌ |
+| [PatchTSMixer](../en/model_doc/patchtsmixer) | ✅ | ❌ | ❌ |
+| [PatchTST](../en/model_doc/patchtst) | ✅ | ❌ | ❌ |
+| [Pegasus](../en/model_doc/pegasus) | ✅ | ✅ | ✅ |
+| [PEGASUS-X](../en/model_doc/pegasus_x) | ✅ | ❌ | ❌ |
+| [Perceiver](../en/model_doc/perceiver) | ✅ | ❌ | ❌ |
+| [Persimmon](../en/model_doc/persimmon) | ✅ | ❌ | ❌ |
+| [Phi](../en/model_doc/phi) | ✅ | ❌ | ❌ |
+| [PhoBERT](../en/model_doc/phobert) | ✅ | ✅ | ✅ |
+| [Pix2Struct](../en/model_doc/pix2struct) | ✅ | ❌ | ❌ |
+| [PLBart](../en/model_doc/plbart) | ✅ | ❌ | ❌ |
+| [PoolFormer](../en/model_doc/poolformer) | ✅ | ❌ | ❌ |
+| [Pop2Piano](../en/model_doc/pop2piano) | ✅ | ❌ | ❌ |
+| [ProphetNet](../en/model_doc/prophetnet) | ✅ | ❌ | ❌ |
+| [PVT](../en/model_doc/pvt) | ✅ | ❌ | ❌ |
+| [QDQBert](../en/model_doc/qdqbert) | ✅ | ❌ | ❌ |
+| [Qwen2](../en/model_doc/qwen2) | ✅ | ❌ | ❌ |
+| [RAG](../en/model_doc/rag) | ✅ | ✅ | ❌ |
+| [REALM](../en/model_doc/realm) | ✅ | ❌ | ❌ |
+| [Reformer](../en/model_doc/reformer) | ✅ | ❌ | ❌ |
+| [RegNet](../en/model_doc/regnet) | ✅ | ✅ | ✅ |
+| [RemBERT](../en/model_doc/rembert) | ✅ | ✅ | ❌ |
+| [ResNet](../en/model_doc/resnet) | ✅ | ✅ | ✅ |
+| [RetriBERT](../en/model_doc/retribert) | ✅ | ❌ | ❌ |
+| [RoBERTa](../en/model_doc/roberta) | ✅ | ✅ | ✅ |
+| [RoBERTa-PreLayerNorm](../en/model_doc/roberta-prelayernorm) | ✅ | ✅ | ✅ |
+| [RoCBert](../en/model_doc/roc_bert) | ✅ | ❌ | ❌ |
+| [RoFormer](../en/model_doc/roformer) | ✅ | ✅ | ✅ |
+| [RWKV](../en/model_doc/rwkv) | ✅ | ❌ | ❌ |
+| [SAM](../en/model_doc/sam) | ✅ | ✅ | ❌ |
+| [SeamlessM4T](../en/model_doc/seamless_m4t) | ✅ | ❌ | ❌ |
+| [SeamlessM4Tv2](../en/model_doc/seamless_m4t_v2) | ✅ | ❌ | ❌ |
+| [SegFormer](../en/model_doc/segformer) | ✅ | ✅ | ❌ |
+| [SegGPT](../en/model_doc/seggpt) | ✅ | ❌ | ❌ |
+| [SEW](../en/model_doc/sew) | ✅ | ❌ | ❌ |
+| [SEW-D](../en/model_doc/sew-d) | ✅ | ❌ | ❌ |
+| [SigLIP](../en/model_doc/siglip) | ✅ | ❌ | ❌ |
+| [Speech Encoder decoder](../en/model_doc/speech-encoder-decoder) | ✅ | ❌ | ✅ |
+| [Speech2Text](../en/model_doc/speech_to_text) | ✅ | ✅ | ❌ |
+| [SpeechT5](../en/model_doc/speecht5) | ✅ | ❌ | ❌ |
+| [Splinter](../en/model_doc/splinter) | ✅ | ❌ | ❌ |
+| [SqueezeBERT](../en/model_doc/squeezebert) | ✅ | ❌ | ❌ |
+| [StableLm](../en/model_doc/stablelm) | ✅ | ❌ | ❌ |
+| [Starcoder2](../en/model_doc/starcoder2) | ✅ | ❌ | ❌ |
+| [SwiftFormer](../en/model_doc/swiftformer) | ✅ | ❌ | ❌ |
+| [Swin Transformer](../en/model_doc/swin) | ✅ | ✅ | ❌ |
+| [Swin Transformer V2](../en/model_doc/swinv2) | ✅ | ❌ | ❌ |
+| [Swin2SR](../en/model_doc/swin2sr) | ✅ | ❌ | ❌ |
+| [SwitchTransformers](../en/model_doc/switch_transformers) | ✅ | ❌ | ❌ |
+| [T5](../en/model_doc/t5) | ✅ | ✅ | ✅ |
+| [T5v1.1](../en/model_doc/t5v1.1) | ✅ | ✅ | ✅ |
+| [Table Transformer](../en/model_doc/table-transformer) | ✅ | ❌ | ❌ |
+| [TAPAS](../en/model_doc/tapas) | ✅ | ✅ | ❌ |
+| [TAPEX](../en/model_doc/tapex) | ✅ | ✅ | ✅ |
+| [Time Series Transformer](../en/model_doc/time_series_transformer) | ✅ | ❌ | ❌ |
+| [TimeSformer](../en/model_doc/timesformer) | ✅ | ❌ | ❌ |
+| [Trajectory Transformer](../en/model_doc/trajectory_transformer) | ✅ | ❌ | ❌ |
+| [Transformer-XL](../en/model_doc/transfo-xl) | ✅ | ✅ | ❌ |
+| [TrOCR](../en/model_doc/trocr) | ✅ | ❌ | ❌ |
+| [TVLT](../en/model_doc/tvlt) | ✅ | ❌ | ❌ |
+| [TVP](../en/model_doc/tvp) | ✅ | ❌ | ❌ |
+| [UL2](../en/model_doc/ul2) | ✅ | ✅ | ✅ |
+| [UMT5](../en/model_doc/umt5) | ✅ | ❌ | ❌ |
+| [UniSpeech](../en/model_doc/unispeech) | ✅ | ❌ | ❌ |
+| [UniSpeechSat](../en/model_doc/unispeech-sat) | ✅ | ❌ | ❌ |
+| [UnivNet](../en/model_doc/univnet) | ✅ | ❌ | ❌ |
+| [UPerNet](../en/model_doc/upernet) | ✅ | ❌ | ❌ |
+| [VAN](../en/model_doc/van) | ✅ | ❌ | ❌ |
+| [VideoMAE](../en/model_doc/videomae) | ✅ | ❌ | ❌ |
+| [ViLT](../en/model_doc/vilt) | ✅ | ❌ | ❌ |
+| [VipLlava](../en/model_doc/vipllava) | ✅ | ❌ | ❌ |
+| [Vision Encoder decoder](../en/model_doc/vision-encoder-decoder) | ✅ | ✅ | ✅ |
+| [VisionTextDualEncoder](../en/model_doc/vision-text-dual-encoder) | ✅ | ✅ | ✅ |
+| [VisualBERT](../en/model_doc/visual_bert) | ✅ | ❌ | ❌ |
+| [ViT](../en/model_doc/vit) | ✅ | ✅ | ✅ |
+| [ViT Hybrid](../en/model_doc/vit_hybrid) | ✅ | ❌ | ❌ |
+| [VitDet](../en/model_doc/vitdet) | ✅ | ❌ | ❌ |
+| [ViTMAE](../en/model_doc/vit_mae) | ✅ | ✅ | ❌ |
+| [ViTMatte](../en/model_doc/vitmatte) | ✅ | ❌ | ❌ |
+| [ViTMSN](../en/model_doc/vit_msn) | ✅ | ❌ | ❌ |
+| [VITS](../en/model_doc/vits) | ✅ | ❌ | ❌ |
+| [ViViT](../en/model_doc/vivit) | ✅ | ❌ | ❌ |
+| [Wav2Vec2](../en/model_doc/wav2vec2) | ✅ | ✅ | ✅ |
+| [Wav2Vec2-BERT](../en/model_doc/wav2vec2-bert) | ✅ | ❌ | ❌ |
+| [Wav2Vec2-Conformer](../en/model_doc/wav2vec2-conformer) | ✅ | ❌ | ❌ |
+| [Wav2Vec2Phoneme](../en/model_doc/wav2vec2_phoneme) | ✅ | ✅ | ✅ |
+| [WavLM](../en/model_doc/wavlm) | ✅ | ❌ | ❌ |
+| [Whisper](../en/model_doc/whisper) | ✅ | ✅ | ✅ |
+| [X-CLIP](../en/model_doc/xclip) | ✅ | ❌ | ❌ |
+| [X-MOD](../en/model_doc/xmod) | ✅ | ❌ | ❌ |
+| [XGLM](../en/model_doc/xglm) | ✅ | ✅ | ✅ |
+| [XLM](../en/model_doc/xlm) | ✅ | ✅ | ❌ |
+| [XLM-ProphetNet](../en/model_doc/xlm-prophetnet) | ✅ | ❌ | ❌ |
+| [XLM-RoBERTa](../en/model_doc/xlm-roberta) | ✅ | ✅ | ✅ |
+| [XLM-RoBERTa-XL](../en/model_doc/xlm-roberta-xl) | ✅ | ❌ | ❌ |
+| [XLM-V](../en/model_doc/xlm-v) | ✅ | ✅ | ✅ |
+| [XLNet](../en/model_doc/xlnet) | ✅ | ✅ | ❌ |
+| [XLS-R](../en/model_doc/xls_r) | ✅ | ✅ | ✅ |
+| [XLSR-Wav2Vec2](../en/model_doc/xlsr_wav2vec2) | ✅ | ✅ | ✅ |
+| [YOLOS](../en/model_doc/yolos) | ✅ | ❌ | ❌ |
+| [YOSO](../en/model_doc/yoso) | ✅ | ❌ | ❌ |
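
The regenerated table above tracks only backend coverage (PyTorch, TensorFlow, Flax). A quick way to cross-check a row against the installed `transformers` version is to look up the model type in the auto-mapping name dictionaries; these dictionaries are internal, so the module paths below are assumptions that may change between releases, and the snippet is a sketch rather than an official API:

```python
# Cross-check one row of the table above against the installed transformers.
# The *_MAPPING_NAMES dicts are internal implementation details (module paths
# assumed from early-2024 releases), so treat this as a sketch.
from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.auto.modeling_flax_auto import FLAX_MODEL_MAPPING_NAMES
from transformers.models.auto.modeling_tf_auto import TF_MODEL_MAPPING_NAMES

model_type = "mistral"  # table row: PyTorch ✅ / TensorFlow ❌ / Flax ✅
print("PyTorch:   ", model_type in MODEL_MAPPING_NAMES)
print("TensorFlow:", model_type in TF_MODEL_MAPPING_NAMES)
print("Flax:      ", model_type in FLAX_MODEL_MAPPING_NAMES)
```

For end users the public `AutoModel`, `TFAutoModel`, and `FlaxAutoModel` classes give the same answer: loading a checkpoint with a backend that has no implementation typically raises a `ValueError` listing the supported configuration classes.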
From 1aee9afd1c1d588f0e105af0ddbd6247e6e9a032 Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Thu, 29 Feb 2024 03:52:13 +0100
Subject: [PATCH 048/549] FIX [`CI` / `starcoder2`] Change starcoder2 path to
correct one for slow tests (#29359)
change starcoder2 path to correct one
---
tests/models/starcoder2/test_modeling_starcoder2.py | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tests/models/starcoder2/test_modeling_starcoder2.py b/tests/models/starcoder2/test_modeling_starcoder2.py
index dfedb2ed788a47..f0794c46dcee63 100644
--- a/tests/models/starcoder2/test_modeling_starcoder2.py
+++ b/tests/models/starcoder2/test_modeling_starcoder2.py
@@ -473,7 +473,7 @@ def test_starcoder2_batched_generation_sdpa(self):
"Hello my name is Younes and I am a student at the University of Liverpool. I am currently studying for my MSc in Computer Science. I am interested in the field of Machine Learning and I am currently working on",
"def hello_world():\n\treturn 'Hello World!'\n\n@app.route('/hello/')\ndef hello_name(name):\n\treturn 'Hello %s!' % name\n\n@app",
]
- model_id = "bigcode/starcoder2-7b_16k"
+ model_id = "bigcode/starcoder2-7b"
model = Starcoder2ForCausalLM.from_pretrained(
model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa"
@@ -493,7 +493,7 @@ def test_starcoder2_batched_generation_eager(self):
"Hello my name is Younes and I am a student at the University of Liverpool. I am currently studying for my MSc in Computer Science. I am interested in the field of Machine Learning and I am currently working on",
"def hello_world():\n\treturn 'Hello World!'\n\n@app.route('/hello/')\ndef hello_name(name):\n\treturn 'Hello %s!' % name\n\n@app",
]
- model_id = "bigcode/starcoder2-7b_16k"
+ model_id = "bigcode/starcoder2-7b"
model = Starcoder2ForCausalLM.from_pretrained(
model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="eager"
@@ -514,7 +514,7 @@ def test_starcoder2_batched_generation_fa2(self):
"Hello my name is Younes and I am a student at the University of Liverpool. I am currently studying for my MSc in Computer Science. I am interested in the field of Machine Learning and I am currently working on",
"def hello_world():\n\treturn 'Hello World!'\n\n@app.route('/hello/')\ndef hello_name(name):\n\treturn 'Hello %s!' % name\n\n@app",
]
- model_id = "bigcode/starcoder2-7b_16k"
+ model_id = "bigcode/starcoder2-7b"
model = Starcoder2ForCausalLM.from_pretrained(
model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="flash_attention_2"
@@ -535,7 +535,7 @@ def test_starcoder2_batched_generation_4bit(self):
'Hello my name is Younes and I am a student at the University of Maryland. I am currently working on a project that is related to the topic of "How to make a game". I am currently working on a project',
'def hello_world():\n\treturn "Hello World"\n\n@app.route(\'/hello/\')\ndef hello_name(name):\n\treturn "Hello " + name\n\n@app.route',
]
- model_id = "bigcode/starcoder2-7b_16k"
+ model_id = "bigcode/starcoder2-7b"
model = Starcoder2ForCausalLM.from_pretrained(model_id, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
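
The fix points all four slow tests at the released `bigcode/starcoder2-7b` repository id. As a minimal sketch (not part of the patch), the corrected id can be confirmed to resolve on the Hub before launching the slow suite, using the standard `huggingface_hub` client:

```python
# Sanity-check that the corrected checkpoint id resolves on the Hub.
# Illustrative only; the slow tests themselves download the full weights.
from huggingface_hub import HfApi

model_id = "bigcode/starcoder2-7b"
info = HfApi().model_info(model_id)  # raises an error if the repo id is unknown
print(f"{model_id} resolved at revision {info.sha}")
```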
From 8d8ac9c2df13447a6e5f676dbad2907dcebdc77e Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Thu, 29 Feb 2024 03:56:16 +0100
Subject: [PATCH 049/549] FIX [`CI`]: Fix failing tests for peft integration
(#29330)
fix failing tests for peft integration
---
tests/peft_integration/test_peft_integration.py | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/tests/peft_integration/test_peft_integration.py b/tests/peft_integration/test_peft_integration.py
index 50890b886ccf4b..602ed04d9c6271 100644
--- a/tests/peft_integration/test_peft_integration.py
+++ b/tests/peft_integration/test_peft_integration.py
@@ -19,7 +19,14 @@
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, OPTForCausalLM
-from transformers.testing_utils import require_peft, require_torch, require_torch_gpu, slow, torch_device
+from transformers.testing_utils import (
+ require_bitsandbytes,
+ require_peft,
+ require_torch,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
from transformers.utils import is_torch_available
@@ -335,6 +342,7 @@ def test_peft_add_multi_adapter(self):
model.save_pretrained(tmpdirname)
@require_torch_gpu
+ @require_bitsandbytes
def test_peft_from_pretrained_kwargs(self):
"""
Simple test that tests the basic usage of PEFT model through `from_pretrained` + additional kwargs
@@ -352,6 +360,7 @@ def test_peft_from_pretrained_kwargs(self):
_ = peft_model.generate(input_ids=torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]]).to(torch_device))
@require_torch_gpu
+ @require_bitsandbytes
def test_peft_save_quantized(self):
"""
Simple test that tests the basic usage of PEFT model save_pretrained with quantized base models
@@ -390,6 +399,7 @@ def test_peft_save_quantized(self):
self.assertTrue("model.safetensors" not in os.listdir(tmpdirname))
@require_torch_gpu
+ @require_bitsandbytes
def test_peft_save_quantized_regression(self):
"""
Simple test that tests the basic usage of PEFT model save_pretrained with quantized base models
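The fix stacks one more skip decorator on the GPU quantization tests so they are skipped whenever bitsandbytes is missing. For context, a minimal sketch of that decorator-stacking pattern is below; the test class and body are hypothetical and not taken from the patch.

```python
# Hypothetical test illustrating how the skip decorators from testing_utils compose.
import unittest

from transformers import AutoModelForCausalLM
from transformers.testing_utils import require_bitsandbytes, require_peft, require_torch_gpu


class PeftQuantizedLoadingTest(unittest.TestCase):
    @require_torch_gpu      # skipped when no CUDA device is available
    @require_bitsandbytes   # skipped when bitsandbytes is not installed
    @require_peft           # skipped when peft is not installed
    def test_load_base_model_in_8bit(self):
        # Loading an 8-bit base model needs both a GPU and bitsandbytes,
        # hence the decorators above.
        model = AutoModelForCausalLM.from_pretrained(
            "facebook/opt-350m", load_in_8bit=True, device_map="auto"
        )
        self.assertIsNotNone(model)
```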
From b647acdb53d251cec126b79e505bac11821d7c93 Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Thu, 29 Feb 2024 04:49:01 +0100
Subject: [PATCH 050/549] FIX [`CI`] `require_read_token` in the llama FA2 test
(#29361)
Update test_modeling_llama.py
---
tests/models/llama/test_modeling_llama.py | 1 +
1 file changed, 1 insertion(+)
diff --git a/tests/models/llama/test_modeling_llama.py b/tests/models/llama/test_modeling_llama.py
index 308e5d91195215..02c649c39aa0d4 100644
--- a/tests/models/llama/test_modeling_llama.py
+++ b/tests/models/llama/test_modeling_llama.py
@@ -398,6 +398,7 @@ def test_model_rope_scaling(self, scaling_type):
@require_torch_gpu
@require_bitsandbytes
@pytest.mark.flash_attn_test
+ @require_read_token
@slow
def test_flash_attn_2_generate_padding_right(self):
"""
From 44fe1a1cc41620e807813168ce66b5ced1c3ad9f Mon Sep 17 00:00:00 2001
From: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Date: Thu, 29 Feb 2024 17:19:17 +0800
Subject: [PATCH 051/549] Avoid using unnecessary `get_values(MODEL_MAPPING)`
(#29362)
* more fixes
* more fixes
---------
Co-authored-by: ydshieh
---
tests/models/beit/test_modeling_beit.py | 20 ++++----
tests/models/clipseg/test_modeling_clipseg.py | 6 +--
.../data2vec/test_modeling_data2vec_vision.py | 7 ++-
tests/models/deit/test_modeling_deit.py | 19 +++----
tests/models/dpt/test_modeling_dpt.py | 8 +--
.../dpt/test_modeling_dpt_auto_backbone.py | 8 +--
tests/models/dpt/test_modeling_dpt_hybrid.py | 8 +--
.../test_modeling_efficientformer.py | 13 ++---
tests/models/glpn/test_modeling_glpn.py | 6 +--
tests/models/levit/test_modeling_levit.py | 15 +++---
.../perceiver/test_modeling_perceiver.py | 51 +++++++++++--------
tests/models/pvt/test_modeling_pvt.py | 6 +--
.../segformer/test_modeling_segformer.py | 5 +-
tests/models/vilt/test_modeling_vilt.py | 7 ++-
14 files changed, 94 insertions(+), 85 deletions(-)
diff --git a/tests/models/beit/test_modeling_beit.py b/tests/models/beit/test_modeling_beit.py
index 40b0d6aa0bd38d..f82cf40cdadcb4 100644
--- a/tests/models/beit/test_modeling_beit.py
+++ b/tests/models/beit/test_modeling_beit.py
@@ -21,7 +21,6 @@
from packaging import version
from transformers import BeitConfig
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_torch_multi_gpu, require_vision, slow, torch_device
from transformers.utils import cached_property, is_torch_available, is_vision_available
@@ -36,14 +35,13 @@
from torch import nn
from transformers import (
- MODEL_FOR_BACKBONE_MAPPING,
- MODEL_MAPPING,
BeitBackbone,
BeitForImageClassification,
BeitForMaskedImageModeling,
BeitForSemanticSegmentation,
BeitModel,
)
+ from transformers.models.auto.modeling_auto import MODEL_FOR_BACKBONE_MAPPING_NAMES, MODEL_MAPPING_NAMES
from transformers.models.beit.modeling_beit import BEIT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -312,10 +310,10 @@ def test_training(self):
for model_class in self.all_model_classes:
# we don't test BeitForMaskedImageModeling
- if model_class in [
- *get_values(MODEL_MAPPING),
- *get_values(MODEL_FOR_BACKBONE_MAPPING),
- BeitForMaskedImageModeling,
+ if model_class.__name__ in [
+ *MODEL_MAPPING_NAMES.values(),
+ *MODEL_FOR_BACKBONE_MAPPING_NAMES.values(),
+ "BeitForMaskedImageModeling",
]:
continue
@@ -337,8 +335,12 @@ def test_training_gradient_checkpointing(self):
for model_class in self.all_model_classes:
# we don't test BeitForMaskedImageModeling
if (
- model_class
- in [*get_values(MODEL_MAPPING), *get_values(MODEL_FOR_BACKBONE_MAPPING), BeitForMaskedImageModeling]
+ model_class.__name__
+ in [
+ *MODEL_MAPPING_NAMES.values(),
+ *MODEL_FOR_BACKBONE_MAPPING_NAMES.values(),
+ "BeitForMaskedImageModeling",
+ ]
or not model_class.supports_gradient_checkpointing
):
continue
diff --git a/tests/models/clipseg/test_modeling_clipseg.py b/tests/models/clipseg/test_modeling_clipseg.py
index 0ebf08da89f9a5..f8e05caa1e15b6 100644
--- a/tests/models/clipseg/test_modeling_clipseg.py
+++ b/tests/models/clipseg/test_modeling_clipseg.py
@@ -24,8 +24,7 @@
import requests
import transformers
-from transformers import MODEL_MAPPING, CLIPSegConfig, CLIPSegProcessor, CLIPSegTextConfig, CLIPSegVisionConfig
-from transformers.models.auto import get_values
+from transformers import CLIPSegConfig, CLIPSegProcessor, CLIPSegTextConfig, CLIPSegVisionConfig
from transformers.testing_utils import (
is_flax_available,
is_pt_flax_cross_test,
@@ -52,6 +51,7 @@
from torch import nn
from transformers import CLIPSegForImageSegmentation, CLIPSegModel, CLIPSegTextModel, CLIPSegVisionModel
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.clipseg.modeling_clipseg import CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -751,7 +751,7 @@ def test_training(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
print("Model class:", model_class)
diff --git a/tests/models/data2vec/test_modeling_data2vec_vision.py b/tests/models/data2vec/test_modeling_data2vec_vision.py
index 20733cb2e428f6..3e00dd0bf314d4 100644
--- a/tests/models/data2vec/test_modeling_data2vec_vision.py
+++ b/tests/models/data2vec/test_modeling_data2vec_vision.py
@@ -18,7 +18,6 @@
import unittest
from transformers import Data2VecVisionConfig
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_torch_multi_gpu, require_vision, slow, torch_device
from transformers.utils import cached_property, is_torch_available, is_vision_available
@@ -32,11 +31,11 @@
from torch import nn
from transformers import (
- MODEL_MAPPING,
Data2VecVisionForImageClassification,
Data2VecVisionForSemanticSegmentation,
Data2VecVisionModel,
)
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.data2vec.modeling_data2vec_vision import DATA2VEC_VISION_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -235,7 +234,7 @@ def test_training(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in [*get_values(MODEL_MAPPING)]:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
model = model_class(config)
@@ -254,7 +253,7 @@ def test_training_gradient_checkpointing(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in [*get_values(MODEL_MAPPING)] or not model_class.supports_gradient_checkpointing:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values() or not model_class.supports_gradient_checkpointing:
continue
# TODO: remove the following 3 lines once we have a MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING
# this can then be incorporated into _prepare_for_class in test_modeling_common.py
diff --git a/tests/models/deit/test_modeling_deit.py b/tests/models/deit/test_modeling_deit.py
index 87ac1690966003..07f581bfeb2b9b 100644
--- a/tests/models/deit/test_modeling_deit.py
+++ b/tests/models/deit/test_modeling_deit.py
@@ -19,7 +19,6 @@
import warnings
from transformers import DeiTConfig
-from transformers.models.auto import get_values
from transformers.testing_utils import (
require_accelerate,
require_torch,
@@ -41,14 +40,16 @@
from torch import nn
from transformers import (
- MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
- MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
- MODEL_MAPPING,
DeiTForImageClassification,
DeiTForImageClassificationWithTeacher,
DeiTForMaskedImageModeling,
DeiTModel,
)
+ from transformers.models.auto.modeling_auto import (
+ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_MAPPING_NAMES,
+ )
from transformers.models.deit.modeling_deit import DEIT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -269,7 +270,7 @@ def test_training(self):
for model_class in self.all_model_classes:
# DeiTForImageClassificationWithTeacher supports inference-only
if (
- model_class in get_values(MODEL_MAPPING)
+ model_class.__name__ in MODEL_MAPPING_NAMES.values()
or model_class.__name__ == "DeiTForImageClassificationWithTeacher"
):
continue
@@ -289,7 +290,7 @@ def test_training_gradient_checkpointing(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in get_values(MODEL_MAPPING) or not model_class.supports_gradient_checkpointing:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values() or not model_class.supports_gradient_checkpointing:
continue
# DeiTForImageClassificationWithTeacher supports inference-only
if model_class.__name__ == "DeiTForImageClassificationWithTeacher":
@@ -325,10 +326,10 @@ def test_problem_types(self):
for model_class in self.all_model_classes:
if (
- model_class
+ model_class.__name__
not in [
- *get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
- *get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
+ *MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES.values(),
+ *MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES.values(),
]
or model_class.__name__ == "DeiTForImageClassificationWithTeacher"
):
diff --git a/tests/models/dpt/test_modeling_dpt.py b/tests/models/dpt/test_modeling_dpt.py
index 2c092062791f7d..ffd6edbad4bff1 100644
--- a/tests/models/dpt/test_modeling_dpt.py
+++ b/tests/models/dpt/test_modeling_dpt.py
@@ -19,7 +19,6 @@
from transformers import DPTConfig
from transformers.file_utils import is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -31,7 +30,8 @@
import torch
from torch import nn
- from transformers import MODEL_MAPPING, DPTForDepthEstimation, DPTForSemanticSegmentation, DPTModel
+ from transformers import DPTForDepthEstimation, DPTForSemanticSegmentation, DPTModel
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.dpt.modeling_dpt import DPT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -214,7 +214,7 @@ def test_training(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
model = model_class(config)
@@ -233,7 +233,7 @@ def test_training_gradient_checkpointing(self):
config.use_cache = False
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING) or not model_class.supports_gradient_checkpointing:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values() or not model_class.supports_gradient_checkpointing:
continue
model = model_class(config)
model.to(torch_device)
diff --git a/tests/models/dpt/test_modeling_dpt_auto_backbone.py b/tests/models/dpt/test_modeling_dpt_auto_backbone.py
index b2408465e4aae2..ea500b47a3c88a 100644
--- a/tests/models/dpt/test_modeling_dpt_auto_backbone.py
+++ b/tests/models/dpt/test_modeling_dpt_auto_backbone.py
@@ -19,7 +19,6 @@
from transformers import Dinov2Config, DPTConfig
from transformers.file_utils import is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -30,7 +29,8 @@
if is_torch_available():
import torch
- from transformers import MODEL_MAPPING, DPTForDepthEstimation
+ from transformers import DPTForDepthEstimation
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.dpt.modeling_dpt import DPT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -166,7 +166,7 @@ def test_training(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
model = model_class(config)
@@ -185,7 +185,7 @@ def test_training_gradient_checkpointing(self):
config.use_cache = False
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING) or not model_class.supports_gradient_checkpointing:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values() or not model_class.supports_gradient_checkpointing:
continue
model = model_class(config)
model.to(torch_device)
diff --git a/tests/models/dpt/test_modeling_dpt_hybrid.py b/tests/models/dpt/test_modeling_dpt_hybrid.py
index 2621c7438bd6da..13a0cf4db8ca67 100644
--- a/tests/models/dpt/test_modeling_dpt_hybrid.py
+++ b/tests/models/dpt/test_modeling_dpt_hybrid.py
@@ -19,7 +19,6 @@
from transformers import DPTConfig
from transformers.file_utils import is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -31,7 +30,8 @@
import torch
from torch import nn
- from transformers import MODEL_MAPPING, DPTForDepthEstimation, DPTForSemanticSegmentation, DPTModel
+ from transformers import DPTForDepthEstimation, DPTForSemanticSegmentation, DPTModel
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.dpt.modeling_dpt import DPT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -229,7 +229,7 @@ def test_training(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
model = model_class(config)
@@ -248,7 +248,7 @@ def test_training_gradient_checkpointing(self):
config.use_cache = False
config.return_dict = True
- if model_class in get_values(MODEL_MAPPING) or not model_class.supports_gradient_checkpointing:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values() or not model_class.supports_gradient_checkpointing:
continue
model = model_class(config)
model.to(torch_device)
diff --git a/tests/models/efficientformer/test_modeling_efficientformer.py b/tests/models/efficientformer/test_modeling_efficientformer.py
index 2d6176960a5c5f..070c7fccae6053 100644
--- a/tests/models/efficientformer/test_modeling_efficientformer.py
+++ b/tests/models/efficientformer/test_modeling_efficientformer.py
@@ -20,7 +20,6 @@
from typing import List
from transformers import EfficientFormerConfig
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from transformers.utils import cached_property, is_torch_available, is_vision_available
@@ -33,12 +32,14 @@
import torch
from transformers import (
- MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
- MODEL_MAPPING,
EfficientFormerForImageClassification,
EfficientFormerForImageClassificationWithTeacher,
EfficientFormerModel,
)
+ from transformers.models.auto.modeling_auto import (
+ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_MAPPING_NAMES,
+ )
from transformers.models.efficientformer.modeling_efficientformer import (
EFFICIENTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
)
@@ -308,7 +309,7 @@ def test_training(self):
for model_class in self.all_model_classes:
# EfficientFormerForImageClassificationWithTeacher supports inference-only
if (
- model_class in get_values(MODEL_MAPPING)
+ model_class.__name__ in MODEL_MAPPING_NAMES.values()
or model_class.__name__ == "EfficientFormerForImageClassificationWithTeacher"
):
continue
@@ -330,9 +331,9 @@ def test_problem_types(self):
for model_class in self.all_model_classes:
if (
- model_class
+ model_class.__name__
not in [
- *get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
+ *MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES.values(),
]
or model_class.__name__ == "EfficientFormerForImageClassificationWithTeacher"
):
diff --git a/tests/models/glpn/test_modeling_glpn.py b/tests/models/glpn/test_modeling_glpn.py
index 90f8996984d32c..aab49c849101cd 100644
--- a/tests/models/glpn/test_modeling_glpn.py
+++ b/tests/models/glpn/test_modeling_glpn.py
@@ -18,7 +18,6 @@
import unittest
from transformers import is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -29,7 +28,8 @@
if is_torch_available():
import torch
- from transformers import MODEL_MAPPING, GLPNConfig, GLPNForDepthEstimation, GLPNModel
+ from transformers import GLPNConfig, GLPNForDepthEstimation, GLPNModel
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.glpn.modeling_glpn import GLPN_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -291,7 +291,7 @@ def test_training(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
# TODO: remove the following 3 lines once we have a MODEL_FOR_DEPTH_ESTIMATION_MAPPING
# this can then be incorporated into _prepare_for_class in test_modeling_common.py
diff --git a/tests/models/levit/test_modeling_levit.py b/tests/models/levit/test_modeling_levit.py
index b6d9832704a521..fee3eaa086bd73 100644
--- a/tests/models/levit/test_modeling_levit.py
+++ b/tests/models/levit/test_modeling_levit.py
@@ -21,7 +21,6 @@
from transformers import LevitConfig
from transformers.file_utils import cached_property, is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -33,12 +32,14 @@
import torch
from transformers import (
- MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
- MODEL_MAPPING,
LevitForImageClassification,
LevitForImageClassificationWithTeacher,
LevitModel,
)
+ from transformers.models.auto.modeling_auto import (
+ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_MAPPING_NAMES,
+ )
from transformers.models.levit.modeling_levit import LEVIT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -297,7 +298,7 @@ def test_training(self):
for model_class in self.all_model_classes:
# LevitForImageClassificationWithTeacher supports inference-only
if (
- model_class in get_values(MODEL_MAPPING)
+ model_class.__name__ in MODEL_MAPPING_NAMES.values()
or model_class.__name__ == "LevitForImageClassificationWithTeacher"
):
continue
@@ -317,7 +318,7 @@ def test_training_gradient_checkpointing(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in get_values(MODEL_MAPPING) or not model_class.supports_gradient_checkpointing:
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values() or not model_class.supports_gradient_checkpointing:
continue
# LevitForImageClassificationWithTeacher supports inference-only
if model_class.__name__ == "LevitForImageClassificationWithTeacher":
@@ -341,9 +342,9 @@ def test_problem_types(self):
for model_class in self.all_model_classes:
if (
- model_class
+ model_class.__name__
not in [
- *get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
+ *MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES.values(),
]
or model_class.__name__ == "LevitForImageClassificationWithTeacher"
):
diff --git a/tests/models/perceiver/test_modeling_perceiver.py b/tests/models/perceiver/test_modeling_perceiver.py
index aeb9b80debad35..a529c4430ff312 100644
--- a/tests/models/perceiver/test_modeling_perceiver.py
+++ b/tests/models/perceiver/test_modeling_perceiver.py
@@ -26,7 +26,6 @@
from datasets import load_dataset
from transformers import PerceiverConfig
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_torch_multi_gpu, require_vision, slow, torch_device
from transformers.utils import is_torch_available, is_vision_available
@@ -40,11 +39,6 @@
from torch import nn
from transformers import (
- MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
- MODEL_FOR_MASKED_LM_MAPPING,
- MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
- MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
- MODEL_MAPPING,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
@@ -55,6 +49,13 @@
PerceiverModel,
PerceiverTokenizer,
)
+ from transformers.models.auto.modeling_auto import (
+ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_FOR_MASKED_LM_MAPPING_NAMES,
+ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
+ MODEL_MAPPING_NAMES,
+ )
from transformers.models.perceiver.modeling_perceiver import PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -317,16 +318,19 @@ def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict["subsampled_output_points"] = self.model_tester.subsampling
if return_labels:
- if model_class in [
- *get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
- *get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
+ if model_class.__name__ in [
+ *MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES.values(),
+ "PerceiverForImageClassificationLearned",
+ "PerceiverForImageClassificationFourier",
+ "PerceiverForImageClassificationConvProcessing",
+ *MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES.values(),
]:
inputs_dict["labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
- elif model_class in [
- *get_values(MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING),
- *get_values(MODEL_FOR_MASKED_LM_MAPPING),
+ elif model_class.__name__ in [
+ *MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES.values(),
+ *MODEL_FOR_MASKED_LM_MAPPING_NAMES.values(),
]:
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
@@ -380,10 +384,10 @@ def test_training(self):
return
for model_class in self.all_model_classes:
- if model_class in [
- *get_values(MODEL_MAPPING),
- PerceiverForOpticalFlow,
- PerceiverForMultimodalAutoencoding,
+ if model_class.__name__ in [
+ *MODEL_MAPPING_NAMES.values(),
+ "PerceiverForOpticalFlow",
+ "PerceiverForMultimodalAutoencoding",
]:
continue
@@ -727,11 +731,14 @@ def test_correct_missing_keys(self):
for model_class in self.all_model_classes:
# most Perceiver models don't have a typical head like is the case with BERT
- if model_class in [
- PerceiverForOpticalFlow,
- PerceiverForMultimodalAutoencoding,
- *get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
- *get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
+ if model_class.__name__ in [
+ "PerceiverForOpticalFlow",
+ "PerceiverForMultimodalAutoencoding",
+ *MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES.values(),
+ "PerceiverForImageClassificationLearned",
+ "PerceiverForImageClassificationFourier",
+ "PerceiverForImageClassificationConvProcessing",
+ *MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES.values(),
]:
continue
@@ -753,7 +760,7 @@ def test_problem_types(self):
]
for model_class in self.all_model_classes:
- if model_class not in get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING):
+ if model_class.__name__ not in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES.values():
continue
config, inputs, input_mask, _, _ = self.model_tester.prepare_config_and_inputs(model_class=model_class)
diff --git a/tests/models/pvt/test_modeling_pvt.py b/tests/models/pvt/test_modeling_pvt.py
index d17041ecfaa55f..3b8c917f1d7592 100644
--- a/tests/models/pvt/test_modeling_pvt.py
+++ b/tests/models/pvt/test_modeling_pvt.py
@@ -18,7 +18,6 @@
import unittest
from transformers import is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import (
require_accelerate,
require_torch,
@@ -36,7 +35,8 @@
if is_torch_available():
import torch
- from transformers import MODEL_MAPPING, PvtConfig, PvtForImageClassification, PvtImageProcessor, PvtModel
+ from transformers import PvtConfig, PvtForImageClassification, PvtImageProcessor, PvtModel
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.pvt.modeling_pvt import PVT_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -243,7 +243,7 @@ def test_training(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
model = model_class(config)
model.to(torch_device)
diff --git a/tests/models/segformer/test_modeling_segformer.py b/tests/models/segformer/test_modeling_segformer.py
index 8cb7cbad42f2d0..de64de5ad1b976 100644
--- a/tests/models/segformer/test_modeling_segformer.py
+++ b/tests/models/segformer/test_modeling_segformer.py
@@ -18,7 +18,6 @@
import unittest
from transformers import SegformerConfig, is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -30,11 +29,11 @@
import torch
from transformers import (
- MODEL_MAPPING,
SegformerForImageClassification,
SegformerForSemanticSegmentation,
SegformerModel,
)
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.segformer.modeling_segformer import SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST
@@ -324,7 +323,7 @@ def test_training(self):
config.return_dict = True
for model_class in self.all_model_classes:
- if model_class in get_values(MODEL_MAPPING):
+ if model_class.__name__ in MODEL_MAPPING_NAMES.values():
continue
model = model_class(config)
diff --git a/tests/models/vilt/test_modeling_vilt.py b/tests/models/vilt/test_modeling_vilt.py
index e17d6ce61b302f..f885afab08678c 100644
--- a/tests/models/vilt/test_modeling_vilt.py
+++ b/tests/models/vilt/test_modeling_vilt.py
@@ -20,7 +20,6 @@
from packaging import version
from transformers import ViltConfig, is_torch_available, is_vision_available
-from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from transformers.utils import cached_property
@@ -33,7 +32,6 @@
import torch
from transformers import (
- MODEL_MAPPING,
ViltForImageAndTextRetrieval,
ViltForImagesAndTextClassification,
ViltForMaskedLM,
@@ -41,6 +39,7 @@
ViltForTokenClassification,
ViltModel,
)
+ from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES
from transformers.models.vilt.modeling_vilt import VILT_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
@@ -284,7 +283,7 @@ def test_training(self):
config.modality_type_vocab_size = 3
# ViltForImageAndTextRetrieval doesn't support training for now
- if model_class in [*get_values(MODEL_MAPPING), ViltForImageAndTextRetrieval]:
+ if model_class.__name__ in [*MODEL_MAPPING_NAMES.values(), "ViltForImageAndTextRetrieval"]:
continue
model = model_class(config)
@@ -307,7 +306,7 @@ def test_training_gradient_checkpointing(self):
# ViltForImageAndTextRetrieval doesn't support training for now
if (
- model_class in [*get_values(MODEL_MAPPING), ViltForImageAndTextRetrieval]
+ model_class.__name__ in [*MODEL_MAPPING_NAMES.values(), "ViltForImageAndTextRetrieval"]
or not model_class.supports_gradient_checkpointing
):
continue
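Every hunk in this patch applies the same refactor: instead of materializing model classes via `get_values(MODEL_MAPPING)`, the tests compare `model_class.__name__` against the lazy `*_MAPPING_NAMES` dictionaries. A condensed sketch of the pattern, with a hypothetical helper name standing in for the test loops:

```python
# Condensed sketch of the pattern the patch standardizes on; the helper name is
# hypothetical, the mapping import is the one used throughout the diff.
from transformers.models.auto.modeling_auto import MODEL_MAPPING_NAMES


def should_skip_training_test(model_class, extra_skips=()):
    # Comparing class *names* against the name mapping avoids importing and
    # instantiating every class that `get_values(MODEL_MAPPING)` would resolve.
    return model_class.__name__ in [*MODEL_MAPPING_NAMES.values(), *extra_skips]
```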
From bb4f816ad4993a5ed15f8cfd7dae67573c88e1d7 Mon Sep 17 00:00:00 2001
From: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Date: Thu, 29 Feb 2024 11:09:50 +0100
Subject: [PATCH 052/549] Patch YOLOS and others (#29353)
Guard accelerate-only distributed logic (PartialState / reduce) behind is_accelerate_available() so loss normalization does not fail when accelerate is not installed
---
.../conditional_detr/modeling_conditional_detr.py | 7 ++++---
.../deformable_detr/modeling_deformable_detr.py | 7 ++++---
src/transformers/models/deta/modeling_deta.py | 7 ++++---
src/transformers/models/detr/modeling_detr.py | 7 ++++---
.../models/mask2former/modeling_mask2former.py | 13 +++++++------
.../models/maskformer/modeling_maskformer.py | 13 +++++++------
.../models/oneformer/modeling_oneformer.py | 13 +++++++------
.../table_transformer/modeling_table_transformer.py | 7 ++++---
src/transformers/models/yolos/modeling_yolos.py | 7 ++++---
9 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/src/transformers/models/conditional_detr/modeling_conditional_detr.py b/src/transformers/models/conditional_detr/modeling_conditional_detr.py
index 2a5e06ea2b4abc..b6ea7cdf4cc3af 100644
--- a/src/transformers/models/conditional_detr/modeling_conditional_detr.py
+++ b/src/transformers/models/conditional_detr/modeling_conditional_detr.py
@@ -2514,9 +2514,10 @@ def forward(self, outputs, targets):
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
world_size = 1
- if PartialState._shared_state != {}:
- num_boxes = reduce(num_boxes)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_boxes = reduce(num_boxes)
+ world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
# Compute all the requested losses
diff --git a/src/transformers/models/deformable_detr/modeling_deformable_detr.py b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
index e9252167e7b4b1..1b6222c4cfc413 100755
--- a/src/transformers/models/deformable_detr/modeling_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
@@ -2282,9 +2282,10 @@ def forward(self, outputs, targets):
num_boxes = sum(len(t["class_labels"]) for t in targets)
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
world_size = 1
- if PartialState._shared_state != {}:
- num_boxes = reduce(num_boxes)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_boxes = reduce(num_boxes)
+ world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
# Compute all the requested losses
diff --git a/src/transformers/models/deta/modeling_deta.py b/src/transformers/models/deta/modeling_deta.py
index 5d0b48b45d13ac..0c2dfdf3b0a24c 100644
--- a/src/transformers/models/deta/modeling_deta.py
+++ b/src/transformers/models/deta/modeling_deta.py
@@ -2345,9 +2345,10 @@ def forward(self, outputs, targets):
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
# Check that we have initialized the distributed state
world_size = 1
- if PartialState._shared_state != {}:
- num_boxes = reduce(num_boxes)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_boxes = reduce(num_boxes)
+ world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
# Compute all the requested losses
diff --git a/src/transformers/models/detr/modeling_detr.py b/src/transformers/models/detr/modeling_detr.py
index 0fa912eb1d5192..1e548b61d3a7d2 100644
--- a/src/transformers/models/detr/modeling_detr.py
+++ b/src/transformers/models/detr/modeling_detr.py
@@ -2210,9 +2210,10 @@ def forward(self, outputs, targets):
num_boxes = sum(len(t["class_labels"]) for t in targets)
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
world_size = 1
- if PartialState._shared_state != {}:
- num_boxes = reduce(num_boxes)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_boxes = reduce(num_boxes)
+ world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
# Compute all the requested losses
diff --git a/src/transformers/models/mask2former/modeling_mask2former.py b/src/transformers/models/mask2former/modeling_mask2former.py
index bf86b5ba6039e6..3e82cebb1dc9d0 100644
--- a/src/transformers/models/mask2former/modeling_mask2former.py
+++ b/src/transformers/models/mask2former/modeling_mask2former.py
@@ -791,14 +791,15 @@ def get_num_masks(self, class_labels: torch.Tensor, device: torch.device) -> tor
Computes the average number of target masks across the batch, for normalization purposes.
"""
num_masks = sum([len(classes) for classes in class_labels])
- num_masks_pt = torch.as_tensor(num_masks, dtype=torch.float, device=device)
+ num_masks = torch.as_tensor(num_masks, dtype=torch.float, device=device)
world_size = 1
- if PartialState._shared_state != {}:
- num_masks_pt = reduce(num_masks_pt)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_masks = reduce(num_masks)
+ world_size = PartialState().num_processes
- num_masks_pt = torch.clamp(num_masks_pt / world_size, min=1)
- return num_masks_pt
+ num_masks = torch.clamp(num_masks / world_size, min=1)
+ return num_masks
# Copied from transformers.models.deformable_detr.modeling_deformable_detr.multi_scale_deformable_attention
diff --git a/src/transformers/models/maskformer/modeling_maskformer.py b/src/transformers/models/maskformer/modeling_maskformer.py
index f2b171b32dc9e4..1addaae323dcd4 100644
--- a/src/transformers/models/maskformer/modeling_maskformer.py
+++ b/src/transformers/models/maskformer/modeling_maskformer.py
@@ -1198,14 +1198,15 @@ def get_num_masks(self, class_labels: torch.Tensor, device: torch.device) -> tor
Computes the average number of target masks across the batch, for normalization purposes.
"""
num_masks = sum([len(classes) for classes in class_labels])
- num_masks_pt = torch.as_tensor(num_masks, dtype=torch.float, device=device)
+ num_masks = torch.as_tensor(num_masks, dtype=torch.float, device=device)
world_size = 1
- if PartialState._shared_state != {}:
- num_masks_pt = reduce(num_masks_pt)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_masks = reduce(num_masks)
+ world_size = PartialState().num_processes
- num_masks_pt = torch.clamp(num_masks_pt / world_size, min=1)
- return num_masks_pt
+ num_masks = torch.clamp(num_masks / world_size, min=1)
+ return num_masks
class MaskFormerFPNConvLayer(nn.Module):
diff --git a/src/transformers/models/oneformer/modeling_oneformer.py b/src/transformers/models/oneformer/modeling_oneformer.py
index 586fd7345c5645..f8f61c52dd05cd 100644
--- a/src/transformers/models/oneformer/modeling_oneformer.py
+++ b/src/transformers/models/oneformer/modeling_oneformer.py
@@ -727,14 +727,15 @@ def get_num_masks(self, class_labels: torch.Tensor, device: torch.device) -> tor
Computes the average number of target masks across the batch, for normalization purposes.
"""
num_masks = sum([len(classes) for classes in class_labels])
- num_masks_pt = torch.as_tensor([num_masks], dtype=torch.float, device=device)
+ num_masks = torch.as_tensor([num_masks], dtype=torch.float, device=device)
world_size = 1
- if PartialState._shared_state != {}:
- num_masks_pt = reduce(num_masks_pt)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_masks = reduce(num_masks)
+ world_size = PartialState().num_processes
- num_masks_pt = torch.clamp(num_masks_pt / world_size, min=1)
- return num_masks_pt
+ num_masks = torch.clamp(num_masks / world_size, min=1)
+ return num_masks
@dataclass
diff --git a/src/transformers/models/table_transformer/modeling_table_transformer.py b/src/transformers/models/table_transformer/modeling_table_transformer.py
index 8a16917c3c76b8..7f86b0ab53320b 100644
--- a/src/transformers/models/table_transformer/modeling_table_transformer.py
+++ b/src/transformers/models/table_transformer/modeling_table_transformer.py
@@ -1757,9 +1757,10 @@ def forward(self, outputs, targets):
num_boxes = sum(len(t["class_labels"]) for t in targets)
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
world_size = 1
- if PartialState._shared_state != {}:
- num_boxes = reduce(num_boxes)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_boxes = reduce(num_boxes)
+ world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
# Compute all the requested losses
diff --git a/src/transformers/models/yolos/modeling_yolos.py b/src/transformers/models/yolos/modeling_yolos.py
index 237429ae707d4c..86ab375cdf8346 100755
--- a/src/transformers/models/yolos/modeling_yolos.py
+++ b/src/transformers/models/yolos/modeling_yolos.py
@@ -1079,9 +1079,10 @@ def forward(self, outputs, targets):
num_boxes = sum(len(t["class_labels"]) for t in targets)
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
world_size = 1
- if PartialState._shared_state != {}:
- num_boxes = reduce(num_boxes)
- world_size = PartialState().num_processes
+ if is_accelerate_available():
+ if PartialState._shared_state != {}:
+ num_boxes = reduce(num_boxes)
+ world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
# Compute all the requested losses
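Each hunk above wraps the same few lines, so in isolation the guarded normalization reads roughly as follows; the helper name is hypothetical, and `reduce` refers to `accelerate.utils.reduce` as in the modeling files.

```python
# Standalone sketch of the guarded normalization; not a drop-in replacement for
# any specific modeling file.
import torch

from transformers.utils import is_accelerate_available

if is_accelerate_available():
    from accelerate import PartialState
    from accelerate.utils import reduce


def normalize_num_boxes(num_boxes: torch.Tensor) -> float:
    world_size = 1
    # Only touch accelerate state when the library is importable *and* its
    # distributed state has actually been initialized.
    if is_accelerate_available() and PartialState._shared_state != {}:
        num_boxes = reduce(num_boxes)
        world_size = PartialState().num_processes
    return torch.clamp(num_boxes / world_size, min=1).item()
```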
From 0ad770c3733f9478a8d9d0bc18cc6143877b47a2 Mon Sep 17 00:00:00 2001
From: Lucain
Date: Thu, 29 Feb 2024 11:25:16 +0100
Subject: [PATCH 053/549] Fix @require_read_token in tests (#29367)
---
src/transformers/testing_utils.py | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py
index 0efaf77a98a73c..fd5974d8d5bb54 100644
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -38,7 +38,6 @@
from unittest import mock
from unittest.mock import patch
-import huggingface_hub
import urllib3
from transformers import logging as transformers_logging
@@ -466,11 +465,11 @@ def require_read_token(fn):
"""
A decorator that loads the HF token for tests that require to load gated models.
"""
- token = os.getenv("HF_HUB_READ_TOKEN", None)
+ token = os.getenv("HF_HUB_READ_TOKEN")
@wraps(fn)
def _inner(*args, **kwargs):
- with patch.object(huggingface_hub.utils._headers, "get_token", return_value=token):
+ with patch("huggingface_hub.utils._headers.get_token", return_value=token):
return fn(*args, **kwargs)
return _inner
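With the string target, `huggingface_hub` is only resolved when the patched test actually runs, which is why the module-level import could be dropped. Illustrative usage on a gated-checkpoint test (the test itself is hypothetical):

```python
# Hypothetical test showing how the fixed decorator is meant to be used; the gated
# checkpoint requires HF_HUB_READ_TOKEN to be set in the environment.
from transformers import AutoTokenizer
from transformers.testing_utils import require_read_token, slow


@require_read_token
@slow
def test_gated_tokenizer_loads():
    # The decorator patches huggingface_hub's get_token so the read token is used
    # for the authenticated download inside the test body.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    assert tokenizer is not None
```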
From 5ee0868a4b1193ef8a0f5cc7db02a9bfe5b07038 Mon Sep 17 00:00:00 2001
From: Song Fuchang
Date: Fri, 1 Mar 2024 09:12:51 +0800
Subject: [PATCH 054/549] Expose `offload_buffers` parameter of `accelerate` to
`PreTrainedModel.from_pretrained` method (#28755)
Expose offload_buffers parameter to from_pretrained method
---
src/transformers/modeling_utils.py | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 38dde4ec91e267..7bda8a20165b5e 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -2745,6 +2745,8 @@ def from_pretrained(
If `True`, will temporarily offload the CPU state dict to the hard drive to avoid getting out of CPU
RAM if the weight of the CPU state dict + the biggest shard of the checkpoint does not fit. Defaults to
`True` when there is some disk offload.
+ offload_buffers (`bool`, *optional*):
+ Whether or not to offload the buffers with the model parameters.
quantization_config (`Union[QuantizationConfigMixin,Dict]`, *optional*):
A dictionary of configuration parameters or a QuantizationConfigMixin object for quantization (e.g
bitsandbytes, gptq). There may be other quantization-related kwargs, including `load_in_4bit` and
@@ -2835,6 +2837,7 @@ def from_pretrained(
max_memory = kwargs.pop("max_memory", None)
offload_folder = kwargs.pop("offload_folder", None)
offload_state_dict = kwargs.pop("offload_state_dict", False)
+ offload_buffers = kwargs.pop("offload_buffers", False)
load_in_8bit = kwargs.pop("load_in_8bit", False)
load_in_4bit = kwargs.pop("load_in_4bit", False)
quantization_config = kwargs.pop("quantization_config", None)
@@ -3554,6 +3557,7 @@ def from_pretrained(
"device_map": device_map,
"offload_dir": offload_folder,
"offload_index": offload_index,
+ "offload_buffers": offload_buffers,
}
if "skip_keys" in inspect.signature(dispatch_model).parameters:
device_map_kwargs["skip_keys"] = model._skip_keys_device_placement
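A hedged usage sketch of the newly exposed kwarg; the checkpoint, device map and offload folder are illustrative choices, not requirements.

```python
# Illustrative only: forwarding the new offload_buffers kwarg through from_pretrained.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",          # let accelerate place weights across devices
    offload_folder="offload",   # spill weights that do not fit to disk
    offload_buffers=True,       # now forwarded to accelerate's dispatch_model
)
```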
From 2858d6c634b7ba3348abecdfd2cc403e50991929 Mon Sep 17 00:00:00 2001
From: Leon Engländer
Date: Fri, 1 Mar 2024 02:58:19 +0100
Subject: [PATCH 055/549] Fix Base Model Name of LlamaForQuestionAnswering
(#29258)
* LlamaForQuestionAnswering self.transformer->self.model
* fix "Copied from" string
* Llama QA model: set base_model_prefix = "transformer"
---
src/transformers/models/llama/modeling_llama.py | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 4ea8a208a92315..43de19d329e936 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -1454,6 +1454,8 @@ def forward(
LLAMA_START_DOCSTRING,
)
class LlamaForQuestionAnswering(LlamaPreTrainedModel):
+ base_model_prefix = "transformer"
+
# Copied from transformers.models.bloom.modeling_bloom.BloomForQuestionAnswering.__init__ with Bloom->Llama
def __init__(self, config):
super().__init__(config)
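`from_pretrained` and the `base_model` property both key off `base_model_prefix`; since the QA head (copied from Bloom) stores its backbone as `self.transformer`, the prefix must be "transformer" rather than Llama's default "model". A small illustrative check, assuming the tiny test checkpoint named below is available:

```python
# Illustrative only; the tiny checkpoint is an assumption used to keep the download small.
from transformers import LlamaForQuestionAnswering

model = LlamaForQuestionAnswering.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM"
)
# With base_model_prefix = "transformer", `base_model` resolves to the backbone
# stored under `self.transformer` instead of a non-existent `self.model`.
print(model.base_model.__class__.__name__)  # LlamaModel
```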
From 50db7ca4e874e211dd18d9b9ee429f62ef7d7e8f Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Fri, 1 Mar 2024 03:01:53 +0100
Subject: [PATCH 056/549] FIX [`quantization` / `ESM`] Fix ESM 8bit / 4bit with
bitsandbytes (#29329)
* fix ESM 8bit
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fixup
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
src/transformers/models/esm/modeling_esm.py | 2 +-
.../quantizers/quantizer_bnb_4bit.py | 2 +-
.../quantizers/quantizer_bnb_8bit.py | 2 +-
tests/models/esm/test_modeling_esm.py | 20 ++++++++++++++++---
4 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/src/transformers/models/esm/modeling_esm.py b/src/transformers/models/esm/modeling_esm.py
index 57c436224099cc..2349ce580023d4 100755
--- a/src/transformers/models/esm/modeling_esm.py
+++ b/src/transformers/models/esm/modeling_esm.py
@@ -377,7 +377,7 @@ def forward(
if head_mask is not None:
attention_probs = attention_probs * head_mask
- context_layer = torch.matmul(attention_probs, value_layer)
+ context_layer = torch.matmul(attention_probs.to(value_layer.dtype), value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
diff --git a/src/transformers/quantizers/quantizer_bnb_4bit.py b/src/transformers/quantizers/quantizer_bnb_4bit.py
index 6cea1b5512392d..494bf1382e9f77 100644
--- a/src/transformers/quantizers/quantizer_bnb_4bit.py
+++ b/src/transformers/quantizers/quantizer_bnb_4bit.py
@@ -121,7 +121,7 @@ def check_quantized_param(
import bitsandbytes as bnb
module, tensor_name = get_module_from_name(model, param_name)
- if isinstance(module._parameters[tensor_name], bnb.nn.Params4bit):
+ if isinstance(module._parameters.get(tensor_name, None), bnb.nn.Params4bit):
# Add here check for loaded components' dtypes once serialization is implemented
return True
elif isinstance(module, bnb.nn.Linear4bit) and tensor_name == "bias":
diff --git a/src/transformers/quantizers/quantizer_bnb_8bit.py b/src/transformers/quantizers/quantizer_bnb_8bit.py
index 193da44d2c855f..cc6942857af8f6 100644
--- a/src/transformers/quantizers/quantizer_bnb_8bit.py
+++ b/src/transformers/quantizers/quantizer_bnb_8bit.py
@@ -139,7 +139,7 @@ def check_quantized_param(
import bitsandbytes as bnb
module, tensor_name = get_module_from_name(model, param_name)
- if isinstance(module._parameters[tensor_name], bnb.nn.Int8Params):
+ if isinstance(module._parameters.get(tensor_name, None), bnb.nn.Int8Params):
if self.pre_quantized:
if param_name.replace("weight", "SCB") not in state_dict.keys():
raise ValueError("Missing quantization component `SCB`")
diff --git a/tests/models/esm/test_modeling_esm.py b/tests/models/esm/test_modeling_esm.py
index d09326df606b34..7e99f86bbf626b 100644
--- a/tests/models/esm/test_modeling_esm.py
+++ b/tests/models/esm/test_modeling_esm.py
@@ -18,7 +18,7 @@
import unittest
from transformers import EsmConfig, is_torch_available
-from transformers.testing_utils import TestCasePlus, require_torch, slow, torch_device
+from transformers.testing_utils import TestCasePlus, require_bitsandbytes, require_torch, slow, torch_device
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
@@ -303,9 +303,9 @@ def test_resize_tokens_embeddings(self):
pass
+@slow
@require_torch
class EsmModelIntegrationTest(TestCasePlus):
- @slow
def test_inference_masked_lm(self):
with torch.no_grad():
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")
@@ -323,7 +323,6 @@ def test_inference_masked_lm(self):
)
self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4))
- @slow
def test_inference_no_head(self):
with torch.no_grad():
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
@@ -336,3 +335,18 @@ def test_inference_no_head(self):
[[[0.1444, 0.5413, 0.3248], [0.3034, 0.0053, 0.3108], [0.3228, -0.2499, 0.3415]]]
)
self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4))
+
+ @require_bitsandbytes
+ def test_inference_bitsandbytes(self):
+ model = EsmForMaskedLM.from_pretrained("facebook/esm2_t36_3B_UR50D", load_in_8bit=True)
+
+ input_ids = torch.tensor([[0, 6, 4, 13, 5, 4, 16, 12, 11, 7, 2]])
+ # Just test if inference works
+ with torch.no_grad():
+ _ = model(input_ids)[0]
+
+ model = EsmForMaskedLM.from_pretrained("facebook/esm2_t36_3B_UR50D", load_in_4bit=True)
+
+ input_ids = torch.tensor([[0, 6, 4, 13, 5, 4, 16, 12, 11, 7, 2]])
+ # Just test if inference works
+ _ = model(input_ids)[0]
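Besides the defensive `._parameters.get(...)` lookups in the quantizers, the key modeling change is casting the softmax output to the value tensor's dtype before the matmul, which is what breaks under 8-bit/4-bit loading when the two end up in different precisions. A toy reproduction of the mismatch, independent of ESM:

```python
# Toy illustration of the dtype mismatch the cast avoids; not ESM-specific.
import torch

attention_probs = torch.softmax(torch.randn(1, 4, 4, dtype=torch.float32), dim=-1)
value_layer = torch.randn(1, 4, 8, dtype=torch.float16)

# torch.matmul(attention_probs, value_layer) would raise a dtype mismatch error;
# casting the probabilities first, as the patch does, keeps the matmul in fp16.
context_layer = torch.matmul(attention_probs.to(value_layer.dtype), value_layer)
print(context_layer.dtype)  # torch.float16
```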
From e7b983706586c0b809437851f3ba5863b4eda9c0 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Fri, 1 Mar 2024 08:59:26 +0100
Subject: [PATCH 057/549] [`Llama + AWQ`] fix `prepare_inputs_for_generation` 🫠 (#29381)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* use the generation config 🫠
* fixup
---
src/transformers/models/gemma/modeling_gemma.py | 2 +-
src/transformers/models/llama/modeling_llama.py | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index e78ff54be865ea..ea239193afc352 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -1161,7 +1161,7 @@ def prepare_inputs_for_generation(
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
- if getattr(self.model.layers[0].self_attn, "past_key_value", None) is not None:
+ if self.generation_config.cache_implementation == "static":
# generation with static cache
cache_position = kwargs.get("cache_position", None)
if cache_position is None:
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 43de19d329e936..8ca9397cab740b 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -1277,7 +1277,7 @@ def prepare_inputs_for_generation(
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
- if getattr(self.model.layers[0].self_attn, "past_key_value", None) is not None:
+ if self.generation_config.cache_implementation == "static":
# generation with static cache
cache_position = kwargs.get("cache_position", None)
if cache_position is None:
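A hedged usage sketch of what exercises the rewritten branch: it now triggers whenever static-cache generation is requested through the generation config, rather than by probing attention layers for a `past_key_value` attribute. The tiny checkpoint is an assumption, used only to keep the example lightweight.

```python
# Illustrative only: requesting static-cache generation via the generation config.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

model.generation_config.cache_implementation = "static"
inputs = tokenizer("Hello", return_tensors="pt")
# prepare_inputs_for_generation now consults generation_config.cache_implementation
# when deciding whether to build `cache_position` for the static cache.
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```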
From 0a0a279e994ee794acf6102179524f51931e6d61 Mon Sep 17 00:00:00 2001
From: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Date: Fri, 1 Mar 2024 09:22:31 +0000
Subject: [PATCH 058/549] 🚨🚨 [Whisper Tok] Update integration test (#29368)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* [Whisper Tok] Update integration test
* make style
---
.../whisper/test_tokenization_whisper.py | 38 ++++---------------
1 file changed, 8 insertions(+), 30 deletions(-)
diff --git a/tests/models/whisper/test_tokenization_whisper.py b/tests/models/whisper/test_tokenization_whisper.py
index 731abd3a283e4b..170857cffb98cb 100644
--- a/tests/models/whisper/test_tokenization_whisper.py
+++ b/tests/models/whisper/test_tokenization_whisper.py
@@ -16,7 +16,7 @@
from transformers.models.whisper import WhisperTokenizer, WhisperTokenizerFast
from transformers.models.whisper.tokenization_whisper import _combine_tokens_into_words, _find_longest_common_sequence
-from transformers.testing_utils import require_jinja, slow
+from transformers.testing_utils import slow
from ...test_tokenization_common import TokenizerTesterMixin
@@ -67,26 +67,26 @@ def test_full_tokenizer(self):
tokenizer = WhisperTokenizer.from_pretrained(self.tmpdirname)
tokens = tokenizer.tokenize("This is a test")
- self.assertListEqual(tokens, ["This", "Ġis", "Ġa", "Ġ", "test"])
+ self.assertListEqual(tokens, ["This", "Ġis", "Ġa", "Ġtest"])
self.assertListEqual(
tokenizer.convert_tokens_to_ids(tokens),
- [5723, 307, 257, 220, 31636],
+ [5723, 307, 257, 1500],
)
tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.")
self.assertListEqual(
tokens,
- ["I", "Ġwas", "Ġborn", "Ġin", "Ġ9", "2000", ",", "Ġand", "Ġ", "this", "Ġis", "Ġfals", "é", "."], # fmt: skip
- ) # fmt: skip
+ ["I", "Ġwas", "Ġborn", "Ġin", "Ġ9", "2000", ",", "Ġand", "Ġthis", "Ġis", "Ġfals", "é", "."], # fmt: skip
+ )
ids = tokenizer.convert_tokens_to_ids(tokens)
- self.assertListEqual(ids, [40, 390, 4232, 294, 1722, 25743, 11, 293, 220, 11176, 307, 16720, 526, 13])
+ self.assertListEqual(ids, [40, 390, 4232, 294, 1722, 25743, 11, 293, 341, 307, 16720, 526, 13])
back_tokens = tokenizer.convert_ids_to_tokens(ids)
self.assertListEqual(
back_tokens,
- ["I", "Ġwas", "Ġborn", "Ġin", "Ġ9", "2000", ",", "Ġand", "Ġ", "this", "Ġis", "Ġfals", "é", "."], # fmt: skip
- ) # fmt: skip
+ ["I", "Ġwas", "Ġborn", "Ġin", "Ġ9", "2000", ",", "Ġand", "Ġthis", "Ġis", "Ġfals", "é", "."], # fmt: skip
+ )
def test_tokenizer_slow_store_full_signature(self):
pass
@@ -499,25 +499,3 @@ def test_offset_decoding(self):
output = multilingual_tokenizer.decode(INPUT_TOKENS, output_offsets=True)["offsets"]
self.assertEqual(output, [])
-
- @require_jinja
- def test_tokenization_for_chat(self):
- multilingual_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
- # This is in English, but it's just here to make sure the chat control tokens are being added properly
- test_chats = [
- [{"role": "system", "content": "You are a helpful chatbot."}, {"role": "user", "content": "Hello!"}],
- [
- {"role": "system", "content": "You are a helpful chatbot."},
- {"role": "user", "content": "Hello!"},
- {"role": "assistant", "content": "Nice to meet you."},
- ],
- [{"role": "assistant", "content": "Nice to meet you."}, {"role": "user", "content": "Hello!"}],
- ]
- tokenized_chats = [multilingual_tokenizer.apply_chat_template(test_chat) for test_chat in test_chats]
- expected_tokens = [
- [3223, 366, 257, 4961, 5081, 18870, 13, 50257, 15947, 0, 50257],
- [3223, 366, 257, 4961, 5081, 18870, 13, 50257, 15947, 0, 50257, 37717, 220, 1353, 1677, 291, 13, 50257],
- [37717, 220, 1353, 1677, 291, 13, 50257, 15947, 0, 50257],
- ]
- for tokenized_chat, expected_tokens in zip(tokenized_chats, expected_tokens):
- self.assertListEqual(tokenized_chat, expected_tokens)
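The first hunk updates the expected BPE output now that "test" merges with its leading space into a single "Ġtest" token; the second hunk drops the chat-template test along with the `require_jinja` import. A quick local reproduction of the new expectation, with the caveat that the public tiny checkpoint is used here as a stand-in for the test fixture tokenizer, which is an assumption:

```python
# Assumption: the public checkpoint tokenizes this sentence like the updated test fixture.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
tokens = tokenizer.tokenize("This is a test")
print(tokens)  # expected per the updated assertion: ['This', 'Ġis', 'Ġa', 'Ġtest']
```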
From f1b1379f37c6b9626bb1c795d89be4c0a606f957 Mon Sep 17 00:00:00 2001
From: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Date: Fri, 1 Mar 2024 09:42:13 +0000
Subject: [PATCH 059/549] [`YOLOS`] Fix - return padded annotations (#29300)
* Fix yolos processing
* Add back slow marker - ensures the test only runs in slow CI, where pycocotools is installed
* Slow decorator goes above copied from header
---
.../image_processing_conditional_detr.py | 3 +-
.../image_processing_deformable_detr.py | 3 +-
.../models/detr/image_processing_detr.py | 3 +-
.../models/yolos/image_processing_yolos.py | 11 +++-
.../test_image_processing_conditional_detr.py | 1 -
.../test_image_processing_deformable_detr.py | 1 -
.../models/deta/test_image_processing_deta.py | 1 -
.../models/detr/test_image_processing_detr.py | 1 -
.../yolos/test_image_processing_yolos.py | 53 +++++++++----------
9 files changed, 38 insertions(+), 39 deletions(-)
diff --git a/src/transformers/models/conditional_detr/image_processing_conditional_detr.py b/src/transformers/models/conditional_detr/image_processing_conditional_detr.py
index 1a473fb841a845..e88bfc8fe230df 100644
--- a/src/transformers/models/conditional_detr/image_processing_conditional_detr.py
+++ b/src/transformers/models/conditional_detr/image_processing_conditional_detr.py
@@ -1323,7 +1323,6 @@ def preprocess(
validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
# Here, the pad() method pads to the maximum of (width, height). It does not need to be validated.
-
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
@@ -1434,8 +1433,8 @@ def preprocess(
return_pixel_mask=True,
data_format=data_format,
input_data_format=input_data_format,
- return_tensors=return_tensors,
update_bboxes=do_convert_annotations,
+ return_tensors=return_tensors,
)
else:
images = [
diff --git a/src/transformers/models/deformable_detr/image_processing_deformable_detr.py b/src/transformers/models/deformable_detr/image_processing_deformable_detr.py
index cd3ac90a47adf3..5525eeeb8c58d5 100644
--- a/src/transformers/models/deformable_detr/image_processing_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/image_processing_deformable_detr.py
@@ -1321,7 +1321,6 @@ def preprocess(
validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
# Here, the pad() method pads to the maximum of (width, height). It does not need to be validated.
-
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
@@ -1432,8 +1431,8 @@ def preprocess(
return_pixel_mask=True,
data_format=data_format,
input_data_format=input_data_format,
- return_tensors=return_tensors,
update_bboxes=do_convert_annotations,
+ return_tensors=return_tensors,
)
else:
images = [
diff --git a/src/transformers/models/detr/image_processing_detr.py b/src/transformers/models/detr/image_processing_detr.py
index 71768a8e7b0da1..e0e59cbc7c40c6 100644
--- a/src/transformers/models/detr/image_processing_detr.py
+++ b/src/transformers/models/detr/image_processing_detr.py
@@ -1293,7 +1293,6 @@ def preprocess(
validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
# Here, the pad() method pads to the maximum of (width, height). It does not need to be validated.
-
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
@@ -1404,8 +1403,8 @@ def preprocess(
return_pixel_mask=True,
data_format=data_format,
input_data_format=input_data_format,
- return_tensors=return_tensors,
update_bboxes=do_convert_annotations,
+ return_tensors=return_tensors,
)
else:
images = [
diff --git a/src/transformers/models/yolos/image_processing_yolos.py b/src/transformers/models/yolos/image_processing_yolos.py
index f77e27ec40d9e5..c4e44854a0da43 100644
--- a/src/transformers/models/yolos/image_processing_yolos.py
+++ b/src/transformers/models/yolos/image_processing_yolos.py
@@ -1095,7 +1095,14 @@ def pad(
]
data["pixel_mask"] = masks
- return BatchFeature(data=data, tensor_type=return_tensors)
+ encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
+
+ if annotations is not None:
+ encoded_inputs["labels"] = [
+ BatchFeature(annotation, tensor_type=return_tensors) for annotation in padded_annotations
+ ]
+
+ return encoded_inputs
def preprocess(
self,
@@ -1314,7 +1321,7 @@ def preprocess(
if do_convert_annotations and annotations is not None:
annotations = [
- self.normalize_annotation(annotation, get_image_size(image))
+ self.normalize_annotation(annotation, get_image_size(image, input_data_format))
for annotation, image in zip(annotations, images)
]
diff --git a/tests/models/conditional_detr/test_image_processing_conditional_detr.py b/tests/models/conditional_detr/test_image_processing_conditional_detr.py
index bb16529f3fa342..e340f4247d47df 100644
--- a/tests/models/conditional_detr/test_image_processing_conditional_detr.py
+++ b/tests/models/conditional_detr/test_image_processing_conditional_detr.py
@@ -368,7 +368,6 @@ def test_batched_coco_detection_annotations(self):
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
- @slow
# Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_panoptic_annotations with Detr->ConditionalDetr
def test_batched_coco_panoptic_annotations(self):
# prepare image, target and masks_path
diff --git a/tests/models/deformable_detr/test_image_processing_deformable_detr.py b/tests/models/deformable_detr/test_image_processing_deformable_detr.py
index 18ae6595b1736f..50df72496ffc3e 100644
--- a/tests/models/deformable_detr/test_image_processing_deformable_detr.py
+++ b/tests/models/deformable_detr/test_image_processing_deformable_detr.py
@@ -370,7 +370,6 @@ def test_batched_coco_detection_annotations(self):
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
- @slow
# Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_panoptic_annotations with Detr->DeformableDetr
def test_batched_coco_panoptic_annotations(self):
# prepare image, target and masks_path
diff --git a/tests/models/deta/test_image_processing_deta.py b/tests/models/deta/test_image_processing_deta.py
index 109b2f05a8e6a5..ad17f0b5a17809 100644
--- a/tests/models/deta/test_image_processing_deta.py
+++ b/tests/models/deta/test_image_processing_deta.py
@@ -364,7 +364,6 @@ def test_batched_coco_detection_annotations(self):
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
- @slow
# Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_panoptic_annotations with Detr->Deta
def test_batched_coco_panoptic_annotations(self):
# prepare image, target and masks_path
diff --git a/tests/models/detr/test_image_processing_detr.py b/tests/models/detr/test_image_processing_detr.py
index 9d1f169efe260c..c79c1d7b01962a 100644
--- a/tests/models/detr/test_image_processing_detr.py
+++ b/tests/models/detr/test_image_processing_detr.py
@@ -426,7 +426,6 @@ def test_batched_coco_detection_annotations(self):
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
- @slow
def test_batched_coco_panoptic_annotations(self):
# prepare image, target and masks_path
image_0 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
diff --git a/tests/models/yolos/test_image_processing_yolos.py b/tests/models/yolos/test_image_processing_yolos.py
index 4bdde658cdf992..a1bc2ff172f749 100644
--- a/tests/models/yolos/test_image_processing_yolos.py
+++ b/tests/models/yolos/test_image_processing_yolos.py
@@ -288,8 +288,8 @@ def test_call_pytorch_with_coco_panoptic_annotations(self):
expected_size = torch.tensor([800, 1056])
self.assertTrue(torch.allclose(encoding["labels"][0]["size"], expected_size))
+ # Output size is slightly different from DETR as YOLOS pads to a multiple of 16
@slow
- # Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_detection_annotations with Detr->Yolos
def test_batched_coco_detection_annotations(self):
image_0 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
image_1 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png").resize((800, 800))
@@ -325,7 +325,7 @@ def test_batched_coco_detection_annotations(self):
)
# Check the pixel values have been padded
- postprocessed_height, postprocessed_width = 800, 1066
+ postprocessed_height, postprocessed_width = 800, 1056
expected_shape = torch.Size([2, 3, postprocessed_height, postprocessed_width])
self.assertEqual(encoding["pixel_values"].shape, expected_shape)
@@ -344,20 +344,20 @@ def test_batched_coco_detection_annotations(self):
)
expected_boxes_1 = torch.tensor(
[
- [0.4130, 0.2765, 0.0453, 0.2215],
- [0.1272, 0.2016, 0.1561, 0.0940],
- [0.3757, 0.4933, 0.7488, 0.9865],
- [0.3759, 0.5002, 0.7492, 0.9955],
- [0.1971, 0.5456, 0.3532, 0.8646],
- [0.5790, 0.4115, 0.3430, 0.7161],
+ [0.4169, 0.2765, 0.0458, 0.2215],
+ [0.1284, 0.2016, 0.1576, 0.0940],
+ [0.3792, 0.4933, 0.7559, 0.9865],
+ [0.3794, 0.5002, 0.7563, 0.9955],
+ [0.1990, 0.5456, 0.3566, 0.8646],
+ [0.5845, 0.4115, 0.3462, 0.7161],
]
)
- self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1e-3))
- self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1e-3))
+ self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, atol=1e-3))
+ self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, atol=1e-3))
# Check the masks have also been padded
- self.assertEqual(encoding["labels"][0]["masks"].shape, torch.Size([6, 800, 1066]))
- self.assertEqual(encoding["labels"][1]["masks"].shape, torch.Size([6, 800, 1066]))
+ self.assertEqual(encoding["labels"][0]["masks"].shape, torch.Size([6, 800, 1056]))
+ self.assertEqual(encoding["labels"][1]["masks"].shape, torch.Size([6, 800, 1056]))
# Check if do_convert_annotations=False, then the annotations are not converted to centre_x, centre_y, width, height
# format and not in the range [0, 1]
@@ -404,11 +404,10 @@ def test_batched_coco_detection_annotations(self):
unnormalized_boxes_1[:, 1] + unnormalized_boxes_1[:, 3] / 2,
]
).T
- self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
- self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
+ self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, atol=1))
+ self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, atol=1))
- @slow
- # Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_panoptic_annotations with Detr->Yolos
+ # Output size is slightly different from DETR as YOLOS pads to a multiple of 16
def test_batched_coco_panoptic_annotations(self):
# prepare image, target and masks_path
image_0 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
@@ -448,7 +447,7 @@ def test_batched_coco_panoptic_annotations(self):
)
# Check the pixel values have been padded
- postprocessed_height, postprocessed_width = 800, 1066
+ postprocessed_height, postprocessed_width = 800, 1056
expected_shape = torch.Size([2, 3, postprocessed_height, postprocessed_width])
self.assertEqual(encoding["pixel_values"].shape, expected_shape)
@@ -467,20 +466,20 @@ def test_batched_coco_panoptic_annotations(self):
)
expected_boxes_1 = torch.tensor(
[
- [0.1576, 0.3262, 0.2814, 0.5175],
- [0.4634, 0.2463, 0.2720, 0.4275],
- [0.3002, 0.2956, 0.5985, 0.5913],
- [0.1013, 0.1200, 0.1238, 0.0550],
- [0.3297, 0.1656, 0.0347, 0.1312],
- [0.2997, 0.2994, 0.5994, 0.5987],
+ [0.1591, 0.3262, 0.2841, 0.5175],
+ [0.4678, 0.2463, 0.2746, 0.4275],
+ [0.3030, 0.2956, 0.6042, 0.5913],
+ [0.1023, 0.1200, 0.1250, 0.0550],
+ [0.3329, 0.1656, 0.0350, 0.1312],
+ [0.3026, 0.2994, 0.6051, 0.5987],
]
)
- self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1e-3))
- self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1e-3))
+ self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, atol=1e-3))
+ self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, atol=1e-3))
# Check the masks have also been padded
- self.assertEqual(encoding["labels"][0]["masks"].shape, torch.Size([6, 800, 1066]))
- self.assertEqual(encoding["labels"][1]["masks"].shape, torch.Size([6, 800, 1066]))
+ self.assertEqual(encoding["labels"][0]["masks"].shape, torch.Size([6, 800, 1056]))
+ self.assertEqual(encoding["labels"][1]["masks"].shape, torch.Size([6, 800, 1056]))
# Check if do_convert_annotations=False, then the annotations are not converted to centre_x, centre_y, width, height
# format and not in the range [0, 1]
From 15f8296a9b493eaa0770557fe2e931677fb62e2f Mon Sep 17 00:00:00 2001
From: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com>
Date: Fri, 1 Mar 2024 18:29:21 +0800
Subject: [PATCH 060/549] Support subfolder with `AutoProcessor` (#29169)
enable subfolder
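A hedged sketch of what this enables (the repo id and subfolder name below are placeholders, not a real checkpoint layout): processor files that live in a subfolder of a Hub repo can now be loaded by forwarding `subfolder`.

```python
from transformers import AutoProcessor

# Placeholder repo id and subfolder; substitute a repo whose processor files
# live somewhere other than the repository root.
processor = AutoProcessor.from_pretrained(
    "username/some-model-repo",
    subfolder="processor",
)
```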
---
src/transformers/feature_extraction_utils.py | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/transformers/feature_extraction_utils.py b/src/transformers/feature_extraction_utils.py
index bed343e48d6238..b0df39e1642b9a 100644
--- a/src/transformers/feature_extraction_utils.py
+++ b/src/transformers/feature_extraction_utils.py
@@ -453,6 +453,7 @@ def get_feature_extractor_dict(
force_download = kwargs.pop("force_download", False)
resume_download = kwargs.pop("resume_download", False)
proxies = kwargs.pop("proxies", None)
+ subfolder = kwargs.pop("subfolder", None)
token = kwargs.pop("token", None)
use_auth_token = kwargs.pop("use_auth_token", None)
local_files_only = kwargs.pop("local_files_only", False)
@@ -502,6 +503,7 @@ def get_feature_extractor_dict(
proxies=proxies,
resume_download=resume_download,
local_files_only=local_files_only,
+ subfolder=subfolder,
token=token,
user_agent=user_agent,
revision=revision,
From cec773345aeffce3c04e8891303a3f748de7141e Mon Sep 17 00:00:00 2001
From: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Date: Fri, 1 Mar 2024 10:32:36 -0500
Subject: [PATCH 061/549] Fix llama + gemma accelerate tests (#29380)
---
tests/models/gemma/test_modeling_gemma.py | 4 ++++
tests/models/llama/test_modeling_llama.py | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/tests/models/gemma/test_modeling_gemma.py b/tests/models/gemma/test_modeling_gemma.py
index 6385e4cbf5a809..1b32f1b16ee486 100644
--- a/tests/models/gemma/test_modeling_gemma.py
+++ b/tests/models/gemma/test_modeling_gemma.py
@@ -298,6 +298,10 @@ class GemmaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixi
test_headmasking = False
test_pruning = False
+ # Need to remove 0.9 in `test_cpu_offload`
+ # This is because we are hitting edge cases with the causal_mask buffer
+ model_split_percents = [0.5, 0.6]
+
# TODO (ydshieh): Check this. See https://app.circleci.com/pipelines/github/huggingface/transformers/79245/workflows/9490ef58-79c2-410d-8f51-e3495156cf9c/jobs/1012146
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
diff --git a/tests/models/llama/test_modeling_llama.py b/tests/models/llama/test_modeling_llama.py
index 02c649c39aa0d4..9c5eccd2d29e30 100644
--- a/tests/models/llama/test_modeling_llama.py
+++ b/tests/models/llama/test_modeling_llama.py
@@ -302,6 +302,10 @@ class LlamaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixi
test_pruning = False
fx_compatible = True
+ # Need to use `0.8` instead of `0.9` for `test_cpu_offload`
+ # This is because we are hitting edge cases with the causal_mask buffer
+ model_split_percents = [0.5, 0.7, 0.8]
+
def setUp(self):
self.model_tester = LlamaModelTester(self)
self.config_tester = ConfigTester(self, config_class=LlamaConfig, hidden_size=37)
From 1a7c117df96adac7b60a1f6f0f228d71b1ed1283 Mon Sep 17 00:00:00 2001
From: Zach Mueller
Date: Fri, 1 Mar 2024 12:00:29 -0500
Subject: [PATCH 062/549] Fix deprecated arg issue (#29372)
* Fix deprecated arg issue
* Trainer check too
* Check for dict or dataclass
* Simplify, make config always AcceleratorConfig
* Upstream to Trainer
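A minimal sketch of the resulting behaviour, assuming `accelerate` is installed: `TrainingArguments.accelerator_config` is now always normalized to an `AcceleratorConfig`, whether it is left unset or passed as a dict.

```python
from transformers import TrainingArguments
from transformers.trainer_pt_utils import AcceleratorConfig

# Passed as a dict: __post_init__ converts it to an AcceleratorConfig.
args = TrainingArguments(output_dir="out", accelerator_config={"split_batches": True})
assert isinstance(args.accelerator_config, AcceleratorConfig)
assert args.accelerator_config.split_batches is True

# Left unset: a default AcceleratorConfig is created, so Trainer can always
# call `self.args.accelerator_config.to_dict()`.
args = TrainingArguments(output_dir="out")
assert isinstance(args.accelerator_config, AcceleratorConfig)
```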
---
src/transformers/trainer.py | 14 +-------------
src/transformers/training_args.py | 8 +++++---
tests/trainer/test_trainer.py | 14 ++++++++++++++
3 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 1b70db000ccfeb..414d97eb527354 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -80,7 +80,6 @@
TrainerState,
)
from .trainer_pt_utils import (
- AcceleratorConfig,
DistributedTensorGatherer,
IterableDatasetShard,
LabelSmoother,
@@ -4116,21 +4115,10 @@ def create_accelerator_and_postprocess(self):
gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
# create accelerator object
- accelerator_kwargs = {}
- if self.args.accelerator_config is not None:
- accelerator_kwargs = self.args.accelerator_config
- # dict and AcceleratorConfigs are parseable, json files are not
- if isinstance(accelerator_kwargs, AcceleratorConfig):
- accelerator_kwargs = accelerator_kwargs.to_dict()
- elif isinstance(accelerator_kwargs, dict):
- # Some values may need to go through non-accelerate aligned defaults
- # and we need to run the `__post_init__` to set them
- accelerator_kwargs = AcceleratorConfig(**accelerator_kwargs).to_dict()
-
self.accelerator = Accelerator(
deepspeed_plugin=self.args.deepspeed_plugin,
gradient_accumulation_plugin=gradient_accumulation_plugin,
- **accelerator_kwargs,
+ **self.args.accelerator_config.to_dict(),
)
# some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
self.gather_function = self.accelerator.gather_for_metrics
diff --git a/src/transformers/training_args.py b/src/transformers/training_args.py
index 19ab24c205cf72..ba89d914d76135 100644
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -1737,9 +1737,11 @@ def __post_init__(self):
os.environ[f"{prefix}USE_ORIG_PARAMS"] = self.fsdp_config.get("use_orig_params", "true")
if is_accelerate_available():
- if not isinstance(self.accelerator_config, (AcceleratorConfig, dict)):
+ if not isinstance(self.accelerator_config, (AcceleratorConfig)):
if self.accelerator_config is None:
self.accelerator_config = AcceleratorConfig()
+ elif isinstance(self.accelerator_config, dict):
+ self.accelerator_config = AcceleratorConfig(**self.accelerator_config)
else:
self.accelerator_config = AcceleratorConfig.from_json_file(self.accelerator_config)
if self.dispatch_batches is not None:
@@ -1748,7 +1750,7 @@ def __post_init__(self):
" `--accelerator_config {'dispatch_batches':VALUE} instead",
FutureWarning,
)
- self.accelerator_config["dispatch_batches"] = self.dispatch_batches
+ self.accelerator_config.dispatch_batches = self.dispatch_batches
if self.split_batches is not None:
warnings.warn(
@@ -1756,7 +1758,7 @@ def __post_init__(self):
" `--accelerator_config {'split_batches':VALUE} instead",
FutureWarning,
)
- self.accelerator_config["split_batches"] = self.split_batches
+ self.accelerator_config.split_batches = self.split_batches
if self.tpu_metrics_debug:
warnings.warn(
diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py
index 65eeb6d6238431..1ebbe1ca7a86eb 100644
--- a/tests/trainer/test_trainer.py
+++ b/tests/trainer/test_trainer.py
@@ -2633,6 +2633,20 @@ def test_accelerator_config_from_dict_with_deprecated_args(self):
self.assertEqual(trainer.accelerator.even_batches, False)
self.assertEqual(trainer.accelerator.dispatch_batches, None)
+ def test_accelerator_config_only_deprecated_args(self):
+ with tempfile.TemporaryDirectory() as tmp_dir:
+ with self.assertWarns(FutureWarning) as cm:
+ args = RegressionTrainingArguments(
+ output_dir=tmp_dir,
+ split_batches=True,
+ )
+ self.assertIn("split_batches", str(cm.warnings[0].message))
+ config = RegressionModelConfig(a=1.5, b=2.5)
+ model = RegressionPreTrainedModel(config)
+ eval_dataset = SampleIterableDataset()
+ trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
+ self.assertEqual(trainer.accelerator.split_batches, True)
+
@require_torch
@is_staging_test
From 831bc25d8fdb85768402f772cf65cc3d7872b211 Mon Sep 17 00:00:00 2001
From: David Valente <74915610+DavidAfonsoValente@users.noreply.github.com>
Date: Fri, 1 Mar 2024 18:04:40 +0100
Subject: [PATCH 063/549] Correct zero division error in inverse sqrt scheduler
(#28982)
* Correct zero division error in inverse sqrt scheduler
* default timescale to 10_000
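A short sketch of the fixed edge case: with `num_warmup_steps=0`, `timescale` now falls back to `10_000` instead of `0`, so building and stepping the schedule no longer divides by zero. The tiny linear model only exists to give the optimizer some parameters.

```python
import torch
from transformers import get_inverse_sqrt_schedule

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Previously, num_warmup_steps=0 implied timescale=0 and raised ZeroDivisionError.
scheduler = get_inverse_sqrt_schedule(optimizer, num_warmup_steps=0)

for _ in range(3):
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())
```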
---
src/transformers/optimization.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/optimization.py b/src/transformers/optimization.py
index b3861b371a2393..65a41d1b1a44f2 100644
--- a/src/transformers/optimization.py
+++ b/src/transformers/optimization.py
@@ -317,7 +317,7 @@ def get_inverse_sqrt_schedule(
# https://github.com/google-research/big_vision/blob/f071ce68852d56099437004fd70057597a95f6ef/big_vision/utils.py#L930
if timescale is None:
- timescale = num_warmup_steps
+ timescale = num_warmup_steps or 10_000
lr_lambda = partial(_get_inverse_sqrt_schedule_lr_lambda, num_warmup_steps=num_warmup_steps, timescale=timescale)
return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)
From aade711d1ee225036be22a90bdd1f04eb1c0ba36 Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Mon, 4 Mar 2024 15:24:38 +0800
Subject: [PATCH 064/549] [tests] enable automatic speech recognition pipeline
tests on XPU (#29308)
* use require_torch_gpu
* enable on XPU
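The spirit of the change, as a rough device-agnostic sketch (the selection logic below is illustrative; the tests rely on `transformers.testing_utils.torch_device` rather than this snippet):

```python
import torch
from datasets import load_dataset
from transformers import pipeline

# Illustrative device pick: prefer CUDA, then XPU, then CPU.
if torch.cuda.is_available():
    device = "cuda:0"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu:0"
else:
    device = "cpu"

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en", device=device)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
print(asr(ds[0]["audio"])["text"])
```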
---
.../test_pipelines_automatic_speech_recognition.py | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tests/pipelines/test_pipelines_automatic_speech_recognition.py b/tests/pipelines/test_pipelines_automatic_speech_recognition.py
index d2af7e44687fbc..2e01ab2731d3b4 100644
--- a/tests/pipelines/test_pipelines_automatic_speech_recognition.py
+++ b/tests/pipelines/test_pipelines_automatic_speech_recognition.py
@@ -1221,7 +1221,7 @@ def test_whisper_longform(self):
processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
- model = model.to("cuda")
+ model = model.to(torch_device)
pipe = pipeline(
"automatic-speech-recognition",
@@ -1229,7 +1229,7 @@ def test_whisper_longform(self):
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
- device="cuda:0",
+ device=torch_device,
)
ds = load_dataset("distil-whisper/meanwhile", "default")["test"]
@@ -1246,7 +1246,7 @@ def test_seamless_v2(self):
pipe = pipeline(
"automatic-speech-recognition",
model="facebook/seamless-m4t-v2-large",
- device="cuda:0",
+ device=torch_device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
From 38953a75c120a6c1cd713718a9af0ed553c5113d Mon Sep 17 00:00:00 2001
From: Poedator <24738311+poedator@users.noreply.github.com>
Date: Mon, 4 Mar 2024 10:26:01 +0300
Subject: [PATCH 065/549] update path to hub files in the error message
(#29369)
Update path to hub files.
We need to add `tree/` to the path to files on the HF Hub.
See example path:
`https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main`
---
src/transformers/utils/hub.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/utils/hub.py b/src/transformers/utils/hub.py
index 984fba1b6b743b..47ca63e7a31503 100644
--- a/src/transformers/utils/hub.py
+++ b/src/transformers/utils/hub.py
@@ -368,7 +368,7 @@ def cached_file(
if _raise_exceptions_for_missing_entries:
raise EnvironmentError(
f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout "
- f"'https://huggingface.co/{path_or_repo_id}/{revision}' for available files."
+ f"'https://huggingface.co/{path_or_repo_id}/tree/{revision}' for available files."
)
else:
return None
From 39ef3fb248ba288897f35337f4086054c69332e5 Mon Sep 17 00:00:00 2001
From: Siming Dai <908660116@qq.com>
Date: Mon, 4 Mar 2024 16:08:56 +0800
Subject: [PATCH 066/549] [Mixtral] Fixes attention masking in the loss
(#29363)
Fix mixtral load balancing loss
Co-authored-by: dingkunbo
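To make the shape fix concrete, here is a small sketch with made-up dimensions of how the padding mask is expanded in `load_balancing_loss_func`: each token contributes `top_k` selected experts, so the broadcast must use `top_k` rather than a hardcoded `2`. With Mixtral's default `num_experts_per_tok=2` both versions coincide; the fix matters for configs with a different `top_k`.

```python
import torch

# Made-up dimensions for illustration.
num_hidden_layers, batch_size, sequence_length = 2, 3, 5
top_k, num_experts = 4, 8

attention_mask = torch.ones(batch_size, sequence_length)
attention_mask[:, -1] = 0  # pretend the last position is padding

# Same expansion as the fixed load_balancing_loss_func: one mask entry per (token, selected expert).
expert_attention_mask = (
    attention_mask[None, :, :, None, None]
    .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
    .reshape(-1, top_k, num_experts)
)
print(expert_attention_mask.shape)  # (num_hidden_layers * batch_size * sequence_length, top_k, num_experts)
```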
---
src/transformers/models/mixtral/modeling_mixtral.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/transformers/models/mixtral/modeling_mixtral.py b/src/transformers/models/mixtral/modeling_mixtral.py
index 01ea7282d780b7..12733dfdd90497 100644
--- a/src/transformers/models/mixtral/modeling_mixtral.py
+++ b/src/transformers/models/mixtral/modeling_mixtral.py
@@ -123,8 +123,8 @@ def load_balancing_loss_func(
# Compute the mask that masks all padding tokens as 0 with the same shape of expert_mask
expert_attention_mask = (
attention_mask[None, :, :, None, None]
- .expand((num_hidden_layers, batch_size, sequence_length, 2, num_experts))
- .reshape(-1, 2, num_experts)
+ .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
+ .reshape(-1, top_k, num_experts)
.to(compute_device)
)
From 704b3f74f9685e1772acd9949f65b0c5bbd64539 Mon Sep 17 00:00:00 2001
From: Y4hL <43219534+Y4hL@users.noreply.github.com>
Date: Mon, 4 Mar 2024 11:19:13 +0200
Subject: [PATCH 067/549] Add mlx support to BatchEncoding.convert_to_tensors
(#29406)
* Add mlx support
* Fix import order and use def instead of lambda
* Another fix for ruff format :)
* Add detecting mlx from repr, add is_mlx_array
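A minimal usage sketch, assuming the `mlx` package is installed (Apple silicon): tokenizer outputs can now be converted to MLX arrays by requesting `return_tensors="mlx"`.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Raises ImportError if mlx is not installed; otherwise returns mlx.core.array values.
encoding = tokenizer("Hello world", return_tensors="mlx")
print(type(encoding["input_ids"]))
```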
---
src/transformers/tokenization_utils_base.py | 11 ++++++++
src/transformers/utils/__init__.py | 1 +
src/transformers/utils/generic.py | 30 ++++++++++++++++++---
src/transformers/utils/import_utils.py | 5 ++++
4 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
index a5701c34dca5eb..054146ad637481 100644
--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -48,6 +48,7 @@
extract_commit_hash,
is_flax_available,
is_jax_tensor,
+ is_mlx_available,
is_numpy_array,
is_offline_mode,
is_remote_url,
@@ -726,6 +727,16 @@ def as_tensor(value, dtype=None):
as_tensor = jnp.array
is_tensor = is_jax_tensor
+
+ elif tensor_type == TensorType.MLX:
+ if not is_mlx_available():
+ raise ImportError("Unable to convert output to MLX tensors format, MLX is not installed.")
+ import mlx.core as mx
+
+ as_tensor = mx.array
+
+ def is_tensor(obj):
+ return isinstance(obj, mx.array)
else:
def as_tensor(value, dtype=None):
diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py
index 154077924beadf..03e2663350794b 100644
--- a/src/transformers/utils/__init__.py
+++ b/src/transformers/utils/__init__.py
@@ -134,6 +134,7 @@
is_keras_nlp_available,
is_levenshtein_available,
is_librosa_available,
+ is_mlx_available,
is_natten_available,
is_ninja_available,
is_nltk_available,
diff --git a/src/transformers/utils/generic.py b/src/transformers/utils/generic.py
index d73698d8c93253..28e63ce45b8eae 100644
--- a/src/transformers/utils/generic.py
+++ b/src/transformers/utils/generic.py
@@ -28,7 +28,14 @@
import numpy as np
from packaging import version
-from .import_utils import get_torch_version, is_flax_available, is_tf_available, is_torch_available, is_torch_fx_proxy
+from .import_utils import (
+ get_torch_version,
+ is_flax_available,
+ is_mlx_available,
+ is_tf_available,
+ is_torch_available,
+ is_torch_fx_proxy,
+)
if is_flax_available():
@@ -87,6 +94,8 @@ def infer_framework_from_repr(x):
return "jax"
    elif representation.startswith("<class 'numpy."):
        return "np"
+    elif representation.startswith("<class 'mlx."):
+        return "mlx"
diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py
--- a/src/transformers/utils/import_utils.py
+++ b/src/transformers/utils/import_utils.py
@@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
_torchaudio_available = _is_package_available("torchaudio")
_torchdistx_available = _is_package_available("torchdistx")
_torchvision_available = _is_package_available("torchvision")
+_mlx_available = _is_package_available("mlx")
_torch_version = "N/A"
@@ -923,6 +924,10 @@ def is_jinja_available():
return _jinja_available
+def is_mlx_available():
+ return _mlx_available
+
+
# docstyle-ignore
CV2_IMPORT_ERROR = """
{0} requires the OpenCV library but it was not found in your environment. You can install it with:
From c38a12270a11e237cf65d085fbbaf0c4b7976b67 Mon Sep 17 00:00:00 2001
From: Traun Leyden
Date: Mon, 4 Mar 2024 10:23:40 +0100
Subject: [PATCH 068/549] Workaround for #27758 to avoid ZeroDivisionError
(#28756)
---
src/transformers/trainer.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 414d97eb527354..efbe7bea171af5 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -2080,7 +2080,8 @@ def _inner_training_loop(
# add remaining tr_loss
self._total_loss_scalar += tr_loss.item()
- train_loss = self._total_loss_scalar / self.state.global_step
+ effective_global_step = max(self.state.global_step, 0.001) # Avoid ZeroDivisionError
+ train_loss = self._total_loss_scalar / effective_global_step
metrics = speed_metrics(
"train",
From 5e4b69dc12980ce4ee387cb449bfb1169b4f74c3 Mon Sep 17 00:00:00 2001
From: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Date: Mon, 4 Mar 2024 11:51:16 +0100
Subject: [PATCH 069/549] Convert SlimSAM checkpoints (#28379)
* First commit
* Improve conversion script
* Convert more checkpoints
* Update src/transformers/models/sam/convert_sam_original_to_hf_format.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Rename file
* More updates
* Update docstring
* Update script
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
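As a usage sketch for the converted weights (the repo id follows the `nielsr/{model_name}` naming used by the script's `push_to_hub` branch and is an assumption about where the checkpoint was uploaded):

```python
import numpy as np
import requests
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

repo_id = "nielsr/slimsam-50-uniform"  # assumed Hub location of a converted SlimSAM checkpoint
model = SamModel.from_pretrained(repo_id)
processor = SamProcessor.from_pretrained(repo_id)

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

inputs = processor(images=np.array(raw_image), input_points=[[[500, 375]]], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.iou_scores.squeeze())
```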
---
...l_to_hf_format.py => convert_sam_to_hf.py} | 136 ++++++++++++------
utils/not_doctested.txt | 2 +-
2 files changed, 91 insertions(+), 47 deletions(-)
rename src/transformers/models/sam/{convert_sam_original_to_hf_format.py => convert_sam_to_hf.py} (69%)
diff --git a/src/transformers/models/sam/convert_sam_original_to_hf_format.py b/src/transformers/models/sam/convert_sam_to_hf.py
similarity index 69%
rename from src/transformers/models/sam/convert_sam_original_to_hf_format.py
rename to src/transformers/models/sam/convert_sam_to_hf.py
index b3cb45b3470139..be375494f059d0 100644
--- a/src/transformers/models/sam/convert_sam_original_to_hf_format.py
+++ b/src/transformers/models/sam/convert_sam_to_hf.py
@@ -14,6 +14,10 @@
# limitations under the License.
"""
Convert SAM checkpoints from the original repository.
+
+URL: https://github.com/facebookresearch/segment-anything.
+
+Also supports converting the SlimSAM checkpoints from https://github.com/czg1225/SlimSAM/tree/master.
"""
import argparse
import re
@@ -33,6 +37,47 @@
)
+def get_config(model_name):
+ if "slimsam-50" in model_name:
+ vision_config = SamVisionConfig(
+ hidden_size=384,
+ mlp_dim=1536,
+ num_hidden_layers=12,
+ num_attention_heads=12,
+ global_attn_indexes=[2, 5, 8, 11],
+ )
+ elif "slimsam-77" in model_name:
+ vision_config = SamVisionConfig(
+ hidden_size=168,
+ mlp_dim=696,
+ num_hidden_layers=12,
+ num_attention_heads=12,
+ global_attn_indexes=[2, 5, 8, 11],
+ )
+ elif "sam_vit_b" in model_name:
+ vision_config = SamVisionConfig()
+ elif "sam_vit_l" in model_name:
+ vision_config = SamVisionConfig(
+ hidden_size=1024,
+ num_hidden_layers=24,
+ num_attention_heads=16,
+ global_attn_indexes=[5, 11, 17, 23],
+ )
+ elif "sam_vit_h" in model_name:
+ vision_config = SamVisionConfig(
+ hidden_size=1280,
+ num_hidden_layers=32,
+ num_attention_heads=16,
+ global_attn_indexes=[7, 15, 23, 31],
+ )
+
+ config = SamConfig(
+ vision_config=vision_config,
+ )
+
+ return config
+
+
KEYS_TO_MODIFY_MAPPING = {
"iou_prediction_head.layers.0": "iou_prediction_head.proj_in",
"iou_prediction_head.layers.1": "iou_prediction_head.layers.0",
@@ -88,63 +133,47 @@ def replace_keys(state_dict):
return model_state_dict
-def convert_sam_checkpoint(model_name, pytorch_dump_folder, push_to_hub, model_hub_id="ybelkada/segment-anything"):
- checkpoint_path = hf_hub_download(model_hub_id, f"checkpoints/{model_name}.pth")
-
- if "sam_vit_b" in model_name:
- config = SamConfig()
- elif "sam_vit_l" in model_name:
- vision_config = SamVisionConfig(
- hidden_size=1024,
- num_hidden_layers=24,
- num_attention_heads=16,
- global_attn_indexes=[5, 11, 17, 23],
- )
-
- config = SamConfig(
- vision_config=vision_config,
- )
- elif "sam_vit_h" in model_name:
- vision_config = SamVisionConfig(
- hidden_size=1280,
- num_hidden_layers=32,
- num_attention_heads=16,
- global_attn_indexes=[7, 15, 23, 31],
- )
-
- config = SamConfig(
- vision_config=vision_config,
- )
+def convert_sam_checkpoint(model_name, checkpoint_path, pytorch_dump_folder, push_to_hub):
+ config = get_config(model_name)
state_dict = torch.load(checkpoint_path, map_location="cpu")
state_dict = replace_keys(state_dict)
image_processor = SamImageProcessor()
-
processor = SamProcessor(image_processor=image_processor)
hf_model = SamModel(config)
+ hf_model.eval()
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
hf_model.load_state_dict(state_dict)
- hf_model = hf_model.to("cuda")
+ hf_model = hf_model.to(device)
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
- input_points = [[[400, 650]]]
+ input_points = [[[500, 375]]]
input_labels = [[1]]
- inputs = processor(images=np.array(raw_image), return_tensors="pt").to("cuda")
+ inputs = processor(images=np.array(raw_image), return_tensors="pt").to(device)
with torch.no_grad():
output = hf_model(**inputs)
scores = output.iou_scores.squeeze()
- if model_name == "sam_vit_h_4b8939":
- assert scores[-1].item() == 0.579890251159668
+ if model_name == "sam_vit_b_01ec64":
+ inputs = processor(
+ images=np.array(raw_image), input_points=input_points, input_labels=input_labels, return_tensors="pt"
+ ).to(device)
+
+ with torch.no_grad():
+ output = hf_model(**inputs)
+ scores = output.iou_scores.squeeze()
+ elif model_name == "sam_vit_h_4b8939":
inputs = processor(
images=np.array(raw_image), input_points=input_points, input_labels=input_labels, return_tensors="pt"
- ).to("cuda")
+ ).to(device)
with torch.no_grad():
output = hf_model(**inputs)
@@ -154,7 +183,7 @@ def convert_sam_checkpoint(model_name, pytorch_dump_folder, push_to_hub, model_h
input_boxes = ((75, 275, 1725, 850),)
- inputs = processor(images=np.array(raw_image), input_boxes=input_boxes, return_tensors="pt").to("cuda")
+ inputs = processor(images=np.array(raw_image), input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():
output = hf_model(**inputs)
@@ -168,7 +197,7 @@ def convert_sam_checkpoint(model_name, pytorch_dump_folder, push_to_hub, model_h
inputs = processor(
images=np.array(raw_image), input_points=input_points, input_labels=input_labels, return_tensors="pt"
- ).to("cuda")
+ ).to(device)
with torch.no_grad():
output = hf_model(**inputs)
@@ -176,16 +205,31 @@ def convert_sam_checkpoint(model_name, pytorch_dump_folder, push_to_hub, model_h
assert scores[-1].item() == 0.9936047792434692
+ if pytorch_dump_folder is not None:
+ processor.save_pretrained(pytorch_dump_folder)
+ hf_model.save_pretrained(pytorch_dump_folder)
+
+ if push_to_hub:
+ repo_id = f"nielsr/{model_name}" if "slimsam" in model_name else f"meta/{model_name}"
+ processor.push_to_hub(repo_id)
+ hf_model.push_to_hub(repo_id)
+
if __name__ == "__main__":
parser = argparse.ArgumentParser()
- choices = ["sam_vit_b_01ec64", "sam_vit_h_4b8939", "sam_vit_l_0b3195"]
+ choices = ["sam_vit_b_01ec64", "sam_vit_h_4b8939", "sam_vit_l_0b3195", "slimsam-50-uniform", "slimsam-77-uniform"]
parser.add_argument(
"--model_name",
default="sam_vit_h_4b8939",
choices=choices,
type=str,
- help="Path to hf config.json of model to convert",
+ help="Name of the original model to convert",
+ )
+ parser.add_argument(
+ "--checkpoint_path",
+ type=str,
+ required=False,
+ help="Path to the original checkpoint",
)
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument(
@@ -193,14 +237,14 @@ def convert_sam_checkpoint(model_name, pytorch_dump_folder, push_to_hub, model_h
action="store_true",
help="Whether to push the model and processor to the hub after converting",
)
- parser.add_argument(
- "--model_hub_id",
- default="ybelkada/segment-anything",
- choices=choices,
- type=str,
- help="Path to hf config.json of model to convert",
- )
args = parser.parse_args()
- convert_sam_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub, args.model_hub_id)
+ if "slimsam" in args.model_name:
+ checkpoint_path = args.checkpoint_path
+ if checkpoint_path is None:
+ raise ValueError("You need to provide a checkpoint path for SlimSAM models.")
+ else:
+ checkpoint_path = hf_hub_download("ybelkada/segment-anything", f"checkpoints/{args.model_name}.pth")
+
+ convert_sam_checkpoint(args.model_name, checkpoint_path, args.pytorch_dump_folder_path, args.push_to_hub)
diff --git a/utils/not_doctested.txt b/utils/not_doctested.txt
index daf47b1cb1caec..3e4c78cd9c4e74 100644
--- a/utils/not_doctested.txt
+++ b/utils/not_doctested.txt
@@ -784,7 +784,7 @@ src/transformers/models/rwkv/configuration_rwkv.py
src/transformers/models/rwkv/convert_rwkv_checkpoint_to_hf.py
src/transformers/models/rwkv/modeling_rwkv.py
src/transformers/models/sam/configuration_sam.py
-src/transformers/models/sam/convert_sam_original_to_hf_format.py
+src/transformers/models/sam/convert_sam_to_hf.py
src/transformers/models/sam/image_processing_sam.py
src/transformers/models/sam/modeling_sam.py
src/transformers/models/sam/modeling_tf_sam.py
From 81220cba61d469879f460925b237405211b0cc55 Mon Sep 17 00:00:00 2001
From: "Sean (Seok-Won) Yi"
Date: Mon, 4 Mar 2024 19:53:58 +0900
Subject: [PATCH 070/549] Fix the previous tracking URI setting logic to
 prevent clashes with original MLflow code. (#29096)
* Changed logic for setting the tracking URI.
The previous code was calling the `mlflow.set_tracking_uri` function
regardless of whether or not the environment variable
`MLFLOW_TRACKING_URI` is even set. This led to clashes with the original
MLflow implementation and therefore the logic was changed to only
calling the function when the environment variable is explicitly set.
* Check if tracking URI has already been set.
The previous code did not consider the possibility that the tracking URI
may already be set elsewhere and was therefore (erroneously) overriding
previously set tracking URIs using the environment variable.
* Removed redundant parentheses.
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fix docstring to reflect library convention properly.
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fix docstring to reflect library convention properly.
"Unset by default" is the correct expression rather than "Default to `None`."
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
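A short sketch of the resulting behaviour, assuming `mlflow` is installed and the MLflow callback is active: the tracking URI is only set when `MLFLOW_TRACKING_URI` is explicitly provided and MLflow has not been configured elsewhere.

```python
import os

# Opt in to a specific tracking server; if this variable is left unset, the callback
# no longer overrides whatever tracking URI MLflow already has configured.
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"  # placeholder server

from transformers import TrainingArguments

args = TrainingArguments(output_dir="out", report_to=["mlflow"])
# When training starts, MLflowCallback.setup() calls mlflow.set_tracking_uri(...) only
# because the variable above is set and no tracking URI was configured beforehand.
```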
---
.../integrations/integration_utils.py | 23 +++++++++++--------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/src/transformers/integrations/integration_utils.py b/src/transformers/integrations/integration_utils.py
index 9367256c870058..05c864fb4be3d8 100644
--- a/src/transformers/integrations/integration_utils.py
+++ b/src/transformers/integrations/integration_utils.py
@@ -960,9 +960,9 @@ def setup(self, args, state, model):
remote server, e.g. s3 or GCS. If set to `True` or *1*, will copy each saved checkpoint on each save in
[`TrainingArguments`]'s `output_dir` to the local or remote artifact storage. Using it without a remote
storage will just copy the files to your artifact location.
- - **MLFLOW_TRACKING_URI** (`str`, *optional*, defaults to `""`):
- Whether to store runs at a specific path or remote server. Default to an empty string which will store runs
- at `./mlruns` locally.
+ - **MLFLOW_TRACKING_URI** (`str`, *optional*):
+ Whether to store runs at a specific path or remote server. Unset by default, which skips setting the
+ tracking URI entirely.
- **MLFLOW_EXPERIMENT_NAME** (`str`, *optional*, defaults to `None`):
Whether to use an MLflow experiment_name under which to launch the run. Default to `None` which will point
to the `Default` experiment in MLflow. Otherwise, it is a case sensitive name of the experiment to be
@@ -982,7 +982,7 @@ def setup(self, args, state, model):
"""
self._log_artifacts = os.getenv("HF_MLFLOW_LOG_ARTIFACTS", "FALSE").upper() in ENV_VARS_TRUE_VALUES
self._nested_run = os.getenv("MLFLOW_NESTED_RUN", "FALSE").upper() in ENV_VARS_TRUE_VALUES
- self._tracking_uri = os.getenv("MLFLOW_TRACKING_URI", "")
+ self._tracking_uri = os.getenv("MLFLOW_TRACKING_URI", None)
self._experiment_name = os.getenv("MLFLOW_EXPERIMENT_NAME", None)
self._flatten_params = os.getenv("MLFLOW_FLATTEN_PARAMS", "FALSE").upper() in ENV_VARS_TRUE_VALUES
self._run_id = os.getenv("MLFLOW_RUN_ID", None)
@@ -997,12 +997,17 @@ def setup(self, args, state, model):
f" tags={self._nested_run}, tracking_uri={self._tracking_uri}"
)
if state.is_world_process_zero:
- self._ml_flow.set_tracking_uri(self._tracking_uri)
-
- if self._tracking_uri == "":
- logger.debug(f"MLflow tracking URI is not set. Runs will be stored at {os.path.realpath('./mlruns')}")
+ if not self._ml_flow.is_tracking_uri_set():
+ if self._tracking_uri:
+ self._ml_flow.set_tracking_uri(self._tracking_uri)
+ logger.debug(f"MLflow tracking URI is set to {self._tracking_uri}")
+ else:
+ logger.debug(
+ "Environment variable `MLFLOW_TRACKING_URI` is not provided and therefore will not be"
+ " explicitly set."
+ )
else:
- logger.debug(f"MLflow tracking URI is set to {self._tracking_uri}")
+ logger.debug(f"MLflow tracking URI is set to {self._ml_flow.get_tracking_uri()}")
if self._ml_flow.active_run() is None or self._nested_run or self._run_id:
if self._experiment_name:
From 8ef98628646d2e23b70a2052f96bf1e7b5f9c04a Mon Sep 17 00:00:00 2001
From: Nick DeGroot
Date: Mon, 4 Mar 2024 03:04:49 -0800
Subject: [PATCH 071/549] Fix OneFormer `post_process_instance_segmentation`
for panoptic tasks (#29304)
* :bug: Fix oneformer instance post processing when using panoptic task type
* :white_check_mark: Add unit test for oneformer instance post processing panoptic bug
---------
Co-authored-by: Nick DeGroot <1966472+nickthegroot@users.noreply.github.com>
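A hedged end-to-end sketch mirroring the new test: instance post-processing with `task_type="panoptic"` now works without the inner loop clobbering the batch index. The checkpoint and image URL are assumptions made for illustration; any OneFormer checkpoint with `thing_ids` metadata should behave the same.

```python
import requests
import torch
from PIL import Image
from transformers import OneFormerForUniversalSegmentation, OneFormerProcessor

checkpoint = "shi-labs/oneformer_ade20k_swin_tiny"  # assumed checkpoint for illustration
processor = OneFormerProcessor.from_pretrained(checkpoint)
model = OneFormerForUniversalSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The fixed code path: keep only "thing" classes when post-processing a panoptic task as instances.
results = processor.image_processor.post_process_instance_segmentation(
    outputs, task_type="panoptic", target_sizes=[image.size[::-1]]
)
print(results[0]["segments_info"][:3])
```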
---
.../models/oneformer/image_processing_oneformer.py | 8 ++++----
.../oneformer/test_image_processing_oneformer.py | 13 +++++++++++++
2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/src/transformers/models/oneformer/image_processing_oneformer.py b/src/transformers/models/oneformer/image_processing_oneformer.py
index d9b0c0168682ab..9f865f8efd9b94 100644
--- a/src/transformers/models/oneformer/image_processing_oneformer.py
+++ b/src/transformers/models/oneformer/image_processing_oneformer.py
@@ -1244,8 +1244,8 @@ def post_process_instance_segmentation(
# if this is panoptic segmentation, we only keep the "thing" classes
if task_type == "panoptic":
keep = torch.zeros_like(scores_per_image).bool()
- for i, lab in enumerate(labels_per_image):
- keep[i] = lab in self.metadata["thing_ids"]
+ for j, lab in enumerate(labels_per_image):
+ keep[j] = lab in self.metadata["thing_ids"]
scores_per_image = scores_per_image[keep]
labels_per_image = labels_per_image[keep]
@@ -1258,8 +1258,8 @@ def post_process_instance_segmentation(
continue
if "ade20k" in self.class_info_file and not is_demo and "instance" in task_type:
- for i in range(labels_per_image.shape[0]):
- labels_per_image[i] = self.metadata["thing_ids"].index(labels_per_image[i].item())
+ for j in range(labels_per_image.shape[0]):
+ labels_per_image[j] = self.metadata["thing_ids"].index(labels_per_image[j].item())
# Get segmentation map and segment information of batch item
target_size = target_sizes[i] if target_sizes is not None else None
diff --git a/tests/models/oneformer/test_image_processing_oneformer.py b/tests/models/oneformer/test_image_processing_oneformer.py
index 4a9e560463adf0..abec659a8bfc87 100644
--- a/tests/models/oneformer/test_image_processing_oneformer.py
+++ b/tests/models/oneformer/test_image_processing_oneformer.py
@@ -295,6 +295,19 @@ def test_post_process_instance_segmentation(self):
el["segmentation"].shape, (self.image_processor_tester.height, self.image_processor_tester.width)
)
+ segmentation_with_opts = image_processor.post_process_instance_segmentation(
+ outputs,
+ threshold=0,
+ target_sizes=[(1, 4) for _ in range(self.image_processor_tester.batch_size)],
+ task_type="panoptic",
+ )
+ self.assertTrue(len(segmentation_with_opts) == self.image_processor_tester.batch_size)
+ for el in segmentation_with_opts:
+ self.assertTrue("segmentation" in el)
+ self.assertTrue("segments_info" in el)
+ self.assertEqual(type(el["segments_info"]), list)
+ self.assertEqual(el["segmentation"].shape, (1, 4))
+
def test_post_process_panoptic_segmentation(self):
image_processor = self.image_processing_class(
num_labels=self.image_processor_tester.num_classes,
From 1681a6d452b60ff3652a96f03541dfa491124192 Mon Sep 17 00:00:00 2001
From: Zach Mueller
Date: Mon, 4 Mar 2024 06:17:42 -0500
Subject: [PATCH 072/549] 🚨 Fully revert atomic checkpointing 🚨 (#29370)
Fully revert atomic checkpointing
---
src/transformers/trainer.py | 53 +++++------------------
tests/trainer/test_trainer.py | 16 +------
tests/trainer/test_trainer_distributed.py | 15 -------
3 files changed, 12 insertions(+), 72 deletions(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index efbe7bea171af5..5f192bf6ef10f0 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -2491,21 +2491,13 @@ def _save_checkpoint(self, model, trial, metrics=None):
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
- if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
- logger.warning(
- f"Checkpoint destination directory {output_dir} already exists and is non-empty. "
- "Saving will proceed but saved results may be invalid."
- )
- staging_output_dir = output_dir
- else:
- staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
- self.save_model(staging_output_dir, _internal_call=True)
+ self.save_model(output_dir, _internal_call=True)
if not self.args.save_only_model:
# Save optimizer and scheduler
- self._save_optimizer_and_scheduler(staging_output_dir)
+ self._save_optimizer_and_scheduler(output_dir)
# Save RNG state
- self._save_rng_state(staging_output_dir)
+ self._save_rng_state(output_dir)
# Determine the new best metric / best model checkpoint
if metrics is not None and self.args.metric_for_best_model is not None:
@@ -2525,39 +2517,16 @@ def _save_checkpoint(self, model, trial, metrics=None):
# Save the Trainer state
if self.args.should_save:
- self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
+ self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))
if self.args.push_to_hub:
- self._push_from_checkpoint(staging_output_dir)
-
- # Place checkpoint in final location after all saving is finished.
- # First wait for everyone to finish writing
- self.args.distributed_state.wait_for_everyone()
-
- # Then go through the rewriting process, only renaming and rotating from main process(es)
- if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
- if staging_output_dir != output_dir:
- if os.path.exists(staging_output_dir):
- os.rename(staging_output_dir, output_dir)
-
- # Ensure rename completed in cases where os.rename is not atomic
- # And can only happen on non-windows based systems
- if os.name != "nt":
- fd = os.open(output_dir, os.O_RDONLY)
- os.fsync(fd)
- os.close(fd)
-
- # Maybe delete some older checkpoints.
- if self.args.should_save:
- # Solely rely on numerical checkpoint id for rotation.
- # mtime is not reliable especially on some fuse fs in cloud environments.
- self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
- elif self.is_local_process_zero():
- # Clean up the remaining staging checkpoint folders on other nodes
- if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
- shutil.rmtree(staging_output_dir)
-
- self.args.distributed_state.wait_for_everyone()
+ self._push_from_checkpoint(output_dir)
+
+ # Maybe delete some older checkpoints.
+ if self.args.should_save:
+ # Solely rely on numerical checkpoint id for rotation.
+ # mtime is not reliable especially on some fuse fs in cloud environments.
+ self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
def _save_rng_state(self, output_dir):
# Save RNG state in non-distributed training
diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py
index 1ebbe1ca7a86eb..98f3c96b4ea890 100644
--- a/tests/trainer/test_trainer.py
+++ b/tests/trainer/test_trainer.py
@@ -84,8 +84,7 @@
slow,
torch_device,
)
-from transformers.tokenization_utils_base import PreTrainedTokenizerBase
-from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, HPSearchBackend, get_last_checkpoint
+from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, HPSearchBackend
from transformers.training_args import OptimizerNames
from transformers.utils import (
SAFE_WEIGHTS_INDEX_NAME,
@@ -1406,19 +1405,6 @@ def test_save_checkpoints(self):
trainer.train()
self.check_saved_checkpoints(tmpdir, 5, int(self.n_epochs * 64 / self.batch_size), False)
- def test_save_checkpoints_is_atomic(self):
- class UnsaveableTokenizer(PreTrainedTokenizerBase):
- def save_pretrained(self, *args, **kwargs):
- raise OSError("simulated file write error")
-
- with tempfile.TemporaryDirectory() as tmpdir:
- trainer = get_regression_trainer(output_dir=tmpdir, save_steps=5)
- # Attach unsaveable tokenizer to partially fail checkpointing
- trainer.tokenizer = UnsaveableTokenizer()
- with self.assertRaises(OSError) as _context:
- trainer.train()
- assert get_last_checkpoint(tmpdir) is None
-
@require_safetensors
def test_safe_checkpoints(self):
for save_safetensors in [True, False]:
diff --git a/tests/trainer/test_trainer_distributed.py b/tests/trainer/test_trainer_distributed.py
index 2850d6c40b4e1c..8f867cf0beba37 100644
--- a/tests/trainer/test_trainer_distributed.py
+++ b/tests/trainer/test_trainer_distributed.py
@@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-from pathlib import Path
from typing import Dict
import numpy as np
@@ -237,20 +236,6 @@ def compute_metrics(p: EvalPrediction) -> Dict:
trainer.args.eval_accumulation_steps = None
- # Check that saving does indeed work with temp dir rotation
- # If this fails, will see a FileNotFoundError
- model = RegressionModel()
- training_args.max_steps = 1
- opt = torch.optim.Adam(model.parameters(), lr=1e-3)
- sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda x: 1)
- trainer = Trainer(
- model, training_args, optimizers=(opt, sched), data_collator=DummyDataCollator(), eval_dataset=dataset
- )
- trainer._save_checkpoint(model=None, trial=None)
- # Check that the temp folder does not exist
- assert not (Path(training_args.output_dir) / "tmp-checkpoint-0").exists()
- assert (Path(training_args.output_dir) / "checkpoint-0").exists()
-
# Check that `dispatch_batches=False` will work on a finite iterable dataset
train_dataset = FiniteIterableDataset(label_names=["labels", "extra"], length=1)
From 7941769e557c850c8f599146a1371cf429ec0707 Mon Sep 17 00:00:00 2001
From: Sven Schultze
Date: Mon, 4 Mar 2024 14:12:35 +0100
Subject: [PATCH 073/549] Fix grad_norm unserializable tensor log failure
(#29212)
* Fix grad_norm unserializable tensor log failure
* Fix origin of grad_norm logs to be in deepspeed get_global_grad_norm()
---
src/transformers/trainer.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 5f192bf6ef10f0..99792019846210 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -2011,7 +2011,7 @@ def _inner_training_loop(
is_accelerate_available()
and self.accelerator.distributed_type == DistributedType.DEEPSPEED
):
- grad_norm = model.get_global_grad_norm()
+ grad_norm = model.get_global_grad_norm().item()
else:
grad_norm = _grad_norm.item() if _grad_norm is not None else None
From bcd23a54f12a29eb6c3c6541935d4b12de17a6fc Mon Sep 17 00:00:00 2001
From: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>
Date: Mon, 4 Mar 2024 13:24:40 +0000
Subject: [PATCH 074/549] Avoid edge case in audio utils (#28836)
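A short sketch of the guarded edge case on a synthetic waveform: asking for mel filtering (`mel_filters` given) while also requesting the complex spectrogram (`power=None`) now fails fast with a clear `ValueError`.

```python
import numpy as np
from transformers.audio_utils import mel_filter_bank, spectrogram, window_function

waveform = np.zeros(16000, dtype=np.float32)  # 1 second of silence at 16 kHz
window = window_function(400, "hann")
mel_filters = mel_filter_bank(
    num_frequency_bins=201,  # fft_length // 2 + 1 for the default fft_length of 400
    num_mel_filters=80,
    min_frequency=0.0,
    max_frequency=8000.0,
    sampling_rate=16000,
)

# power=None means "return the complex STFT", which cannot be combined with mel_filters.
try:
    spectrogram(waveform, window, frame_length=400, hop_length=160, power=None, mel_filters=mel_filters)
except ValueError as err:
    print(err)
```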
---
src/transformers/audio_utils.py | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/src/transformers/audio_utils.py b/src/transformers/audio_utils.py
index 5819f0723fb658..a76e671712f40d 100644
--- a/src/transformers/audio_utils.py
+++ b/src/transformers/audio_utils.py
@@ -412,6 +412,12 @@ def spectrogram(
if np.iscomplexobj(waveform):
raise ValueError("Complex-valued input waveforms are not currently supported")
+ if power is None and mel_filters is not None:
+ raise ValueError(
+ "You have provided `mel_filters` but `power` is `None`. Mel spectrogram computation is not yet supported for complex-valued spectrogram."
+ "Specify `power` to fix this issue."
+ )
+
# center pad the waveform
if center:
padding = [(int(frame_length // 2), int(frame_length // 2))]
From ed74d97871468f3a4695ede50abdc0b55717a84d Mon Sep 17 00:00:00 2001
From: Donggeun Yu
Date: Mon, 4 Mar 2024 23:18:09 +0900
Subject: [PATCH 075/549] DeformableDETR support bfloat16 (#29232)
* Update ms_deform_attn_cuda.cu
* Update ms_deform_attn_cuda.cuh
* Update modeling_deformable_detr.py
* Update src/transformers/models/deformable_detr/modeling_deformable_detr.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update modeling_deformable_detr.py
* python utils/check_copies.py --fix_and_overwrite
* Fix dtype mismatch error
* Update test_modeling_deformable_detr.py
* Update test_modeling_deformable_detr.py
* Update modeling_deformable_detr.py
* Update modeling_deformable_detr.py
* Support DeformableDETR with bfloat16
* Add test code
* Use AT_DISPATCH_FLOATING_TYPES_AND2
Use AT_DISPATCH_FLOATING_TYPES_AND2
* Update tests/models/deformable_detr/test_modeling_deformable_detr.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/deformable_detr/test_modeling_deformable_detr.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fix not found require_torch_bf16 function
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
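A hedged usage sketch of the new dtype support (requires a CUDA device with bfloat16 support; the random pixel values are illustrative only):

```python
import torch
from transformers import DeformableDetrForObjectDetection

model = DeformableDetrForObjectDetection.from_pretrained(
    "SenseTime/deformable-detr", torch_dtype=torch.bfloat16
).to("cuda")
model.eval()

pixel_values = torch.randn(1, 3, 800, 800, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
print(outputs.logits.dtype)  # torch.bfloat16
```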
---
.../deformable_detr/cuda/ms_deform_attn_cuda.cu | 4 ++--
.../cuda/ms_deform_attn_cuda.cuh | 4 ++--
.../deformable_detr/cuda/ms_deform_attn_cuda.h | 17 +++++++++++++++++
.../deformable_detr/modeling_deformable_detr.py | 1 -
.../test_modeling_deformable_detr.py | 13 +++++++++++++
5 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cu b/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cu
index e8e265219cc38d..a9bf01d56ac4c6 100644
--- a/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cu
+++ b/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cu
@@ -64,7 +64,7 @@ at::Tensor ms_deform_attn_cuda_forward(
for (int n = 0; n < batch/im2col_step_; ++n)
{
auto columns = output_n.select(0, n);
- AT_DISPATCH_FLOATING_TYPES_AND_HALF(value.type(), "ms_deform_attn_forward_cuda", ([&] {
+ AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, value.type(), "ms_deform_attn_forward_cuda", ([&] {
ms_deformable_im2col_cuda(at::cuda::getCurrentCUDAStream(),
value.data<scalar_t>() + n * im2col_step_ * per_value_size,
spatial_shapes.data<int64_t>(),
@@ -134,7 +134,7 @@ std::vector<at::Tensor> ms_deform_attn_cuda_backward(
for (int n = 0; n < batch/im2col_step_; ++n)
{
auto grad_output_g = grad_output_n.select(0, n);
- AT_DISPATCH_FLOATING_TYPES_AND_HALF(value.type(), "ms_deform_attn_backward_cuda", ([&] {
+ AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, value.type(), "ms_deform_attn_backward_cuda", ([&] {
ms_deformable_col2im_cuda(at::cuda::getCurrentCUDAStream(),
grad_output_g.data<scalar_t>(),
value.data<scalar_t>() + n * im2col_step_ * per_value_size,
diff --git a/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cuh b/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cuh
index 5bde73a5a96b8b..95385869659b92 100644
--- a/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cuh
+++ b/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cuh
@@ -72,7 +72,7 @@ at::Tensor ms_deform_attn_cuda_forward(
for (int n = 0; n < batch/im2col_step_; ++n)
{
auto columns = output_n.select(0, n);
- AT_DISPATCH_FLOATING_TYPES_AND_HALF(value.type(), "ms_deform_attn_forward_cuda", ([&] {
+ AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, value.type(), "ms_deform_attn_forward_cuda", ([&] {
ms_deformable_im2col_cuda(at::cuda::getCurrentCUDAStream(),
value.data<scalar_t>() + n * im2col_step_ * per_value_size,
spatial_shapes.data<int64_t>(),
@@ -142,7 +142,7 @@ std::vector<at::Tensor> ms_deform_attn_cuda_backward(
for (int n = 0; n < batch/im2col_step_; ++n)
{
auto grad_output_g = grad_output_n.select(0, n);
- AT_DISPATCH_FLOATING_TYPES_AND_HALF(value.type(), "ms_deform_attn_backward_cuda", ([&] {
+ AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, value.type(), "ms_deform_attn_backward_cuda", ([&] {
ms_deformable_col2im_cuda(at::cuda::getCurrentCUDAStream(),
grad_output_g.data<scalar_t>(),
value.data<scalar_t>() + n * im2col_step_ * per_value_size,
diff --git a/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.h b/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.h
index fbcf4543e66bb1..d8c21b4e54dcd7 100644
--- a/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.h
+++ b/src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.h
@@ -19,6 +19,14 @@ at::Tensor ms_deform_attn_cuda_forward(
const at::Tensor &attn_weight,
const int im2col_step);
+at::Tensor ms_deform_attn_cuda_forward_bf16(
+ const at::Tensor &value,
+ const at::Tensor &spatial_shapes,
+ const at::Tensor &level_start_index,
+ const at::Tensor &sampling_loc,
+ const at::Tensor &attn_weight,
+ const int im2col_step);
+
std::vector<at::Tensor> ms_deform_attn_cuda_backward(
const at::Tensor &value,
const at::Tensor &spatial_shapes,
@@ -27,3 +35,12 @@ std::vector<at::Tensor> ms_deform_attn_cuda_backward(
const at::Tensor &attn_weight,
const at::Tensor &grad_output,
const int im2col_step);
+
+std::vector<at::Tensor> ms_deform_attn_cuda_backward_bf16(
+ const at::Tensor &value,
+ const at::Tensor &spatial_shapes,
+ const at::Tensor &level_start_index,
+ const at::Tensor &sampling_loc,
+ const at::Tensor &attn_weight,
+ const at::Tensor &grad_output,
+ const int im2col_step);
diff --git a/src/transformers/models/deformable_detr/modeling_deformable_detr.py b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
index 1b6222c4cfc413..4c122832ff2027 100755
--- a/src/transformers/models/deformable_detr/modeling_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
@@ -1758,7 +1758,6 @@ def forward(
spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=source_flatten.device)
level_start_index = torch.cat((spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1]))
valid_ratios = torch.stack([self.get_valid_ratio(m, dtype=source_flatten.dtype) for m in masks], 1)
- valid_ratios = valid_ratios.float()
# Fourth, sent source_flatten + mask_flatten + lvl_pos_embed_flatten (backbone + proj layer output) through encoder
# Also provide spatial_shapes, level_start_index and valid_ratios
diff --git a/tests/models/deformable_detr/test_modeling_deformable_detr.py b/tests/models/deformable_detr/test_modeling_deformable_detr.py
index 5b123884e9cc53..7a83c4f1ed80a8 100644
--- a/tests/models/deformable_detr/test_modeling_deformable_detr.py
+++ b/tests/models/deformable_detr/test_modeling_deformable_detr.py
@@ -26,6 +26,7 @@
require_timm,
require_torch,
require_torch_accelerator,
+ require_torch_bf16,
require_vision,
slow,
torch_device,
@@ -591,6 +592,18 @@ def create_and_check_model_fp16_forward(self):
output = model(**inputs)["last_hidden_state"]
self.parent.assertFalse(torch.isnan(output).any().item())
+ @require_torch_bf16
+ def create_and_check_model_bf16_forward(self):
+ model_class = DeformableDetrForObjectDetection
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ model = model_class(config, torch_dtype=torch.bfloat16)
+ model.to(torch_device)
+ model.eval()
+ inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
+ output = model(**inputs)["last_hidden_state"]
+ self.parent.assertFalse(torch.isnan(output).any().item())
+
TOLERANCE = 1e-4
From 836921fdeb498820b71dcc7b70e990e828f4c6bc Mon Sep 17 00:00:00 2001
From: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Date: Mon, 4 Mar 2024 18:49:02 +0100
Subject: [PATCH 076/549] Add UDOP (#22940)
* First draft
* More improvements
* More improvements
* More fixes
* Fix copies
* More improvements
* More fixes
* More improvements
* Convert checkpoint
* More improvements, set up tests
* Fix more tests
* Add UdopModel
* More improvements
* Fix equivalence test
* More fixes
* Redesign model
* Extend conversion script
* Use real inputs for conversion script
* Add image processor
* Improve conversion script
* Add UdopTokenizer
* Add fast tokenizer
* Add converter
* Update README's
* Add processor
* Add fully fledged tokenizer
* Add fast tokenizer
* Use processor in conversion script
* Add tokenizer tests
* Fix one more test
* Fix more tests
* Fix tokenizer tests
* Enable fast tokenizer tests
* Fix more tests
* Fix additional_special_tokens of fast tokenizer
* Fix tokenizer tests
* Fix more tests
* Fix equivalence test
* Rename image to pixel_values
* Rename seg_data to bbox
* More renamings
* Remove vis_special_token
* More improvements
* Add docs
* Fix copied from
* Update slow tokenizer
* Update fast tokenizer design
* Make text input optional
* Add first draft of processor tests
* Fix more processor tests
* Fix decoder_start_token_id
* Fix test_initialization
* Add integration test
* More improvements
* Improve processor, add test
* Add more copied from
* Add more copied from
* Add more copied from
* Add more copied from
* Remove print statement
* Update README and auto mapping
* Delete files
* Delete another file
* Remove code
* Fix test
* Fix docs
* Remove asserts
* Add doc tests
* Include UDOP in exotic model tests
* Add expected tesseract decodings
* Add sentencepiece
* Use same design as T5
* Add UdopEncoderModel
* Add UdopEncoderModel to tests
* More fixes
* Fix fast tokenizer
* Fix one more test
* Remove parallelisable attribute
* Fix copies
* Remove legacy file
* Copy from T5Tokenizer
* Fix rebase
* More fixes, copy from T5
* More fixes
* Fix init
* Use ArthurZ/udop for tests
* Make all model tests pass
* Remove UdopForConditionalGeneration from auto mapping
* Fix more tests
* fixups
* more fixups
* fix the tokenizers
* remove un-necessary changes
* nits
* nits
* replace truncate_sequences_boxes with truncate_sequences for fix-copies
* nit current path
* add a test for input ids
* ids that we should get taken from c9f7a32f57440d90ff79890270d376a1cc0acb68
* nits converting
* nits
* apply ruff
* nits
* nits
* style
* fix slow order of addition
* fix udop fast range as well
* fixup
* nits
* Add docstrings
* Fix gradient checkpointing
* Update code examples
* Skip tests
* Update integration test
* Address comment
* Make fixup
* Remove extra ids from tokenizer
* Skip test
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update year
* Address comment
* Address more comments
* Address comments
* Add copied from
* Update CI
* Rename script
* Update model id
* Add AddedToken, skip tests
* Update CI
* Fix doc tests
* Do not use Tesseract for the doc tests
* Remove kwargs
* Add original inputs
* Update casting
* Fix doc test
* Update question
* Update question
* Use LayoutLMv3ImageProcessor
* Update organization
* Improve docs
* Update forward signature
* Make images optional
* Remove deprecated device argument
* Add comment, add add_prefix_space
* More improvements
* Remove kwargs
---------
Co-authored-by: ArthurZucker
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
.circleci/create_circleci_config.py | 2 +
README.md | 1 +
README_es.md | 1 +
README_fr.md | 1 +
README_hd.md | 1 +
README_ja.md | 1 +
README_ko.md | 1 +
README_zh-hans.md | 1 +
README_zh-hant.md | 1 +
docs/source/en/_toctree.yml | 2 +
docs/source/en/index.md | 1 +
docs/source/en/model_doc/udop.md | 102 +
src/transformers/__init__.py | 26 +
src/transformers/convert_slow_tokenizer.py | 12 +
src/transformers/models/__init__.py | 1 +
.../models/auto/configuration_auto.py | 3 +
.../models/auto/image_processing_auto.py | 1 +
src/transformers/models/auto/modeling_auto.py | 1 +
.../models/auto/tokenization_auto.py | 7 +
src/transformers/models/udop/__init__.py | 98 +
.../models/udop/configuration_udop.py | 162 ++
.../models/udop/convert_udop_to_hf.py | 213 ++
src/transformers/models/udop/modeling_udop.py | 2030 +++++++++++++++++
.../models/udop/processing_udop.py | 204 ++
.../models/udop/tokenization_udop.py | 1483 ++++++++++++
.../models/udop/tokenization_udop_fast.py | 1012 ++++++++
src/transformers/utils/dummy_pt_objects.py | 31 +
.../utils/dummy_sentencepiece_objects.py | 7 +
.../utils/dummy_tokenizers_objects.py | 7 +
tests/models/udop/__init__.py | 0
tests/models/udop/test_modeling_udop.py | 567 +++++
tests/models/udop/test_processor_udop.py | 508 +++++
tests/models/udop/test_tokenization_udop.py | 1886 +++++++++++++++
utils/check_config_attributes.py | 2 +
utils/check_repo.py | 2 +
35 files changed, 8378 insertions(+)
create mode 100644 docs/source/en/model_doc/udop.md
create mode 100644 src/transformers/models/udop/__init__.py
create mode 100644 src/transformers/models/udop/configuration_udop.py
create mode 100644 src/transformers/models/udop/convert_udop_to_hf.py
create mode 100644 src/transformers/models/udop/modeling_udop.py
create mode 100644 src/transformers/models/udop/processing_udop.py
create mode 100644 src/transformers/models/udop/tokenization_udop.py
create mode 100644 src/transformers/models/udop/tokenization_udop_fast.py
create mode 100644 tests/models/udop/__init__.py
create mode 100644 tests/models/udop/test_modeling_udop.py
create mode 100644 tests/models/udop/test_processor_udop.py
create mode 100644 tests/models/udop/test_tokenization_udop.py
diff --git a/.circleci/create_circleci_config.py b/.circleci/create_circleci_config.py
index 7f271ff0819f78..45a58737a8ddff 100644
--- a/.circleci/create_circleci_config.py
+++ b/.circleci/create_circleci_config.py
@@ -475,6 +475,7 @@ def job_name(self):
"pip install -U --upgrade-strategy eager 'git+https://github.com/facebookresearch/detectron2.git'",
"sudo apt install tesseract-ocr",
"pip install -U --upgrade-strategy eager pytesseract",
+ "pip install --upgrade-strategy eager sentencepiece",
"pip install -U --upgrade-strategy eager natten==0.15.1+torch210cpu -f https://shi-labs.com/natten/wheels",
"pip install -U --upgrade-strategy eager python-Levenshtein",
"pip install -U --upgrade-strategy eager opencv-python",
@@ -485,6 +486,7 @@ def job_name(self):
"tests/models/*layoutlmv*",
"tests/models/*nat",
"tests/models/deta",
+ "tests/models/udop",
"tests/models/nougat",
],
pytest_num_workers=1,
diff --git a/README.md b/README.md
index 54e228a1150266..30f7cd08a77643 100644
--- a/README.md
+++ b/README.md
@@ -511,6 +511,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (from Microsoft Research) released with the paper [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
diff --git a/README_es.md b/README_es.md
index b3c6845000d2b4..6e808e0e2b1cf1 100644
--- a/README_es.md
+++ b/README_es.md
@@ -484,6 +484,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (from Microsoft Research) released with the paper [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
diff --git a/README_fr.md b/README_fr.md
index 4b87eba5bbe1ba..3bd57830076a5f 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -505,6 +505,7 @@ Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=h
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (de Microsoft), publié dans l'article [TrOCR : Reconnaissance optique de caractères basée sur un transformateur avec des modèles pré-entraînés](https://arxiv.org/abs/2109.10282) par Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (de l'UNC Chapel Hill) a été publié dans l'article [TVLT : Transformer Vision-Language sans texte](https://arxiv.org/abs/2209.14156) par Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (d'Intel) a été publié dans l'article [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) par Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (de Microsoft Research) publié dans l'article [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) par Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (de Google Research) a été publié dans l'article [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) par Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler.
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (de Google Research) a été publié dans l'article [UniMax : Échantillonnage linguistique plus équitable et plus efficace pour l'entraînement préalable multilingue à grande échelle](https://openreview.net/forum?id=kXwdL1cWOAi) par Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (de Microsoft Research) a été publié dans l'article [UniSpeech : Apprentissage unifié de la représentation de la parole avec des données étiquetées et non étiquetées](https://arxiv.org/abs/2101.07597) par Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
diff --git a/README_hd.md b/README_hd.md
index e68d9d39ba6242..0353eb4d8fbda6 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -458,6 +458,7 @@ conda install conda-forge::transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (Microsoft Research से) Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. द्वाराअनुसंधान पत्र [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) के साथ जारी किया गया
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research से) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. द्वाराअनुसंधान पत्र [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) के साथ जारी किया गया
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (माइक्रोसॉफ्ट रिसर्च से) साथ में दिया गया पेपर [UniSpeech: यूनिफाइड स्पीच रिप्रेजेंटेशन लर्निंग विद लेबलेड एंड अनलेबल्ड डेटा](https://arxiv.org/abs/2101.07597) चेंगई वांग, यू वू, याओ कियान, केनिची कुमातानी, शुजी लियू, फुरु वेई, माइकल ज़ेंग, ज़ुएदोंग हुआंग द्वारा।
diff --git a/README_ja.md b/README_ja.md
index d314b07140f504..599865ab5a7d49 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -518,6 +518,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (Microsoft から), Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei から公開された研究論文: [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282)
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill から), Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal から公開された研究論文: [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156)
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (Intel から), Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding から公開された研究論文: [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995)
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (Microsoft Research から) Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. から公開された研究論文 [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623)
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (Google Research から) Yi Tay, Mostafa Dehghani, Vinh Q から公開された研究論文: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research から) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. から公開された研究論文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research から) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang から公開された研究論文: [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)
diff --git a/README_ko.md b/README_ko.md
index f8679087ad1787..e48159c7999339 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -433,6 +433,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (Microsoft 에서) Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei 의 [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) 논문과 함께 발표했습니다.
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill 에서) Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal 의 [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) 논문과 함께 발표했습니다.
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (Intel 에서) Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding 의 [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) 논문과 함께 발표했습니다.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (Microsoft Research 에서 제공)은 Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.의 [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623)논문과 함께 발표했습니다.
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (Google Research 에서) Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzle 의 [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) 논문과 함께 발표했습니다.
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research 에서 제공)은 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.의 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)논문과 함께 발표했습니다.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research 에서) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 의 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index 1832870d52ff24..a9e1997da38c83 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -457,6 +457,7 @@ conda install conda-forge::transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (来自 Microsoft) 伴随论文 [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) 由 Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei 发布。
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (来自 UNC Chapel Hill) 伴随论文 [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) 由 Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal 发布。
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (来自 Intel) 伴随论文 [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) 由 Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding 发布.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (来自 Microsoft Research) 伴随论文 [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) 由 Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal 发布。
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (来自 Google Research) 伴随论文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) 由 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant 发布。
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (来自 Microsoft Research) 伴随论文 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 由 Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index 2bf31890f359d7..2c724f309ef304 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -469,6 +469,7 @@ conda install conda-forge::transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.
1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
+1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (from Microsoft Research) released with the paper [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index ff6e91dbcf25d6..76d8a2ba7d7d75 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -770,6 +770,8 @@
title: TVLT
- local: model_doc/tvp
title: TVP
+ - local: model_doc/udop
+ title: UDOP
- local: model_doc/vilt
title: ViLT
- local: model_doc/vipllava
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index 34995edec39c7d..36216962d2da34 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -279,6 +279,7 @@ Flax), PyTorch, and/or TensorFlow.
| [TrOCR](model_doc/trocr) | ✅ | ❌ | ❌ |
| [TVLT](model_doc/tvlt) | ✅ | ❌ | ❌ |
| [TVP](model_doc/tvp) | ✅ | ❌ | ❌ |
+| [UDOP](model_doc/udop) | ✅ | ❌ | ❌ |
| [UL2](model_doc/ul2) | ✅ | ✅ | ✅ |
| [UMT5](model_doc/umt5) | ✅ | ❌ | ❌ |
| [UniSpeech](model_doc/unispeech) | ✅ | ❌ | ❌ |
diff --git a/docs/source/en/model_doc/udop.md b/docs/source/en/model_doc/udop.md
new file mode 100644
index 00000000000000..b84ec160f705cc
--- /dev/null
+++ b/docs/source/en/model_doc/udop.md
@@ -0,0 +1,102 @@
+
+
+# UDOP
+
+## Overview
+
+The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
+UDOP adopts an encoder-decoder Transformer architecture based on [T5](t5) for document AI tasks like document image classification, document parsing and document visual question answering.
+
+The abstract from the paper is the following:
+
+*We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).*
+
+
+
+ UDOP architecture. Taken from the original paper.
+
+## Usage tips
+
+- In addition to *input_ids*, [`UdopForConditionalGeneration`] also expects the input `bbox`, which are
+ the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
+ as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
+ position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
+ scale. To normalize, you can use the following function:
+
+```python
+def normalize_bbox(bbox, width, height):
+ return [
+ int(1000 * (bbox[0] / width)),
+ int(1000 * (bbox[1] / height)),
+ int(1000 * (bbox[2] / width)),
+ int(1000 * (bbox[3] / height)),
+ ]
+```
+
+Here, `width` and `height` correspond to the width and height of the original document in which the token
+occurs. Those can be obtained using the Python Imaging Library (PIL), for example, as follows:
+
+```python
+from PIL import Image
+
+# Document can be a png, jpg, etc. PDFs must be converted to images.
+image = Image.open(name_of_your_document).convert("RGB")
+
+width, height = image.size
+```
+
+- At inference time, it's recommended to use the `generate` method to autoregressively generate text given a document image (see the sketch below).
+- One can use [`UdopProcessor`] to prepare images and text for the model. By default, this class uses the Tesseract engine to extract a list of words
+and boxes (coordinates) from a given document. Its functionality is equivalent to that of [`LayoutLMv3Processor`]: pass `apply_ocr=False` if you prefer
+to use your own OCR engine, or keep the default `apply_ocr=True` to let Tesseract extract the words and boxes for you.
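+
+The sketch below ties these tips together: it prepares a document image with [`UdopProcessor`] (letting the default OCR run) and answers a question with the `generate` method of [`UdopForConditionalGeneration`]. This is only a minimal sketch: the checkpoint name and file name are illustrative, the question is the one used in the conversion script added in this patch, and the generation settings are not tuned.
+
+```python
+from PIL import Image
+
+from transformers import UdopForConditionalGeneration, UdopProcessor
+
+# any RGB page scan works here; the file name is a placeholder
+image = Image.open("document.png").convert("RGB")
+
+processor = UdopProcessor.from_pretrained("microsoft/udop-large")
+model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")
+
+# apply_ocr=True is the default, so Tesseract extracts the words and boxes for us
+prompt = "Question answering. In which year is the report made?"
+encoding = processor(images=image, text=prompt, return_tensors="pt")
+
+generated_ids = model.generate(**encoding, max_new_tokens=20)
+print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
+```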
+
+This model was contributed by [nielsr](https://huggingface.co/nielsr).
+The original code can be found [here](https://github.com/microsoft/UDOP).
+
+
+## UdopConfig
+
+[[autodoc]] UdopConfig
+
+## UdopTokenizer
+
+[[autodoc]] UdopTokenizer
+ - build_inputs_with_special_tokens
+ - get_special_tokens_mask
+ - create_token_type_ids_from_sequences
+ - save_vocabulary
+
+## UdopTokenizerFast
+
+[[autodoc]] UdopTokenizerFast
+
+## UdopProcessor
+
+[[autodoc]] UdopProcessor
+ - __call__
+
+## UdopModel
+
+[[autodoc]] UdopModel
+ - forward
+
+## UdopForConditionalGeneration
+
+[[autodoc]] UdopForConditionalGeneration
+ - forward
+
+## UdopEncoderModel
+
+[[autodoc]] UdopEncoderModel
+ - forward
\ No newline at end of file
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index 027cf495466c50..6cdd561b41e1ba 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -856,6 +856,11 @@
"TvpConfig",
"TvpProcessor",
],
+ "models.udop": [
+ "UDOP_PRETRAINED_CONFIG_ARCHIVE_MAP",
+ "UdopConfig",
+ "UdopProcessor",
+ ],
"models.umt5": ["UMT5Config"],
"models.unispeech": [
"UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -1135,6 +1140,7 @@
_import_structure["models.speech_to_text"].append("Speech2TextTokenizer")
_import_structure["models.speecht5"].append("SpeechT5Tokenizer")
_import_structure["models.t5"].append("T5Tokenizer")
+ _import_structure["models.udop"].append("UdopTokenizer")
_import_structure["models.xglm"].append("XGLMTokenizer")
_import_structure["models.xlm_prophetnet"].append("XLMProphetNetTokenizer")
_import_structure["models.xlm_roberta"].append("XLMRobertaTokenizer")
@@ -1214,6 +1220,7 @@
_import_structure["models.splinter"].append("SplinterTokenizerFast")
_import_structure["models.squeezebert"].append("SqueezeBertTokenizerFast")
_import_structure["models.t5"].append("T5TokenizerFast")
+ _import_structure["models.udop"].append("UdopTokenizerFast")
_import_structure["models.whisper"].append("WhisperTokenizerFast")
_import_structure["models.xglm"].append("XGLMTokenizerFast")
_import_structure["models.xlm_roberta"].append("XLMRobertaTokenizerFast")
@@ -3411,6 +3418,15 @@
"TvpPreTrainedModel",
]
)
+ _import_structure["models.udop"].extend(
+ [
+ "UDOP_PRETRAINED_MODEL_ARCHIVE_LIST",
+ "UdopEncoderModel",
+ "UdopForConditionalGeneration",
+ "UdopModel",
+ "UdopPreTrainedModel",
+ ],
+ )
_import_structure["models.umt5"].extend(
[
"UMT5EncoderModel",
@@ -5640,6 +5656,7 @@
TvpConfig,
TvpProcessor,
)
+ from .models.udop import UDOP_PRETRAINED_CONFIG_ARCHIVE_MAP, UdopConfig, UdopProcessor
from .models.umt5 import UMT5Config
from .models.unispeech import (
UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -5915,6 +5932,7 @@
from .models.speech_to_text import Speech2TextTokenizer
from .models.speecht5 import SpeechT5Tokenizer
from .models.t5 import T5Tokenizer
+ from .models.udop import UdopTokenizer
from .models.xglm import XGLMTokenizer
from .models.xlm_prophetnet import XLMProphetNetTokenizer
from .models.xlm_roberta import XLMRobertaTokenizer
@@ -5987,6 +6005,7 @@
from .models.splinter import SplinterTokenizerFast
from .models.squeezebert import SqueezeBertTokenizerFast
from .models.t5 import T5TokenizerFast
+ from .models.udop import UdopTokenizerFast
from .models.whisper import WhisperTokenizerFast
from .models.xglm import XGLMTokenizerFast
from .models.xlm_roberta import XLMRobertaTokenizerFast
@@ -7827,6 +7846,13 @@
TvpModel,
TvpPreTrainedModel,
)
+ from .models.udop import (
+ UDOP_PRETRAINED_MODEL_ARCHIVE_LIST,
+ UdopEncoderModel,
+ UdopForConditionalGeneration,
+ UdopModel,
+ UdopPreTrainedModel,
+ )
from .models.umt5 import (
UMT5EncoderModel,
UMT5ForConditionalGeneration,
diff --git a/src/transformers/convert_slow_tokenizer.py b/src/transformers/convert_slow_tokenizer.py
index c44592f8a0f9fb..707bfae89db56f 100644
--- a/src/transformers/convert_slow_tokenizer.py
+++ b/src/transformers/convert_slow_tokenizer.py
@@ -1039,6 +1039,17 @@ def post_processor(self):
)
+class UdopConverter(SpmConverter):
+ def post_processor(self):
+ return processors.TemplateProcessing(
+ single=["$A", "</s>"],
+ pair=["$A", "</s>", "$B", "</s>"],
+ special_tokens=[
+ ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
+ ],
+ )
+
+
class WhisperConverter(Converter):
def converted(self) -> Tokenizer:
vocab = self.original_tokenizer.encoder
@@ -1471,6 +1482,7 @@ def converted(self) -> Tokenizer:
"SeamlessM4TTokenizer": SeamlessM4TConverter,
"SqueezeBertTokenizer": BertConverter,
"T5Tokenizer": T5Converter,
+ "UdopTokenizer": UdopConverter,
"WhisperTokenizer": WhisperConverter,
"XLMRobertaTokenizer": XLMRobertaConverter,
"XLNetTokenizer": XLNetConverter,
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index ebb3db25fb96be..89ca6ab2b8660c 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -220,6 +220,7 @@
trocr,
tvlt,
tvp,
+ udop,
umt5,
unispeech,
unispeech_sat,
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index 7bc637f3e1060a..87ff925e55eaa1 100755
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -231,6 +231,7 @@
("trocr", "TrOCRConfig"),
("tvlt", "TvltConfig"),
("tvp", "TvpConfig"),
+ ("udop", "UdopConfig"),
("umt5", "UMT5Config"),
("unispeech", "UniSpeechConfig"),
("unispeech-sat", "UniSpeechSatConfig"),
@@ -454,6 +455,7 @@
("transfo-xl", "TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("tvlt", "TVLT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("tvp", "TVP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+ ("udop", "UDOP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("unispeech", "UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("unispeech-sat", "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("univnet", "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -715,6 +717,7 @@
("trocr", "TrOCR"),
("tvlt", "TVLT"),
("tvp", "TVP"),
+ ("udop", "UDOP"),
("ul2", "UL2"),
("umt5", "UMT5"),
("unispeech", "UniSpeech"),
diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py
index aef894a425bae1..50e9266cdee161 100644
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -108,6 +108,7 @@
("timesformer", "VideoMAEImageProcessor"),
("tvlt", "TvltImageProcessor"),
("tvp", "TvpImageProcessor"),
+ ("udop", "LayoutLMv3ImageProcessor"),
("upernet", "SegformerImageProcessor"),
("van", "ConvNextImageProcessor"),
("videomae", "VideoMAEImageProcessor"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 05b519d2bcd16b..0d28d224f19106 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -219,6 +219,7 @@
("transfo-xl", "TransfoXLModel"),
("tvlt", "TvltModel"),
("tvp", "TvpModel"),
+ ("udop", "UdopModel"),
("umt5", "UMT5Model"),
("unispeech", "UniSpeechModel"),
("unispeech-sat", "UniSpeechSatModel"),
diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py
index 2c21f1cd529c74..d586068fb9c095 100644
--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -418,6 +418,13 @@
("tapex", ("TapexTokenizer", None)),
("transfo-xl", ("TransfoXLTokenizer", None)),
("tvp", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
+ (
+ "udop",
+ (
+ "UdopTokenizer" if is_sentencepiece_available() else None,
+ "UdopTokenizerFast" if is_tokenizers_available() else None,
+ ),
+ ),
(
"umt5",
(
diff --git a/src/transformers/models/udop/__init__.py b/src/transformers/models/udop/__init__.py
new file mode 100644
index 00000000000000..5066fde6af1d15
--- /dev/null
+++ b/src/transformers/models/udop/__init__.py
@@ -0,0 +1,98 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING
+
+from ...utils import (
+ OptionalDependencyNotAvailable,
+ _LazyModule,
+ is_sentencepiece_available,
+ is_tokenizers_available,
+ is_torch_available,
+)
+
+
+_import_structure = {
+ "configuration_udop": ["UDOP_PRETRAINED_CONFIG_ARCHIVE_MAP", "UdopConfig"],
+ "processing_udop": ["UdopProcessor"],
+}
+
+try:
+ if not is_sentencepiece_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["tokenization_udop"] = ["UdopTokenizer"]
+
+try:
+ if not is_tokenizers_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["tokenization_udop_fast"] = ["UdopTokenizerFast"]
+
+try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["modeling_udop"] = [
+ "UDOP_PRETRAINED_MODEL_ARCHIVE_LIST",
+ "UdopForConditionalGeneration",
+ "UdopPreTrainedModel",
+ "UdopModel",
+ "UdopEncoderModel",
+ ]
+
+if TYPE_CHECKING:
+ from .configuration_udop import UDOP_PRETRAINED_CONFIG_ARCHIVE_MAP, UdopConfig
+ from .processing_udop import UdopProcessor
+
+ try:
+ if not is_sentencepiece_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .tokenization_udop import UdopTokenizer
+
+ try:
+ if not is_tokenizers_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .tokenization_udop_fast import UdopTokenizerFast
+
+ try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .modeling_udop import (
+ UDOP_PRETRAINED_MODEL_ARCHIVE_LIST,
+ UdopEncoderModel,
+ UdopForConditionalGeneration,
+ UdopModel,
+ UdopPreTrainedModel,
+ )
+
+else:
+ import sys
+
+ sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/udop/configuration_udop.py b/src/transformers/models/udop/configuration_udop.py
new file mode 100644
index 00000000000000..8647a7bae29acf
--- /dev/null
+++ b/src/transformers/models/udop/configuration_udop.py
@@ -0,0 +1,162 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" UDOP model configuration"""
+
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+UDOP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+ "microsoft/udop-large": "https://huggingface.co/microsoft/udop-large/resolve/main/config.json",
+}
+
+
+class UdopConfig(PretrainedConfig):
+ r"""
+ This is the configuration class to store the configuration of a [`UdopForConditionalGeneration`]. It is used to
+ instantiate a UDOP model according to the specified arguments, defining the model architecture. Instantiating a
+ configuration with the defaults will yield a similar configuration to that of the UDOP
+ [microsoft/udop-large](https://huggingface.co/microsoft/udop-large) architecture.
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+ Arguments:
+ vocab_size (`int`, *optional*, defaults to 33201):
+ Vocabulary size of the UDOP model. Defines the number of different tokens that can be represented by the
+ `inputs_ids` passed when calling [`UdopForConditionalGeneration`].
+ d_model (`int`, *optional*, defaults to 1024):
+ Size of the encoder layers and the pooler layer.
+ d_kv (`int`, *optional*, defaults to 64):
+ Size of the key, query, value projections per attention head. The `inner_dim` of the projection layer will
+ be defined as `num_heads * d_kv`.
+ d_ff (`int`, *optional*, defaults to 4096):
+ Size of the intermediate feed forward layer in each `UdopBlock`.
+ num_layers (`int`, *optional*, defaults to 24):
+ Number of hidden layers in the Transformer encoder and decoder.
+ num_decoder_layers (`int`, *optional*):
+ Number of hidden layers in the Transformer decoder. Will use the same value as `num_layers` if not set.
+ num_heads (`int`, *optional*, defaults to 16):
+ Number of attention heads for each attention layer in the Transformer encoder and decoder.
+ relative_attention_num_buckets (`int`, *optional*, defaults to 32):
+ The number of buckets to use for each attention layer.
+ relative_attention_max_distance (`int`, *optional*, defaults to 128):
+ The maximum distance of the longer sequences for the bucket separation.
+ relative_bias_args (`List[dict]`, *optional*, defaults to `[{'type': '1d'}, {'type': 'horizontal'}, {'type': 'vertical'}]`):
+ A list of dictionaries containing the arguments for the relative bias layers.
+ dropout_rate (`float`, *optional*, defaults to 0.1):
+ The ratio for all dropout layers.
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-06):
+ The epsilon used by the layer normalization layers.
+ initializer_factor (`float`, *optional*, defaults to 1.0):
+ A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
+ testing).
+ feed_forward_proj (`string`, *optional*, defaults to `"relu"`):
+ Type of feed forward layer to be used. Should be one of `"relu"` or `"gated-gelu"`. Udopv1.1 uses the
+ `"gated-gelu"` feed forward projection. Original Udop uses `"relu"`.
+ is_encoder_decoder (`bool`, *optional*, defaults to `True`):
+ Whether the model should behave as an encoder/decoder or not.
+ use_cache (`bool`, *optional*, defaults to `True`):
+ Whether or not the model should return the last key/values attentions (not used by all models).
+ pad_token_id (`int`, *optional*, defaults to 0):
+ The id of the padding token in the vocabulary.
+ eos_token_id (`int`, *optional*, defaults to 1):
+ The id of the end-of-sequence token in the vocabulary.
+ max_2d_position_embeddings (`int`, *optional*, defaults to 1024):
+ The maximum absolute position embeddings for relative position encoding.
+ image_size (`int`, *optional*, defaults to 224):
+ The size of the input images.
+ patch_size (`int`, *optional*, defaults to 16):
+ The patch size used by the vision encoder.
+ num_channels (`int`, *optional*, defaults to 3):
+ The number of channels in the input images.
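+
+ Example (a minimal usage sketch; per the description above, the defaults yield a configuration similar to [microsoft/udop-large](https://huggingface.co/microsoft/udop-large), and a model built from it has randomly initialized weights):
+
+ ```python
+ >>> from transformers import UdopConfig, UdopModel
+
+ >>> # Initializing a UDOP microsoft/udop-large style configuration
+ >>> configuration = UdopConfig()
+
+ >>> # Initializing a model (with random weights) from that configuration
+ >>> model = UdopModel(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```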
+ """
+
+ model_type = "udop"
+ keys_to_ignore_at_inference = ["past_key_values"]
+ attribute_map = {"hidden_size": "d_model", "num_attention_heads": "num_heads", "num_hidden_layers": "num_layers"}
+
+ def __init__(
+ self,
+ vocab_size=33201,
+ d_model=1024,
+ d_kv=64,
+ d_ff=4096,
+ num_layers=24,
+ num_decoder_layers=None,
+ num_heads=16,
+ relative_attention_num_buckets=32,
+ relative_attention_max_distance=128,
+ relative_bias_args=[{"type": "1d"}, {"type": "horizontal"}, {"type": "vertical"}],
+ dropout_rate=0.1,
+ layer_norm_epsilon=1e-6,
+ initializer_factor=1.0,
+ feed_forward_proj="relu",
+ is_encoder_decoder=True,
+ use_cache=True,
+ pad_token_id=0,
+ eos_token_id=1,
+ max_2d_position_embeddings=1024,
+ image_size=224,
+ patch_size=16,
+ num_channels=3,
+ **kwargs,
+ ):
+ self.vocab_size = vocab_size
+ self.d_model = d_model
+ self.d_kv = d_kv
+ self.d_ff = d_ff
+ self.num_layers = num_layers
+ self.num_decoder_layers = (
+ num_decoder_layers if num_decoder_layers is not None else self.num_layers
+ ) # default = symmetry
+ self.num_heads = num_heads
+ self.relative_attention_num_buckets = relative_attention_num_buckets
+ self.relative_attention_max_distance = relative_attention_max_distance
+ self.dropout_rate = dropout_rate
+ self.layer_norm_epsilon = layer_norm_epsilon
+ self.initializer_factor = initializer_factor
+ self.feed_forward_proj = feed_forward_proj
+ self.use_cache = use_cache
+
+ # UDOP attributes
+ self.max_2d_position_embeddings = max_2d_position_embeddings
+ self.image_size = image_size
+ self.patch_size = patch_size
+ self.num_channels = num_channels
+ if not isinstance(relative_bias_args, list):
+ raise ValueError("`relative_bias_args` should be a list of dictionaries.")
+ self.relative_bias_args = relative_bias_args
+
+ act_info = self.feed_forward_proj.split("-")
+ self.dense_act_fn = act_info[-1]
+ self.is_gated_act = act_info[0] == "gated"
+
+ if len(act_info) > 1 and act_info[0] != "gated" or len(act_info) > 2:
+ raise ValueError(
+ f"`feed_forward_proj`: {feed_forward_proj} is not a valid activation function of the dense layer."
+ "Please make sure `feed_forward_proj` is of the format `gated-{ACT_FN}` or `{ACT_FN}`, e.g. "
+ "'gated-gelu' or 'relu'"
+ )
+
+ super().__init__(
+ pad_token_id=pad_token_id,
+ eos_token_id=eos_token_id,
+ is_encoder_decoder=is_encoder_decoder,
+ **kwargs,
+ )
diff --git a/src/transformers/models/udop/convert_udop_to_hf.py b/src/transformers/models/udop/convert_udop_to_hf.py
new file mode 100644
index 00000000000000..f9cf07f1286bf1
--- /dev/null
+++ b/src/transformers/models/udop/convert_udop_to_hf.py
@@ -0,0 +1,213 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert UDOP checkpoints from the original repository. URL: https://github.com/microsoft/i-Code/tree/main/i-Code-Doc"""
+
+
+import argparse
+
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+from torchvision import transforms as T
+
+from transformers import (
+ LayoutLMv3ImageProcessor,
+ UdopConfig,
+ UdopForConditionalGeneration,
+ UdopProcessor,
+ UdopTokenizer,
+)
+from transformers.image_utils import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
+
+
+def original_transform(image, image_size=224):
+ transform = T.Compose(
+ [
+ T.Resize([image_size, image_size]),
+ T.ToTensor(),
+ T.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
+ ]
+ )
+
+ image = transform(image)
+ return image
+
+
+def get_image():
+ filepath = hf_hub_download(
+ repo_id="hf-internal-testing/fixtures_docvqa", filename="document_2.png", repo_type="dataset"
+ )
+ image = Image.open(filepath).convert("RGB")
+
+ return image
+
+
+def prepare_dummy_inputs(tokenizer, image_processor):
+ prompt = "Question answering. What is the name of the company?"
+ prompt = "Question answering. In which year is the report made?"
+ prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
+
+ image = get_image()
+ # words, boxes = apply_tesseract(image, lang=None)
+ # fmt: off
+ words = ['7', 'ITC', 'Limited', 'REPORT', 'AND', 'ACCOUNTS', '2013', 'ITC’s', 'Brands:', 'An', 'Asset', 'for', 'the', 'Nation', 'The', 'consumer', 'needs', 'and', 'aspirations', 'they', 'fulfil,', 'the', 'benefit', 'they', 'generate', 'for', 'millions', 'across', 'ITC’s', 'value', 'chains,', 'the', 'future-ready', 'capabilities', 'that', 'support', 'them,', 'and', 'the', 'value', 'that', 'they', 'create', 'for', 'the', 'country,', 'have', 'made', 'ITC’s', 'brands', 'national', 'assets,', 'adding', 'to', 'India’s', 'competitiveness.', 'It', 'is', 'ITC’s', 'aspiration', 'to', 'be', 'the', 'No', '1', 'FMCG', 'player', 'in', 'the', 'country,', 'driven', 'by', 'its', 'new', 'FMCG', 'businesses.', 'A', 'recent', 'Nielsen', 'report', 'has', 'highlighted', 'that', "ITC's", 'new', 'FMCG', 'businesses', 'are', 'the', 'fastest', 'growing', 'among', 'the', 'top', 'consumer', 'goods', 'companies', 'operating', 'in', 'India.', 'ITC', 'takes', 'justifiable', 'pride', 'that,', 'along', 'with', 'generating', 'economic', 'value,', 'these', 'celebrated', 'Indian', 'brands', 'also', 'drive', 'the', 'creation', 'of', 'larger', 'societal', 'capital', 'through', 'the', 'virtuous', 'cycle', 'of', 'sustainable', 'and', 'inclusive', 'growth.', 'DI', 'WILLS', '*', ';', 'LOVE', 'DELIGHTFULLY', 'SOFT', 'SKIN?', 'aia', 'Ans', 'Source:', 'https://www.industrydocuments.ucsf.edu/docs/snbx0223']
+ boxes = [[0, 45, 67, 80], [72, 56, 109, 67], [116, 56, 189, 67], [198, 59, 253, 66], [257, 59, 285, 66], [289, 59, 365, 66], [372, 59, 407, 66], [74, 136, 161, 158], [175, 137, 306, 158], [318, 137, 363, 158], [374, 137, 472, 158], [483, 136, 529, 158], [540, 137, 593, 158], [608, 137, 717, 158], [73, 194, 100, 203], [106, 196, 177, 203], [183, 194, 227, 203], [233, 194, 259, 203], [265, 194, 344, 205], [74, 211, 104, 222], [109, 210, 141, 221], [147, 211, 169, 220], [175, 210, 223, 220], [229, 211, 259, 222], [265, 211, 329, 222], [334, 210, 352, 220], [74, 227, 127, 236], [133, 229, 180, 236], [187, 227, 221, 236], [226, 227, 264, 236], [270, 227, 320, 237], [327, 227, 349, 236], [74, 243, 161, 254], [166, 243, 249, 254], [254, 243, 281, 252], [286, 244, 342, 254], [74, 260, 112, 270], [119, 260, 145, 269], [151, 260, 174, 269], [179, 260, 217, 269], [222, 260, 249, 269], [254, 260, 285, 271], [290, 260, 335, 269], [340, 259, 359, 269], [74, 276, 95, 284], [101, 276, 156, 287], [164, 276, 198, 284], [203, 276, 244, 284], [251, 275, 285, 284], [291, 276, 340, 284], [74, 292, 129, 301], [135, 292, 185, 302], [192, 292, 242, 303], [248, 292, 261, 301], [267, 292, 312, 301], [74, 308, 195, 319], [75, 335, 82, 344], [88, 335, 98, 344], [105, 335, 138, 344], [144, 335, 214, 346], [220, 336, 233, 344], [239, 335, 256, 344], [262, 335, 283, 344], [290, 335, 309, 344], [316, 335, 320, 344], [74, 351, 119, 360], [126, 352, 170, 362], [176, 352, 186, 360], [192, 352, 214, 360], [220, 352, 276, 362], [282, 352, 326, 360], [333, 352, 349, 362], [74, 368, 89, 377], [95, 370, 124, 377], [129, 367, 175, 377], [181, 368, 266, 377], [272, 368, 283, 376], [289, 368, 333, 377], [74, 384, 126, 393], [134, 385, 175, 395], [181, 384, 206, 393], [212, 384, 292, 395], [298, 384, 325, 393], [330, 384, 366, 393], [74, 403, 103, 409], [109, 400, 154, 409], [161, 401, 241, 409], [247, 403, 269, 409], [275, 401, 296, 409], [302, 400, 349, 409], [74, 417, 131, 428], [137, 419, 186, 428], [192, 417, 214, 426], [219, 417, 242, 428], [248, 419, 319, 426], [74, 433, 119, 444], [125, 433, 204, 444], [210, 433, 278, 444], [285, 433, 295, 441], [302, 433, 340, 442], [75, 449, 98, 458], [104, 449, 142, 458], [146, 449, 215, 460], [221, 449, 258, 460], [263, 449, 293, 459], [300, 449, 339, 460], [74, 466, 101, 474], [108, 466, 185, 476], [191, 466, 261, 474], [267, 466, 309, 476], [315, 466, 354, 474], [74, 482, 151, 491], [158, 482, 201, 491], [208, 482, 258, 491], [263, 482, 292, 491], [298, 482, 333, 491], [338, 482, 360, 491], [74, 498, 131, 507], [137, 498, 150, 507], [156, 498, 197, 509], [202, 498, 257, 507], [263, 498, 310, 509], [74, 515, 128, 525], [134, 515, 156, 523], [161, 515, 218, 523], [223, 515, 261, 525], [267, 514, 280, 523], [74, 531, 156, 540], [162, 531, 188, 540], [195, 531, 257, 540], [263, 531, 315, 542], [871, 199, 878, 202], [883, 199, 908, 202], [894, 251, 904, 257], [841, 268, 841, 270], [784, 373, 811, 378], [816, 373, 896, 378], [784, 381, 811, 387], [815, 381, 847, 387], [645, 908, 670, 915], [692, 908, 712, 915], [220, 984, 285, 993], [293, 983, 779, 996]]
+ # fmt: on
+ text_list = []
+ bbox_list = []
+ for text, box in zip(words, boxes):
+ if text == "":
+ continue
+ sub_tokens = tokenizer.tokenize(text)
+ for sub_token in sub_tokens:
+ text_list.append(sub_token)
+ bbox_list.append(box)
+
+ input_ids = tokenizer.convert_tokens_to_ids(text_list)
+
+ input_ids = prompt_ids + input_ids
+ bbox = [[0, 0, 0, 0]] * len(prompt_ids) + bbox_list
+
+ pixel_values = image_processor(image, return_tensors="pt").pixel_values
+ original_pixel_values = original_transform(image, image_size=image_processor.size["height"]).unsqueeze(0)
+ # verify pixel values
+ assert torch.allclose(original_pixel_values, pixel_values)
+ print("Pixel values are ok!")
+
+ return torch.tensor(input_ids).unsqueeze(0), torch.tensor(bbox).unsqueeze(0).float(), pixel_values
+
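+# Note (illustrative): prepare_dummy_inputs returns input_ids of shape (1, seq_len), bbox of shape
+# (1, seq_len, 4) and pixel_values of shape (1, 3, height, width), all built from the hard-coded
+# example document above.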
+
+def convert_udop_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_hub=False):
+ # model_name to checkpoint_path
+ name_to_checkpoint_path = {
+ "udop-large": "/Users/nielsrogge/Documents/UDOP/udop-unimodel-large-224/pytorch_model.bin",
+ "udop-large-512": "/Users/nielsrogge/Documents/UDOP/udop-unimodel-large-512/pytorch_model.bin",
+ "udop-large-512-300k": "/Users/nielsrogge/Documents/UDOP/udop-unimodel-large-512-300k-steps/pytorch_model.bin",
+ }
+
+ # load original state dict
+ checkpoint_path = name_to_checkpoint_path[model_name]
+ state_dict = torch.load(checkpoint_path, map_location="cpu")
+
+ print("Checkpoint path:", checkpoint_path)
+
+ # create HF model
+ image_size = 512 if "512" in model_name else 224
+ config = UdopConfig(decoder_start_token_id=0, image_size=image_size)
+ model = UdopForConditionalGeneration(config)
+ model.eval()
+
+ # rename keys
+ state_dict = {k.replace("cell2dembedding", "cell_2d_embedding"): v for k, v in state_dict.items()}
+
+ # load weights
+ missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
+ print("Missing keys:", missing_keys)
+ print("Unexpected keys:", unexpected_keys)
+ assert missing_keys == ["encoder.embed_patches.proj.weight", "encoder.embed_patches.proj.bias"]
+ assert unexpected_keys == ["pos_embed"]
+
+ # prepare dummy inputs
+ tokenizer = UdopTokenizer.from_pretrained("t5-base", legacy=True)
+ size = {"height": image_size, "width": image_size}
+ image_processor = LayoutLMv3ImageProcessor(
+ image_mean=IMAGENET_DEFAULT_MEAN, image_std=IMAGENET_DEFAULT_STD, size=size
+ )
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+ input_ids, bbox, image = prepare_dummy_inputs(tokenizer, image_processor)
+ prompt = "Question answering. In which year is the report made?"
+ encoding = processor(images=get_image(), text=prompt, return_tensors="pt")
+
+ input_ids = encoding.input_ids
+ try:
+ EXPECTED_INPUT_IDS = torch.tensor([[11860, 18243, 5, 86, 84, 215, 19, 8, 934, 263, 58, 1, 489, 27, 3838, 7363, 4083, 14536, 3430, 5686, 5911, 17161, 134, 2038, 27, 3838, 22, 7, 4688, 7, 10, 389, 18202, 21, 8, 11046, 37, 3733, 523, 11, 38, 2388, 1628, 3, 13133, 23334, 6, 8, 1656, 79, 3806, 21, 4040, 640, 27, 3838, 22, 7, 701, 16534, 6, 8, 3, 76, 2693, 18, 23015, 5644, 24, 380, 3, 6015, 6, 11, 8, 701, 24, 79, 482, 21, 3, 88, 684, 6, 43, 263, 27, 3838, 22, 7, 3635, 1157, 4089, 6, 2651, 12, 1547, 22, 7, 3265, 655, 5, 19, 27, 3838, 22, 7, 38, 2388, 257, 12, 36, 8, 465, 209, 13409, 12150, 1959, 16, 8, 684, 6, 6737, 57, 165, 126, 13409, 12150, 1623, 5, 71, 1100, 30298, 934, 65, 12566, 24, 27, 3838, 31, 7, 126, 13409, 12150, 1623, 33, 8, 10391, 1710, 859, 8, 420, 3733, 4968, 688, 2699, 16, 1547, 5, 27, 3838, 1217, 131, 99, 23, 179, 6064, 24, 6, 590, 28, 3, 11600, 1456, 701, 6, 175, 9443, 2557, 3635, 92, 1262, 8, 3409, 13, 2186, 3, 27908, 1784, 190, 8, 3, 5771, 17, 13281, 4005, 13, 5086, 11, 13066, 1170, 5, 10826, 16309, 134, 3, 2, 276, 26, 3, 55, 391, 13570, 5, 10315, 309, 3577, 19114, 371, 4254, 5121, 5055, 6245, 3, 10047, 3162, 58, 3, 9, 61, 1713, 2703, 476, 667, 25158, 301, 6058, 6038, 476, 3765, 9149, 10, 4893, 1303, 1986, 5, 13580, 7, 8224, 28244, 7, 5, 76, 75, 7, 89, 5, 15, 1259, 87, 7171, 7, 87, 7, 29, 115, 226, 4305, 2773, 1]]) # fmt: skip
+ torch.testing.assert_close(EXPECTED_INPUT_IDS, input_ids)
+ bbox = encoding.bbox.float()
+ pixel_values = encoding.pixel_values
+ except Exception:
+ print("Input_ids don't match, preparing dummy inputs")
+ input_ids, bbox, pixel_values = prepare_dummy_inputs(tokenizer, image_processor)
+
+ # Verify single forward pass
+ print("Testing single forward pass..")
+ with torch.no_grad():
+ decoder_input_ids = torch.tensor([[101]])
+ outputs = model(input_ids=input_ids, bbox=bbox, pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
+ print("Shape of logits:", outputs.logits.shape)
+ print("First values of logits:", outputs.logits[0, :3, :3])
+
+ # tensor([[-18.5262, 1.5087, -15.7051]]) on linux
+ # tensor([[-19.4976, 0.8515, -17.1873]]) on mac
+ try:
+ assert torch.allclose(outputs.logits[0, :3, :3], torch.tensor([[-18.5262, 1.5087, -15.7051]]), atol=1e-4)
+ print("Looks ok!")
+ except Exception:
+        print("Logits don't match, let's try to generate")
+
+ # Verify autoregressive decoding
+ print("Testing generation...")
+ model_kwargs = {"bbox": bbox, "pixel_values": pixel_values}
+ outputs = model.generate(input_ids=input_ids, **model_kwargs, max_new_tokens=20)
+
+ print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))
+
+ # autoregressive decoding with original input data
+ print("Testing generation with original inputs...")
+ filepath = hf_hub_download(repo_id="nielsr/test-image", filename="input_ids_udop.pt", repo_type="dataset")
+ input_ids = torch.load(filepath)
+ filepath = hf_hub_download(repo_id="nielsr/test-image", filename="bbox_udop.pt", repo_type="dataset")
+ bbox = torch.load(filepath)
+ pixel_values_filename = "pixel_values_udop_512.pt" if "512" in model_name else "pixel_values_udop_224.pt"
+ filepath = hf_hub_download(repo_id="nielsr/test-image", filename=pixel_values_filename, repo_type="dataset")
+ pixel_values = torch.load(filepath)
+
+ print("Decoded input ids:", tokenizer.decode(input_ids[0], skip_special_tokens=True))
+ print("Bbox shape:", bbox.shape)
+
+ model_kwargs = {"bbox": bbox, "pixel_values": pixel_values}
+ outputs = model.generate(input_ids=input_ids, **model_kwargs, max_new_tokens=20)
+ generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
+ print("Generated:", generated_text)
+
+ if pytorch_dump_folder_path is not None:
+ model.save_pretrained(pytorch_dump_folder_path)
+ tokenizer.save_pretrained(pytorch_dump_folder_path)
+
+ if push_to_hub:
+ model.push_to_hub(f"microsoft/{model_name}")
+ processor.push_to_hub(f"microsoft/{model_name}")
+ # BIG note here: to save the fast tokenizer files in the repo on the hub, you need to do the following:
+ # see https://discuss.huggingface.co/t/convert-slow-xlmrobertatokenizer-to-fast-one/20876
+
+
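+# Illustrative usage (the script path below is a placeholder; the original checkpoints are expected at
+# the local paths listed in name_to_checkpoint_path above):
+#   python <path_to_this_script> --model_name udop-large --pytorch_dump_folder_path ./udop-large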
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ # Required parameters
+ parser.add_argument(
+ "--model_name",
+ default="udop-large",
+ type=str,
+ choices=["udop-large", "udop-large-512", "udop-large-512-300k"],
+ help=("Name of the UDOP model you'd like to convert."),
+ )
+ parser.add_argument(
+ "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
+ )
+ parser.add_argument(
+ "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+ )
+
+ args = parser.parse_args()
+ convert_udop_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
diff --git a/src/transformers/models/udop/modeling_udop.py b/src/transformers/models/udop/modeling_udop.py
new file mode 100644
index 00000000000000..62192eea7f5a5e
--- /dev/null
+++ b/src/transformers/models/udop/modeling_udop.py
@@ -0,0 +1,2030 @@
+# coding=utf-8
+# Copyright 2024 Microsoft Research and HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch UDOP model."""
+
+import collections
+import logging
+import math
+import random
+from abc import ABC, abstractmethod
+from copy import deepcopy
+from dataclasses import dataclass
+from typing import Any, Dict, Optional, Sequence, Tuple, Union
+
+import torch
+from torch import Tensor, nn
+from torch.nn import CrossEntropyLoss
+
+from transformers import UdopConfig
+from transformers.modeling_outputs import (
+ Seq2SeqLMOutput,
+ Seq2SeqModelOutput,
+)
+
+from ...activations import ACT2FN
+from ...modeling_utils import PreTrainedModel
+from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ...utils import (
+ ModelOutput,
+ add_start_docstrings,
+ add_start_docstrings_to_model_forward,
+ replace_return_docstrings,
+)
+
+
+logger = logging.getLogger(__name__)
+
+UDOP_PRETRAINED_MODEL_ARCHIVE_LIST = [
+ "microsoft/udop-large",
+ # See all UDOP models at https://huggingface.co/models?filter=udop
+]
+
+
+_CONFIG_FOR_DOC = "UdopConfig"
+
+
+UDOP_START_DOCSTRING = r"""
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+ etc.)
+
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
+ and behavior.
+
+ Args:
+ config ([`UdopConfig`]): Model configuration class with all the parameters of the model.
+ Initializing with a config file does not load the weights associated with the model, only the
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+UDOP_INPUTS_DOCSTRING = r"""
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input sequence tokens in the vocabulary. UDOP is a model with relative position embeddings so
+ you should be able to pad the inputs on both the right and the left. Indices can be obtained using
+            [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details.
+ [What are input IDs?](../glossary#input-ids)
+
+ attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+ - 1 for tokens that are **not masked**,
+ - 0 for tokens that are **masked**.
+ [What are attention masks?](../glossary#attention-mask)
+
+ bbox (`torch.LongTensor` of shape `({0}, 4)`, *optional*):
+            Bounding boxes of each input sequence token. Selected in the range `[0,
+ config.max_2d_position_embeddings-1]`. Each bounding box should be a normalized version in (x0, y0, x1, y1)
+ format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1,
+ y1) represents the position of the lower right corner.
+
+ Note that `sequence_length = token_sequence_length + patch_sequence_length + 1` where `1` is for [CLS]
+ token. See `pixel_values` for `patch_sequence_length`.
+
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Batch of document images. Each image is divided into patches of shape `(num_channels, config.patch_size,
+            config.patch_size)` and the total number of patches (=`patch_sequence_length`) equals `((height /
+ config.patch_size) * (width / config.patch_size))`.
+
+ visual_bbox (`torch.LongTensor` of shape `(batch_size, patch_sequence_length, 4)`, *optional*):
+ Bounding boxes of each patch in the image. If not provided, bounding boxes are created in the model.
+
+ decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
+ Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using
+ [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details.
+ [What are decoder input IDs?](../glossary#decoder-input-ids) T5 uses the `pad_token_id` as the starting
+ token for `decoder_input_ids` generation. If `past_key_values` is used, optionally only the last
+ `decoder_input_ids` have to be input (see `past_key_values`). To know more on how to prepare
+ `decoder_input_ids` for pretraining take a look at [T5 Training](./t5#training).
+ decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
+ Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
+ be used by default.
+ head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+ Mask to nullify selected heads of the self-attention modules in the encoder. Mask values selected in `[0,
+ 1]`:
+ - 1 indicates the head is **not masked**,
+ - 0 indicates the head is **masked**.
+ decoder_head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+ Mask to nullify selected heads of the self-attention modules in the decoder. Mask values selected in `[0,
+ 1]`:
+ - 1 indicates the head is **not masked**,
+ - 0 indicates the head is **masked**.
+ cross_attn_head_mask (`torch.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+ Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in
+ `[0, 1]`:
+ - 1 indicates the head is **not masked**,
+ - 0 indicates the head is **masked**.
+ encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
+ Tuple consists of (`last_hidden_state`, `optional`: *hidden_states*, `optional`: *attentions*)
+ `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)` is a sequence of hidden states at
+ the output of the last layer of the encoder. Used in the cross-attention of the decoder.
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+ model's internal embedding lookup matrix.
+ decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
+ Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
+ representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
+ input (see `past_key_values`). This is useful if you want more control over how to convert
+ `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. If
+ `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value of
+ `inputs_embeds`.
+ use_cache (`bool`, *optional*):
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+ `past_key_values`).
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+ tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+UDOP_ENCODER_INPUTS_DOCSTRING = r"""
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you
+ should be able to pad the inputs on both the right and the left.
+
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            To know more on how to prepare `input_ids` for pretraining take a look at [T5 Training](./t5#training).
+ attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+ - 1 for tokens that are **not masked**,
+ - 0 for tokens that are **masked**.
+
+ [What are attention masks?](../glossary#attention-mask)
+
+ bbox (`torch.LongTensor` of shape `({0}, 4)`, *optional*):
+            Bounding boxes of each input sequence token. Selected in the range `[0,
+ config.max_2d_position_embeddings-1]`. Each bounding box should be a normalized version in (x0, y0, x1, y1)
+ format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1,
+ y1) represents the position of the lower right corner.
+
+ Note that `sequence_length = token_sequence_length + patch_sequence_length + 1` where `1` is for [CLS]
+ token. See `pixel_values` for `patch_sequence_length`.
+
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Batch of document images. Each image is divided into patches of shape `(num_channels, config.patch_size,
+            config.patch_size)` and the total number of patches (=`patch_sequence_length`) equals `((height /
+ config.patch_size) * (width / config.patch_size))`.
+
+ visual_bbox (`torch.LongTensor` of shape `(batch_size, patch_sequence_length, 4)`, *optional*):
+ Bounding boxes of each patch in the image. If not provided, bounding boxes are created in the model.
+
+ head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+ Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
+
+ - 1 indicates the head is **not masked**,
+ - 0 indicates the head is **masked**.
+
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+ model's internal embedding lookup matrix.
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+ tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@dataclass
+class BaseModelOutputWithAttentionMask(ModelOutput):
+ """
+ Class for the model's outputs that may also contain a past key/values (to speed up sequential decoding). Includes
+ an additional attention mask.
+
+ Args:
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+ Sequence of hidden-states at the output of the last layer of the model. If `past_key_values` is used only
+ the last hidden-state of the sequences of shape `(batch_size, 1, hidden_size)` is output.
+        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
+ `config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
+ encoder_sequence_length, embed_size_per_head)`. Contains pre-computed hidden-states (key and values in the
+ self-attention blocks and optionally if `config.is_encoder_decoder=True` in the cross-attention blocks)
+ that can be used (see `past_key_values` input) to speed up sequential decoding.
+        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of
+ the model at the output of each layer plus the optional initial embedding outputs.
+        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+ sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
+ the self-attention heads.
+        cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+ sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
+ used to compute the weighted average in the cross-attention heads.
+ """
+
+ last_hidden_state: torch.FloatTensor = None
+ attention_mask: torch.FloatTensor = None
+ past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
+ cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
+
+
+def get_visual_bbox(image_size=224, patch_size=16):
+ image_feature_pool_shape = [image_size // patch_size, image_size // patch_size]
+ visual_bbox_x = torch.arange(0, 1.0 * (image_feature_pool_shape[1] + 1), 1.0)
+ visual_bbox_x /= image_feature_pool_shape[1]
+
+ visual_bbox_y = torch.arange(0, 1.0 * (image_feature_pool_shape[0] + 1), 1.0)
+ visual_bbox_y /= image_feature_pool_shape[0]
+
+ visual_bbox_input = torch.stack(
+ [
+ visual_bbox_x[:-1].repeat(image_feature_pool_shape[0], 1),
+ visual_bbox_y[:-1].repeat(image_feature_pool_shape[1], 1).transpose(0, 1),
+ visual_bbox_x[1:].repeat(image_feature_pool_shape[0], 1),
+ visual_bbox_y[1:].repeat(image_feature_pool_shape[1], 1).transpose(0, 1),
+ ],
+ dim=-1,
+ )
+
+ visual_bbox_input = visual_bbox_input.view(-1, 4)
+
+ return visual_bbox_input
+
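+# Example (illustrative): get_visual_bbox(image_size=224, patch_size=16) returns a (196, 4) tensor of
+# normalized patch boxes; the first box is [0, 0, 1/14, 1/14] and the last is [13/14, 13/14, 1, 1].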
+
+def pad_sequence(seq, target_len, pad_value=0):
+ if isinstance(seq, torch.Tensor):
+ n = seq.shape[0]
+ else:
+ n = len(seq)
+ seq = torch.tensor(seq)
+ m = target_len - n
+ if m > 0:
+ ret = torch.stack([pad_value] * m).to(seq)
+ seq = torch.cat([seq, ret], dim=0)
+ return seq[:target_len]
+
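+# Example (illustrative): pad_sequence(torch.ones(3, 4), target_len=5, pad_value=torch.zeros(4)) returns
+# a (5, 4) tensor whose last two rows are zeros; an input longer than target_len is truncated.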
+
+def combine_image_text_embeddings(
+ image_embeddings,
+ inputs_embeds,
+ bbox,
+ visual_bbox,
+ attention_mask=None,
+ num_patches=14,
+ max_len=0,
+ image_size=224,
+ patch_size=16,
+):
+ """
+ Combine the image and text embeddings for the input to the encoder/decoder of UDOP.
+
+    First, for every text token the embedding of the visual patch whose cell contains the token's bounding box center
+    is added to the token embedding. The remaining patches are then appended as extra positions: their embeddings are
+    concatenated to the text embeddings, their bounding boxes to the text bounding boxes, and their attention mask to
+    the text attention mask.
+ """
+
+ sequence_length = num_patches
+ ocr_points_x = torch.clip(
+ torch.floor((bbox[:, :, 0] + bbox[:, :, 2]) / 2.0 * sequence_length).long(), 0, sequence_length - 1
+ )
+ ocr_points_y = (
+ torch.clip(torch.floor((bbox[:, :, 1] + bbox[:, :, 3]) / 2.0 * sequence_length).long(), 0, sequence_length - 1)
+ * sequence_length
+ )
+ ocr_points = ocr_points_x + ocr_points_y
+ # make sure bounding boxes are of type float to calculate means
+ bbox = bbox.to(torch.float64)
+ target_seg = (bbox.mean(-1) == 0.0) | (bbox.mean(-1) == 1.0)
+ repeated_vision_embeds = torch.gather(
+ image_embeddings, 1, ocr_points.unsqueeze(-1).repeat(1, 1, image_embeddings.size(-1))
+ )
+ repeated_vision_embeds[target_seg] = 0.0
+ inputs_embeds += repeated_vision_embeds
+
+ patch_inds = torch.full_like(image_embeddings[:, :, 0], True).bool()
+ ind = torch.cat(
+ [
+ torch.arange(len(ocr_points))[:, None].repeat(1, ocr_points.size(-1))[:, :, None].to(ocr_points),
+ ocr_points[:, :, None],
+ ],
+ dim=-1,
+ )
+ ind = ind.flatten(0, 1)
+ rows, cols = zip(*ind)
+ patch_inds[rows, cols] = False
+
+ input_vision_patches = [image_embeddings[i][patch_inds[i]] for i in range(len(patch_inds))]
+
+ if visual_bbox is None:
+ visual_bbox = get_visual_bbox(image_size=image_size, patch_size=patch_size)
+ visual_bbox = visual_bbox.unsqueeze(0).repeat(image_embeddings.size(0), 1, 1)
+ visual_bbox = visual_bbox.to(image_embeddings.device)
+
+ visual_bbox = [visual_bbox[i][patch_inds[i]] for i in range(len(patch_inds))]
+ if attention_mask is not None:
+ visual_attention_mask = [torch.tensor([1] * len(item)).to(attention_mask) for item in visual_bbox]
+
+ if max_len == 0:
+ max_len = image_embeddings.size(1)
+ else:
+ max_len = max_len - inputs_embeds.size(1)
+ inputs_vision_patches = torch.stack(
+ [pad_sequence(item, max_len, torch.zeros_like(image_embeddings[0, 0])) for item in input_vision_patches]
+ )
+ visual_bbox = torch.stack([pad_sequence(item, max_len, torch.zeros_like(bbox[0, 0])) for item in visual_bbox])
+ if attention_mask is not None:
+ visual_attention_mask = torch.stack(
+ [pad_sequence(item, max_len, torch.zeros_like(attention_mask[0, 0])) for item in visual_attention_mask]
+ )
+
+ inputs_embeds = torch.cat([inputs_embeds, inputs_vision_patches], 1)
+ bbox = torch.cat([bbox, visual_bbox], 1)
+ if attention_mask is not None:
+ attention_mask = torch.cat([attention_mask, visual_attention_mask], 1)
+ return inputs_embeds, bbox, attention_mask
+
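+# Shape sketch (illustrative, assuming max_len == 0): text embeddings of shape (batch, seq_len, d_model)
+# and image embeddings of shape (batch, num_image_patches, d_model) are merged into a tensor of shape
+# (batch, seq_len + num_image_patches, d_model); bbox and attention_mask are extended to the same length,
+# with padded patch slots receiving a zero box and an attention mask of 0.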
+
+class UdopPatchEmbeddings(nn.Module):
+ """2D Image to Patch Embeddings"""
+
+ def __init__(self, config):
+ super().__init__()
+ image_size, patch_size = config.image_size, config.patch_size
+ num_channels, hidden_size = config.num_channels, config.hidden_size
+
+ image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size)
+ patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
+ num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
+ self.image_size = image_size
+ self.patch_size = patch_size
+ self.num_channels = num_channels
+ self.num_patches = num_patches
+
+ self.proj = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)
+
+ def forward(self, pixel_values):
+ batch_size, num_channels, height, width = pixel_values.shape
+ if height != self.image_size[0] or width != self.image_size[1]:
+ raise ValueError(
+ f"Input image size ({height}*{width}) doesn't match model"
+ f" ({self.image_size[0]}*{self.image_size[1]})."
+ )
+ embeddings = self.proj(pixel_values)
+ embeddings = embeddings.flatten(2).transpose(1, 2)
+ return embeddings
+
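+# Example (illustrative): with image_size=224 and patch_size=16, pixel_values of shape
+# (batch, 3, 224, 224) are projected to patch embeddings of shape (batch, 196, hidden_size).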
+
+class UdopPreTrainedModel(PreTrainedModel):
+ """
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+ models. Based on `T5PreTrainedModel`.
+ """
+
+ config_class = UdopConfig
+ base_model_prefix = "transformer"
+ supports_gradient_checkpointing = True
+ _no_split_modules = ["UdopBlock"]
+ _keep_in_fp32_modules = ["wo"]
+
+ def _init_weights(self, module):
+ """Initialize the weights"""
+ factor = self.config.initializer_factor # Used for testing weights initialization
+ if isinstance(module, UdopLayerNorm):
+ module.weight.data.fill_(factor * 1.0)
+ elif isinstance(module, nn.Embedding):
+ module.weight.data.normal_(mean=0.0, std=factor)
+ if module.padding_idx is not None:
+ module.weight.data[module.padding_idx].zero_()
+ elif isinstance(module, nn.Conv2d):
+ # Upcast the input in `fp32` and cast it back to desired `dtype` to avoid
+ # `trunc_normal_cpu` not implemented in `half` issues
+ module.weight.data = nn.init.trunc_normal_(module.weight.data.to(torch.float32), mean=0.0, std=factor).to(
+ module.weight.dtype
+ )
+ if module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, RelativePositionBiasBase):
+ factor = self.config.initializer_factor
+ d_model = self.config.d_model
+ module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
+ elif isinstance(module, UdopModel):
+ # Mesh TensorFlow embeddings initialization
+ # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624
+ module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)
+ elif isinstance(module, UdopForConditionalGeneration):
+ if hasattr(module, "lm_head") and not self.config.tie_word_embeddings:
+ module.lm_head.weight.data.normal_(mean=0.0, std=factor * 1.0)
+ elif isinstance(module, UdopDenseActDense):
+ # Mesh TensorFlow FF initialization
+ # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56
+ # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89
+ module.wi.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
+ if hasattr(module.wi, "bias") and module.wi.bias is not None:
+ module.wi.bias.data.zero_()
+ module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))
+ if hasattr(module.wo, "bias") and module.wo.bias is not None:
+ module.wo.bias.data.zero_()
+ elif isinstance(module, UdopDenseGatedActDense):
+ module.wi_0.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
+ if hasattr(module.wi_0, "bias") and module.wi_0.bias is not None:
+ module.wi_0.bias.data.zero_()
+ module.wi_1.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
+ if hasattr(module.wi_1, "bias") and module.wi_1.bias is not None:
+ module.wi_1.bias.data.zero_()
+ module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))
+ if hasattr(module.wo, "bias") and module.wo.bias is not None:
+ module.wo.bias.data.zero_()
+ elif isinstance(module, UdopAttention):
+ # Mesh TensorFlow attention initialization to avoid scaling before softmax
+ # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136
+ d_model = self.config.d_model
+ key_value_proj_dim = self.config.d_kv
+ n_heads = self.config.num_heads
+ module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * key_value_proj_dim) ** -0.5))
+ module.k.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
+ module.v.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
+ module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * key_value_proj_dim) ** -0.5))
+ if module.has_relative_attention_bias:
+ module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
+
+ # Copied from transformers.models.prophetnet.modeling_prophetnet.ProphetNetPreTrainedModel._shift_right with ProphetNet->Udop
+ def _shift_right(self, input_ids):
+ decoder_start_token_id = self.config.decoder_start_token_id
+ pad_token_id = self.config.pad_token_id
+
+ assert decoder_start_token_id is not None, (
+ "self.model.config.decoder_start_token_id has to be defined. In Udop it is usually set to the"
+ " pad_token_id. See Udop docs for more information"
+ )
+
+ # shift inputs to the right
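+        # e.g. (illustrative) with decoder_start_token_id=0 and pad_token_id=0, labels [[5, -100, 7]]
+        # become decoder inputs [[0, 5, 0]] once the -100 is replaced by pad_token_id below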
+ shifted_input_ids = input_ids.new_zeros(input_ids.shape)
+ shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
+ shifted_input_ids[..., 0] = decoder_start_token_id
+
+ assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined."
+ # replace possible -100 values in labels by `pad_token_id`
+ shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
+
+ assert torch.all(shifted_input_ids >= 0).item(), "Verify that `shifted_input_ids` has only positive values"
+
+ return shifted_input_ids
+
+
+# Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->Udop
+class UdopLayerNorm(nn.Module):
+ def __init__(self, hidden_size, eps=1e-6):
+ """
+ Construct a layernorm module in the Udop style. No bias and no subtraction of mean.
+ """
+ super().__init__()
+ self.weight = nn.Parameter(torch.ones(hidden_size))
+ self.variance_epsilon = eps
+
+ def forward(self, hidden_states):
+ # Udop uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
+        # Square Layer Normalization https://arxiv.org/abs/1910.07467, thus the variance is calculated
+        # without the mean and there is no bias. Additionally we want to make sure that the accumulation for
+ # half-precision inputs is done in fp32
+
+ variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+ # convert into half-precision if necessary
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
+ hidden_states = hidden_states.to(self.weight.dtype)
+
+ return self.weight * hidden_states
+
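+# In short: UdopLayerNorm computes weight * x / sqrt(mean(x**2, dim=-1, keepdim=True) + eps),
+# i.e. RMSNorm: rescaling only, with no mean subtraction and no bias.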
+
+# Copied from transformers.models.t5.modeling_t5.T5DenseActDense with T5->Udop
+class UdopDenseActDense(nn.Module):
+ def __init__(self, config: UdopConfig):
+ super().__init__()
+ self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)
+ self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
+ self.dropout = nn.Dropout(config.dropout_rate)
+ self.act = ACT2FN[config.dense_act_fn]
+
+ def forward(self, hidden_states):
+ hidden_states = self.wi(hidden_states)
+ hidden_states = self.act(hidden_states)
+ hidden_states = self.dropout(hidden_states)
+ if (
+ isinstance(self.wo.weight, torch.Tensor)
+ and hidden_states.dtype != self.wo.weight.dtype
+ and self.wo.weight.dtype != torch.int8
+ ):
+ hidden_states = hidden_states.to(self.wo.weight.dtype)
+ hidden_states = self.wo(hidden_states)
+ return hidden_states
+
+
+# Copied from transformers.models.t5.modeling_t5.T5DenseGatedActDense with T5->Udop
+class UdopDenseGatedActDense(nn.Module):
+ def __init__(self, config: UdopConfig):
+ super().__init__()
+ self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
+ self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
+ self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
+ self.dropout = nn.Dropout(config.dropout_rate)
+ self.act = ACT2FN[config.dense_act_fn]
+
+ def forward(self, hidden_states):
+ hidden_gelu = self.act(self.wi_0(hidden_states))
+ hidden_linear = self.wi_1(hidden_states)
+ hidden_states = hidden_gelu * hidden_linear
+ hidden_states = self.dropout(hidden_states)
+
+ # To make 8bit quantization work for google/flan-t5-xxl, self.wo is kept in float32.
+ # See https://github.com/huggingface/transformers/issues/20287
+        # we also make sure the weights are not in `int8` in case users force `_keep_in_fp32_modules` to be `None`
+ if (
+ isinstance(self.wo.weight, torch.Tensor)
+ and hidden_states.dtype != self.wo.weight.dtype
+ and self.wo.weight.dtype != torch.int8
+ ):
+ hidden_states = hidden_states.to(self.wo.weight.dtype)
+
+ hidden_states = self.wo(hidden_states)
+ return hidden_states
+
+
+# Copied from transformers.models.t5.modeling_t5.T5LayerFF with T5->Udop
+class UdopLayerFF(nn.Module):
+ def __init__(self, config: UdopConfig):
+ super().__init__()
+ if config.is_gated_act:
+ self.DenseReluDense = UdopDenseGatedActDense(config)
+ else:
+ self.DenseReluDense = UdopDenseActDense(config)
+
+ self.layer_norm = UdopLayerNorm(config.d_model, eps=config.layer_norm_epsilon)
+ self.dropout = nn.Dropout(config.dropout_rate)
+
+ def forward(self, hidden_states):
+ forwarded_states = self.layer_norm(hidden_states)
+ forwarded_states = self.DenseReluDense(forwarded_states)
+ hidden_states = hidden_states + self.dropout(forwarded_states)
+ return hidden_states
+
+
+# Copied from transformers.models.t5.modeling_t5.T5Attention with T5->Udop
+class UdopAttention(nn.Module):
+ def __init__(self, config: UdopConfig, has_relative_attention_bias=False):
+ super().__init__()
+ self.is_decoder = config.is_decoder
+ self.has_relative_attention_bias = has_relative_attention_bias
+ self.relative_attention_num_buckets = config.relative_attention_num_buckets
+ self.relative_attention_max_distance = config.relative_attention_max_distance
+ self.d_model = config.d_model
+ self.key_value_proj_dim = config.d_kv
+ self.n_heads = config.num_heads
+ self.dropout = config.dropout_rate
+ self.inner_dim = self.n_heads * self.key_value_proj_dim
+
+ # Mesh TensorFlow initialization to avoid scaling before softmax
+ self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)
+ self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)
+ self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)
+ self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)
+
+ if self.has_relative_attention_bias:
+ self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)
+ self.pruned_heads = set()
+ self.gradient_checkpointing = False
+
+ def prune_heads(self, heads):
+ if len(heads) == 0:
+ return
+ heads, index = find_pruneable_heads_and_indices(
+ heads, self.n_heads, self.key_value_proj_dim, self.pruned_heads
+ )
+ # Prune linear layers
+ self.q = prune_linear_layer(self.q, index)
+ self.k = prune_linear_layer(self.k, index)
+ self.v = prune_linear_layer(self.v, index)
+ self.o = prune_linear_layer(self.o, index, dim=1)
+ # Update hyper params
+ self.n_heads = self.n_heads - len(heads)
+ self.inner_dim = self.key_value_proj_dim * self.n_heads
+ self.pruned_heads = self.pruned_heads.union(heads)
+
+ @staticmethod
+ def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
+ """
+ Adapted from Mesh Tensorflow:
+ https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593
+
+ Translate relative position to a bucket number for relative attention. The relative position is defined as
+ memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to
+ position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for
+ small absolute relative_position and larger buckets for larger absolute relative_positions. All relative
+ positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.
+ This should allow for more graceful generalization to longer sequences than the model has been trained on
+
+ Args:
+ relative_position: an int32 Tensor
+ bidirectional: a boolean - whether the attention is bidirectional
+ num_buckets: an integer
+ max_distance: an integer
+
+ Returns:
+ a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets)
+ """
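+        # Worked example (illustrative): with bidirectional=True, num_buckets=32 and max_distance=128,
+        # num_buckets is halved to 16; a relative position of +3 maps to bucket 16 + 3 = 19 and -3 maps
+        # to bucket 3, while larger distances fall into logarithmically sized buckets capped at 15
+        # (negative side) and 31 (positive side).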
+ relative_buckets = 0
+ if bidirectional:
+ num_buckets //= 2
+ relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
+ relative_position = torch.abs(relative_position)
+ else:
+ relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
+ # now relative_position is in the range [0, inf)
+
+ # half of the buckets are for exact increments in positions
+ max_exact = num_buckets // 2
+ is_small = relative_position < max_exact
+
+ # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
+ relative_position_if_large = max_exact + (
+ torch.log(relative_position.float() / max_exact)
+ / math.log(max_distance / max_exact)
+ * (num_buckets - max_exact)
+ ).to(torch.long)
+ relative_position_if_large = torch.min(
+ relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
+ )
+
+ relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
+ return relative_buckets
+
+ def compute_bias(self, query_length, key_length, device=None):
+ """Compute binned relative position bias"""
+ if device is None:
+ device = self.relative_attention_bias.weight.device
+ context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None]
+ memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
+ relative_position = memory_position - context_position # shape (query_length, key_length)
+ relative_position_bucket = self._relative_position_bucket(
+ relative_position, # shape (query_length, key_length)
+ bidirectional=(not self.is_decoder),
+ num_buckets=self.relative_attention_num_buckets,
+ max_distance=self.relative_attention_max_distance,
+ )
+ values = self.relative_attention_bias(relative_position_bucket) # shape (query_length, key_length, num_heads)
+ values = values.permute([2, 0, 1]).unsqueeze(0) # shape (1, num_heads, query_length, key_length)
+ return values
+
+ def forward(
+ self,
+ hidden_states,
+ mask=None,
+ key_value_states=None,
+ position_bias=None,
+ past_key_value=None,
+ layer_head_mask=None,
+ query_length=None,
+ use_cache=False,
+ output_attentions=False,
+ ):
+ """
+ Self-attention (if key_value_states is None) or attention over source sentence (provided by key_value_states).
+ """
+ # Input is (batch_size, seq_length, dim)
+ # Mask is (batch_size, key_length) (non-causal) or (batch_size, key_length, key_length)
+ # past_key_value[0] is (batch_size, n_heads, q_len - 1, dim_per_head)
+ batch_size, seq_length = hidden_states.shape[:2]
+
+ real_seq_length = seq_length
+
+ if past_key_value is not None:
+ if len(past_key_value) != 2:
+ raise ValueError(
+                    f"past_key_value should have 2 past states: keys and values. Got {len(past_key_value)} past states"
+ )
+ real_seq_length += past_key_value[0].shape[2] if query_length is None else query_length
+
+ key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]
+
+ def shape(states):
+ """projection"""
+ return states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).transpose(1, 2)
+
+ def unshape(states):
+ """reshape"""
+ return states.transpose(1, 2).contiguous().view(batch_size, -1, self.inner_dim)
+
+ def project(hidden_states, proj_layer, key_value_states, past_key_value):
+ """projects hidden states correctly to key/query states"""
+ if key_value_states is None:
+ # self-attn
+ # (batch_size, n_heads, seq_length, dim_per_head)
+ hidden_states = shape(proj_layer(hidden_states))
+ elif past_key_value is None:
+ # cross-attn
+ # (batch_size, n_heads, seq_length, dim_per_head)
+ hidden_states = shape(proj_layer(key_value_states))
+
+ if past_key_value is not None:
+ if key_value_states is None:
+ # self-attn
+ # (batch_size, n_heads, key_length, dim_per_head)
+ hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
+ elif past_key_value.shape[2] != key_value_states.shape[1]:
+ # checking that the `sequence_length` of the `past_key_value` is the same as
+ # the provided `key_value_states` to support prefix tuning
+ # cross-attn
+ # (batch_size, n_heads, seq_length, dim_per_head)
+ hidden_states = shape(proj_layer(key_value_states))
+ else:
+ # cross-attn
+ hidden_states = past_key_value
+ return hidden_states
+
+ # get query states
+ query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
+
+ # get key/value states
+ key_states = project(
+ hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
+ )
+ value_states = project(
+ hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
+ )
+
+ # compute scores
+ scores = torch.matmul(
+ query_states, key_states.transpose(3, 2)
+ ) # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9
+
+ if position_bias is None:
+ if not self.has_relative_attention_bias:
+ position_bias = torch.zeros(
+ (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
+ )
+ if self.gradient_checkpointing and self.training:
+ position_bias.requires_grad = True
+ else:
+ position_bias = self.compute_bias(real_seq_length, key_length, device=scores.device)
+
+ # if key and values are already calculated
+ # we want only the last query position bias
+ if past_key_value is not None:
+ position_bias = position_bias[:, :, -hidden_states.size(1) :, :]
+
+ if mask is not None:
+ position_bias = position_bias + mask # (batch_size, n_heads, seq_length, key_length)
+
+ if self.pruned_heads:
+ mask = torch.ones(position_bias.shape[1])
+ mask[list(self.pruned_heads)] = 0
+ position_bias_masked = position_bias[:, mask.bool()]
+ else:
+ position_bias_masked = position_bias
+
+ scores += position_bias_masked
+ attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
+ scores
+ ) # (batch_size, n_heads, seq_length, key_length)
+ attn_weights = nn.functional.dropout(
+ attn_weights, p=self.dropout, training=self.training
+ ) # (batch_size, n_heads, seq_length, key_length)
+
+ # Mask heads if we want to
+ if layer_head_mask is not None:
+ attn_weights = attn_weights * layer_head_mask
+
+ attn_output = unshape(torch.matmul(attn_weights, value_states)) # (batch_size, seq_length, dim)
+ attn_output = self.o(attn_output)
+
+ present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
+ outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
+
+ if output_attentions:
+ outputs = outputs + (attn_weights,)
+ return outputs
+
+
+# Copied from transformers.models.t5.modeling_t5.T5LayerSelfAttention with T5->Udop
+class UdopLayerSelfAttention(nn.Module):
+ def __init__(self, config, has_relative_attention_bias=False):
+ super().__init__()
+ self.SelfAttention = UdopAttention(config, has_relative_attention_bias=has_relative_attention_bias)
+ self.layer_norm = UdopLayerNorm(config.d_model, eps=config.layer_norm_epsilon)
+ self.dropout = nn.Dropout(config.dropout_rate)
+
+ def forward(
+ self,
+ hidden_states,
+ attention_mask=None,
+ position_bias=None,
+ layer_head_mask=None,
+ past_key_value=None,
+ use_cache=False,
+ output_attentions=False,
+ ):
+ normed_hidden_states = self.layer_norm(hidden_states)
+ attention_output = self.SelfAttention(
+ normed_hidden_states,
+ mask=attention_mask,
+ position_bias=position_bias,
+ layer_head_mask=layer_head_mask,
+ past_key_value=past_key_value,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ )
+ hidden_states = hidden_states + self.dropout(attention_output[0])
+ outputs = (hidden_states,) + attention_output[1:] # add attentions if we output them
+ return outputs
+
+
+# Copied from transformers.models.t5.modeling_t5.T5LayerCrossAttention with T5->Udop
+class UdopLayerCrossAttention(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.EncDecAttention = UdopAttention(config, has_relative_attention_bias=False)
+ self.layer_norm = UdopLayerNorm(config.d_model, eps=config.layer_norm_epsilon)
+ self.dropout = nn.Dropout(config.dropout_rate)
+
+ def forward(
+ self,
+ hidden_states,
+ key_value_states,
+ attention_mask=None,
+ position_bias=None,
+ layer_head_mask=None,
+ past_key_value=None,
+ use_cache=False,
+ query_length=None,
+ output_attentions=False,
+ ):
+ normed_hidden_states = self.layer_norm(hidden_states)
+ attention_output = self.EncDecAttention(
+ normed_hidden_states,
+ mask=attention_mask,
+ key_value_states=key_value_states,
+ position_bias=position_bias,
+ layer_head_mask=layer_head_mask,
+ past_key_value=past_key_value,
+ use_cache=use_cache,
+ query_length=query_length,
+ output_attentions=output_attentions,
+ )
+ layer_output = hidden_states + self.dropout(attention_output[0])
+ outputs = (layer_output,) + attention_output[1:] # add attentions if we output them
+ return outputs
+
+
+# Copied from transformers.models.t5.modeling_t5.T5Block with T5->Udop
+class UdopBlock(nn.Module):
+ def __init__(self, config, has_relative_attention_bias=False):
+ super().__init__()
+ self.is_decoder = config.is_decoder
+ self.layer = nn.ModuleList()
+ self.layer.append(UdopLayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
+ if self.is_decoder:
+ self.layer.append(UdopLayerCrossAttention(config))
+
+ self.layer.append(UdopLayerFF(config))
+
+ def forward(
+ self,
+ hidden_states,
+ attention_mask=None,
+ position_bias=None,
+ encoder_hidden_states=None,
+ encoder_attention_mask=None,
+ encoder_decoder_position_bias=None,
+ layer_head_mask=None,
+ cross_attn_layer_head_mask=None,
+ past_key_value=None,
+ use_cache=False,
+ output_attentions=False,
+ return_dict=True,
+ ):
+ if past_key_value is not None:
+ if not self.is_decoder:
+ logger.warning("`past_key_values` is passed to the encoder. Please make sure this is intended.")
+ expected_num_past_key_values = 2 if encoder_hidden_states is None else 4
+
+ if len(past_key_value) != expected_num_past_key_values:
+ raise ValueError(
+ f"There should be {expected_num_past_key_values} past states. "
+ f"{'2 (past / key) for cross attention. ' if expected_num_past_key_values == 4 else ''}"
+ f"Got {len(past_key_value)} past key / value states"
+ )
+
+ self_attn_past_key_value = past_key_value[:2]
+ cross_attn_past_key_value = past_key_value[2:]
+ else:
+ self_attn_past_key_value, cross_attn_past_key_value = None, None
+
+ self_attention_outputs = self.layer[0](
+ hidden_states,
+ attention_mask=attention_mask,
+ position_bias=position_bias,
+ layer_head_mask=layer_head_mask,
+ past_key_value=self_attn_past_key_value,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ )
+ hidden_states, present_key_value_state = self_attention_outputs[:2]
+ attention_outputs = self_attention_outputs[2:] # Keep self-attention outputs and relative position weights
+
+ # clamp inf values to enable fp16 training
+ if hidden_states.dtype == torch.float16:
+ clamp_value = torch.where(
+ torch.isinf(hidden_states).any(),
+ torch.finfo(hidden_states.dtype).max - 1000,
+ torch.finfo(hidden_states.dtype).max,
+ )
+ hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+ do_cross_attention = self.is_decoder and encoder_hidden_states is not None
+ if do_cross_attention:
+ # the actual query length is unknown for cross attention
+ # if using past key value states. Need to inject it here
+ if present_key_value_state is not None:
+ query_length = present_key_value_state[0].shape[2]
+ else:
+ query_length = None
+
+ cross_attention_outputs = self.layer[1](
+ hidden_states,
+ key_value_states=encoder_hidden_states,
+ attention_mask=encoder_attention_mask,
+ position_bias=encoder_decoder_position_bias,
+ layer_head_mask=cross_attn_layer_head_mask,
+ past_key_value=cross_attn_past_key_value,
+ query_length=query_length,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ )
+ hidden_states = cross_attention_outputs[0]
+
+ # clamp inf values to enable fp16 training
+ if hidden_states.dtype == torch.float16:
+ clamp_value = torch.where(
+ torch.isinf(hidden_states).any(),
+ torch.finfo(hidden_states.dtype).max - 1000,
+ torch.finfo(hidden_states.dtype).max,
+ )
+ hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+ # Combine self attn and cross attn key value states
+ if present_key_value_state is not None:
+ present_key_value_state = present_key_value_state + cross_attention_outputs[1]
+
+ # Keep cross-attention outputs and relative position weights
+ attention_outputs = attention_outputs + cross_attention_outputs[2:]
+
+ # Apply Feed Forward layer
+ hidden_states = self.layer[-1](hidden_states)
+
+ # clamp inf values to enable fp16 training
+ if hidden_states.dtype == torch.float16:
+ clamp_value = torch.where(
+ torch.isinf(hidden_states).any(),
+ torch.finfo(hidden_states.dtype).max - 1000,
+ torch.finfo(hidden_states.dtype).max,
+ )
+ hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+ outputs = (hidden_states,)
+
+ if use_cache:
+ outputs = outputs + (present_key_value_state,) + attention_outputs
+ else:
+ outputs = outputs + attention_outputs
+
+ return outputs # hidden-states, present_key_value_states, (self-attention position bias), (self-attention weights), (cross-attention position bias), (cross-attention weights)
+
+
+class UdopCellEmbeddings(nn.Module):
+ def __init__(self, max_2d_position_embeddings=501, hidden_size=1024):
+ super(UdopCellEmbeddings, self).__init__()
+ self.max_2d_position_embeddings = max_2d_position_embeddings
+
+ self.x_position_embeddings = nn.Embedding(max_2d_position_embeddings, hidden_size)
+ self.y_position_embeddings = nn.Embedding(max_2d_position_embeddings, hidden_size)
+
+ def forward(self, bbox):
+ bbox = torch.clip(bbox, 0.0, 1.0)
+ bbox = (bbox * (self.max_2d_position_embeddings - 1)).long()
+ left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
+ upper_position_embeddings = self.y_position_embeddings(bbox[:, :, 1])
+ right_position_embeddings = self.x_position_embeddings(bbox[:, :, 2])
+ lower_position_embeddings = self.y_position_embeddings(bbox[:, :, 3])
+
+ embeddings = (
+ left_position_embeddings
+ + upper_position_embeddings
+ + right_position_embeddings
+ + lower_position_embeddings
+ )
+
+ return embeddings
+
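+# Example (illustrative): with the default max_2d_position_embeddings=501, a normalized coordinate of
+# 0.25 is clipped to [0, 1] and mapped to embedding index int(0.25 * 500) = 125; the four resulting
+# embeddings (x0, y0, x1, y1) are summed.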
+
+# Get the bucket computation function from UdopAttention;
+# accessing the protected member seems to be the lesser evil compared to copy-pasting the whole function.
+get_relative_position_bucket = UdopAttention._relative_position_bucket
+AUGMENTATION_RANGE = (0.80, 1.25)
+
+
+class RelativePositionBiasBase(nn.Module, ABC):
+ """
+ Base class of relative biases.
+
+ Args:
+ num_heads (`int`):
+ Number of attention heads in the model, it will create embeddings of size `num_heads`, which will be added to the scores of each token pair.
+ relative_attention_num_buckets (`int`, *optional*, defaults to 32):
+            The pair-wise token metric (distance in the sequence, distance in pixels, etc.) is bucketed; this
+            parameter defines the number of such buckets.
+ bidirectional (`bool`, *optional*, defaults to `True`):
+ Whether the distance should be bidirectional for a pair of tokens. If `False`, then distance(tok1, tok2) == distance(tok2, tok1).
+ scaling_factor (`int`, *optional*, defaults to 1):
+            Factor used to scale the relative distance.
+ max_distance (`int`, *optional*, defaults to 128):
+            All distances above this value end up in the same bucket.
+ augmentation (`bool`, *optional*, defaults to `False`):
+ Whether to multiply relative distances by a random scalar.
+ expand (`bool`, *optional*, defaults to `False`):
+            Whether to expand an existing pretrained model with the subsequent addition of a prefix_bucket.
+ """
+
+ def __init__(
+ self,
+ num_heads=None,
+ relative_attention_num_buckets=32,
+ bidirectional=True,
+ scaling_factor=1,
+ max_distance=128,
+ level="tokens",
+ augmentation=False,
+ prefix_bucket=False,
+ expand=False,
+ ):
+ super(RelativePositionBiasBase, self).__init__()
+ self.prefix_bucket = prefix_bucket
+ self.augmentation = augmentation
+ self.level = level
+ self.max_distance = max_distance
+ self.scaling_factor = scaling_factor
+ self.bidirectional = bidirectional
+ self.num_heads = num_heads
+ self.expand = expand
+ self.relative_attention_num_buckets = relative_attention_num_buckets
+ extra_head = 2 if prefix_bucket and not self.expand else 0
+ self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets + extra_head, self.num_heads)
+
+ @abstractmethod
+ def prepare_input(
+ self,
+ attention_mask: Optional[Tensor] = None,
+ bbox: Optional[Dict[str, Any]] = None,
+ ) -> Tensor:
+ pass
+
+ def get_bucket(self, attention_mask: Optional[Tensor] = None, bbox: Optional[Dict[str, Any]] = None) -> Tensor:
+ relative_position = self.prepare_input(attention_mask, bbox)
+ rp_bucket: Tensor = get_relative_position_bucket(
+ relative_position,
+ bidirectional=self.bidirectional,
+ num_buckets=self.relative_attention_num_buckets,
+ max_distance=self.max_distance,
+ )
+ return rp_bucket
+
+ def get_relative_position(self, positions):
+ context_position = positions[:, :, None]
+ memory_position = positions[:, None, :]
+ relative_position = memory_position - context_position
+ if self.augmentation and self.training:
+ relative_position *= random.uniform(*AUGMENTATION_RANGE)
+ relative_position *= self.scaling_factor
+
+ return relative_position.to(torch.long)
+
+ def forward(self, attention_mask: Optional[Tensor] = None, bbox: Optional[Dict[str, Any]] = None) -> Tensor:
+ # re-using pretrained model with subsequent addition of prefix_bucket
+ if self.expand and self.prefix_bucket:
+ new_bias = nn.Embedding(self.relative_attention_num_buckets + 2, self.num_heads)
+ new_bias.weight.data[: self.relative_attention_num_buckets] = self.relative_attention_bias.weight.data
+ new_bias.weight.data[self.relative_attention_num_buckets :] = 0.1
+ self.relative_attention_bias = new_bias
+ self.expand = False
+
+ rp_bucket = self.get_bucket(attention_mask, bbox)
+
+ if self.prefix_bucket:
+ if rp_bucket.size(0) == 1 and attention_mask.size(0) > 1:
+ rp_bucket = rp_bucket.repeat(attention_mask.size(0), 1, 1)
+ # based on assumption that prefix bboxes are negative
+ is_prefix = bbox[:, :, 1] < 0
+ num_prefix = is_prefix.sum(-1)
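+ # route prefix <-> non-prefix pairs to the two extra buckets appended to the embedding table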
+ for idx, num_prefix_row in enumerate(num_prefix.cpu().numpy()):
+ rp_bucket[idx, :num_prefix_row, num_prefix_row:] = self.relative_attention_num_buckets
+ rp_bucket[idx, num_prefix_row:, :num_prefix_row] = self.relative_attention_num_buckets + 1
+
+ values: Tensor = self.relative_attention_bias(rp_bucket)
+ if values.dim() != 4:
+ raise ValueError("Wrong dimension of values tensor")
+ values = values.permute([0, 3, 1, 2])
+
+ return values
+
+
+class RelativePositionBias1D(RelativePositionBiasBase):
+ def __init__(self, scaling_factor=1, max_distance=128, **kwargs):
+ """
+ Reimplementation of the T5 relative position bias. The distance between two tokens is their distance in the
+ sequence. Parameters are the same as in the base class.
+ """
+ super().__init__(scaling_factor=scaling_factor, max_distance=max_distance, **kwargs)
+
+ def prepare_input(self, attention_mask: Optional[Tensor] = None, bbox: Optional[Dict[str, Any]] = None) -> Tensor:
+ if self.scaling_factor != 1:
+ raise ValueError("No need to scale 1d features")
+ relative_position = self.get_relative_position(
+ torch.arange(attention_mask.size(1), dtype=torch.long, device=attention_mask.device)[None, :]
+ )
+
+ return relative_position
+
+
+class RelativePositionBiasHorizontal(RelativePositionBiasBase):
+ def __init__(self, scaling_factor=100, max_distance=100, **kwargs):
+ """
+ Represents the horizontal distance between two tokens in the bucket embeddings. Parameters are the same as in the
+ base class.
+ """
+ super().__init__(scaling_factor=scaling_factor, max_distance=max_distance, **kwargs)
+
+ def prepare_input(self, attention_mask: Optional[Tensor] = None, bbox: Optional[Dict[str, Any]] = None) -> Tensor:
+ if not self.scaling_factor > 1.0:
+ raise ValueError("Need to scale the values of bboxes, as there are in small (0,1) range")
+ if bbox is None:
+ raise ValueError("Bbox is required for horizontal relative position bias")
+ # get the x position of the horizontal center of the bbox (mean of the left and right edges)
+ horizontal_position: Tensor = bbox[:, :, [0, 2]].mean(dim=-1)
+
+ return self.get_relative_position(horizontal_position)
+
+
+class RelativePositionBiasVertical(RelativePositionBiasBase):
+ def __init__(self, scaling_factor=100, max_distance=100, **kwargs):
+ """
+ Represents the vertical distance between two tokens in the bucket embeddings. Parameters are the same as in the
+ base class.
+ """
+ super().__init__(scaling_factor=scaling_factor, max_distance=max_distance, **kwargs)
+
+ def prepare_input(self, attention_mask: Optional[Tensor] = None, bbox: Optional[Dict[str, Any]] = None) -> Tensor:
+ if not self.scaling_factor > 1.0:
+ raise ValueError("Need to scale the values of bboxes, as there are in small (0,1) range")
+ if bbox is None:
+ raise ValueError("Bbox is required for vertical relative position bias")
+ # get the y position of the vertical center of the bbox (mean of the upper and lower edges)
+ vertical_position: Tensor = bbox[:, :, [1, 3]].mean(dim=-1)
+
+ return self.get_relative_position(vertical_position)
+
+
+class RelativePositionBiasAggregated(nn.Module):
+ def __init__(self, modules: Sequence[RelativePositionBiasBase]):
+ """
+ Class that sums the relative position biases computed by multiple bias modules.
+
+ Args:
+ modules (Sequence[RelativePositionBiasBase]):
+ List of relative bias modules.
+ """
+ super().__init__()
+ self.biases = nn.ModuleList(modules)
+
+ def forward(
+ self, attention_mask: Optional[Tensor] = None, bbox: Optional[Dict[str, Any]] = None
+ ) -> Union[float, Tensor]:
+ output = 0.0
+ for bias in self.biases: # type: ignore
+ output = bias(attention_mask, bbox) + output
+
+ return output
+
+
+BIAS_CLASSES = {
+ "1d": RelativePositionBias1D,
+ "horizontal": RelativePositionBiasHorizontal,
+ "vertical": RelativePositionBiasVertical,
+}
+
+
+def create_relative_bias(config: UdopConfig) -> Sequence[RelativePositionBiasBase]:
+ """
+ Creates an empty list or one/multiple relative bias modules, depending on the model's configuration.
+
+ :param config: Model's configuration.
+ :return: Sequence with the created bias modules.
+ """
+ bias_list = []
+ if hasattr(config, "relative_bias_args"):
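+ # each entry of `relative_bias_args` is a dict with a "type" key ("1d", "horizontal" or "vertical")
+ # plus optional kwargs for the corresponding bias class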
+ for bias_kwargs_org in config.relative_bias_args:
+ bias_kwargs = deepcopy(bias_kwargs_org)
+ bias_type = bias_kwargs.pop("type")
+ model_num_heads = config.num_heads if hasattr(config, "num_heads") else config.num_attention_heads
+ if "num_heads" in bias_kwargs:
+ if bias_kwargs["num_heads"] != model_num_heads:
+ raise ValueError("Number of heads must match num of heads in the model")
+ else:
+ bias_kwargs["num_heads"] = model_num_heads
+ bias_list.append(BIAS_CLASSES[bias_type](**bias_kwargs)) # type: ignore
+
+ return bias_list
+
+
+class UdopStack(UdopPreTrainedModel):
+ """
+ This class is based on `T5Stack`, but modified to take into account the image modality as well as 2D position
+ embeddings.
+ """
+
+ def __init__(self, config, embed_tokens=None, embed_patches=None):
+ super().__init__(config)
+
+ self.embed_tokens = embed_tokens
+ self.embed_patches = embed_patches
+ self.is_decoder = config.is_decoder
+ self._max_length = config.max_length
+ self.num_layers = config.num_layers
+
+ self.block = nn.ModuleList(
+ [UdopBlock(config, has_relative_attention_bias=bool(i == 0)) for i in range(self.num_layers)]
+ )
+ self.final_layer_norm = UdopLayerNorm(config.d_model, eps=config.layer_norm_epsilon)
+
+ self.dropout = nn.Dropout(config.dropout_rate)
+
+ if not self.is_decoder:
+ self.cell_2d_embedding = UdopCellEmbeddings(config.max_2d_position_embeddings, config.hidden_size)
+
+ # get weights from encoder position bias
+ self.relative_bias = self._get_relative_bias(config)
+
+ # tie weights of original position bias of encoder
+ for bias in self.relative_bias.biases:
+ if isinstance(bias, RelativePositionBias1D):
+ self._tie_or_clone_weights(
+ bias.relative_attention_bias, self.block[0].layer[0].SelfAttention.relative_attention_bias
+ )
+
+ @staticmethod
+ def _get_relative_bias(config: UdopConfig) -> RelativePositionBiasAggregated:
+ relative_bias_list = create_relative_bias(config)
+ return RelativePositionBiasAggregated(relative_bias_list)
+
+ def get_input_embeddings(self):
+ return self.embed_tokens
+
+ def get_output_embeddings(self):
+ return self.embed_tokens
+
+ def set_input_embeddings(self, new_embeddings):
+ self.embed_tokens = new_embeddings
+
+ def forward(
+ self,
+ input_ids=None,
+ attention_mask=None,
+ bbox=None,
+ encoder_hidden_states=None,
+ encoder_attention_mask=None,
+ inputs_embeds=None,
+ pixel_values=None,
+ visual_bbox=None,
+ image_embeddings=None,
+ position_bias=None,
+ head_mask=None,
+ cross_attn_head_mask=None,
+ past_key_values=None,
+ use_cache=None,
+ output_attentions=None,
+ output_hidden_states=None,
+ return_dict=None,
+ ):
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ # input embeddings processing
+
+ if input_ids is not None and inputs_embeds is not None:
+ err_msg_prefix = "decoder_" if self.is_decoder else ""
+ raise ValueError(
+ f"You cannot specify both {err_msg_prefix}inputs and {err_msg_prefix}inputs_embeds at the same time"
+ )
+ elif input_ids is not None and torch.numel(input_ids) > 0:
+ input_shape = input_ids.size()
+ input_ids = input_ids.view(-1, input_shape[-1])
+ elif inputs_embeds is None and input_ids is not None and torch.numel(input_ids) == 0:
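+ # empty batch: fall back to dummy (4, 1024) inputs so that the downstream shape logic still works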
+ input_ids = torch.full((4, 1024), self.config.pad_token_id, device=input_ids.device, dtype=input_ids.dtype)
+ attention_mask = torch.zeros((4, 1024), device=input_ids.device, dtype=input_ids.dtype)
+ bbox = torch.zeros((4, 1024, 4), device=input_ids.device, dtype=input_ids.dtype)
+ input_shape = input_ids.size()
+ position_bias = torch.zeros_like(self.get_extended_attention_mask(attention_mask, input_shape))
+ # encoder_attention_mask = attention_mask
+ logger.warning("Empty batch")
+ elif inputs_embeds is not None:
+ input_shape = inputs_embeds.size()[:-1]
+ else:
+ err_msg_prefix = "decoder_" if self.is_decoder else ""
+ raise ValueError(f"You have to specify either {err_msg_prefix}inputs or {err_msg_prefix}inputs_embeds")
+
+ if inputs_embeds is None:
+ if self.embed_tokens is None:
+ raise ValueError("You have to intialize the model with valid token embeddings")
+ inputs_embeds = self.embed_tokens(input_ids)
+
+ if pixel_values is not None:
+ image_embeddings = self.embed_patches(pixel_values)
+
+ if image_embeddings is not None:
+ # combine visual and OCR text embeddings
+ num_patches = self.config.image_size // self.config.patch_size
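+ # num_patches is the number of patches along one side of the (square) image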
+ inputs_embeds, bbox, attention_mask = combine_image_text_embeddings(
+ image_embeddings,
+ inputs_embeds,
+ bbox,
+ visual_bbox,
+ attention_mask,
+ num_patches,
+ 0,
+ self.config.image_size,
+ self.config.patch_size,
+ )
+ input_shape = inputs_embeds.size()[:-1]
+
+ if not self.is_decoder and bbox is not None:
+ inputs_embeds += self.cell_2d_embedding(bbox)
+
+ batch_size, seq_length = input_shape
+
+ # required mask seq length can be calculated via length of past
+ mask_seq_length = past_key_values[0][0].shape[2] + seq_length if past_key_values is not None else seq_length
+
+ if use_cache is True:
+ assert self.is_decoder, "`use_cache` can only be set to `True` if {} is used as a decoder".format(self)
+
+ if attention_mask is None:
+ attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)
+ if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:
+ encoder_seq_length = encoder_hidden_states.shape[1]
+ encoder_attention_mask = torch.ones(
+ batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long
+ )
+
+ # initialize past_key_values with `None` if past does not exist
+ if past_key_values is None:
+ past_key_values = [None] * len(self.block)
+
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+ # ourselves in which case we just need to make it broadcastable to all heads.
+ extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)
+
+ if self.is_decoder and encoder_attention_mask is not None:
+ encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
+ else:
+ encoder_extended_attention_mask = None
+
+ # Prepare head mask if needed
+ head_mask = self.get_head_mask(head_mask, self.num_layers)
+ present_key_value_states = () if use_cache else None
+ all_hidden_states = () if output_hidden_states else None
+ all_attentions = () if output_attentions else None
+ all_cross_attentions = () if (output_attentions and self.is_decoder) else None
+
+ if self.is_decoder:  # modified compared to T5Stack: only the encoder computes the aggregated (2D-aware) position bias
+ position_bias = None
+ else:
+ position_bias = self.relative_bias(attention_mask=attention_mask, bbox=bbox)
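+ # the extended attention mask holds large negative values at masked positions, so adding it to the
+ # position bias effectively masks those positions in the attention scores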
+ position_bias = position_bias + extended_attention_mask
+ encoder_decoder_position_bias = None
+
+ hidden_states = inputs_embeds
+
+ hidden_states = self.dropout(hidden_states)
+
+ for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (hidden_states,)
+
+ layer_outputs = layer_module(
+ hidden_states,
+ attention_mask=extended_attention_mask,
+ position_bias=position_bias,
+ encoder_hidden_states=encoder_hidden_states,
+ encoder_attention_mask=encoder_extended_attention_mask,
+ encoder_decoder_position_bias=encoder_decoder_position_bias,
+ layer_head_mask=head_mask[i],
+ past_key_value=past_key_value,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ )
+ # layer_outputs is a tuple with:
+ # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
+ if use_cache is False: # MP fixes
+ layer_outputs = layer_outputs[:1] + (None,) + layer_outputs[1:]
+ hidden_states, present_key_value_state = layer_outputs[:2]
+
+ # We share the position biases between the layers - the first layer stores them
+ # layer_outputs = hidden-states, key-value-states (self-attention weights),
+ # (self-attention position bias), (cross-attention weights), (cross-attention position bias)
+
+ position_bias = layer_outputs[2]
+ if self.is_decoder and encoder_hidden_states is not None:
+ encoder_decoder_position_bias = layer_outputs[4 if output_attentions else 3]
+ # append next layer key value states
+ if use_cache:
+ present_key_value_states = present_key_value_states + (present_key_value_state,)
+
+ if output_attentions:
+ all_attentions = all_attentions + (layer_outputs[2],) # We keep only self-attention weights for now
+ if self.is_decoder:
+ all_cross_attentions = all_cross_attentions + (layer_outputs[5],)
+
+ hidden_states = self.final_layer_norm(hidden_states)
+ hidden_states = self.dropout(hidden_states)
+
+ # Add last layer
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (hidden_states,)
+
+ if not return_dict:
+ return tuple(
+ v
+ for v in [
+ hidden_states,
+ attention_mask,
+ present_key_value_states,
+ all_hidden_states,
+ all_attentions,
+ all_cross_attentions,
+ ]
+ if v is not None
+ )
+
+ return BaseModelOutputWithAttentionMask(
+ last_hidden_state=hidden_states,
+ attention_mask=attention_mask,
+ past_key_values=present_key_value_states,
+ hidden_states=all_hidden_states,
+ attentions=all_attentions,
+ cross_attentions=all_cross_attentions,
+ )
+
+
+@add_start_docstrings(
+ "The bare UDOP encoder-decoder Transformer outputting raw hidden-states without any specific head on top.",
+ UDOP_START_DOCSTRING,
+)
+class UdopModel(UdopPreTrainedModel):
+ _tied_weights_keys = [
+ "encoder.embed_tokens.weight",
+ "decoder.embed_tokens.weight",
+ "encoder.embed_patches.proj.weight",
+ "encoder.embed_patches.proj.bias",
+ "encoder.relative_bias.biases.0.relative_attention_bias.weight",
+ "decoder.relative_bias.biases.0.relative_attention_bias.weight",
+ ]
+
+ def __init__(self, config):
+ super(UdopModel, self).__init__(config)
+
+ # text and image embeddings
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
+ self.patch_embed = UdopPatchEmbeddings(config)
+
+ encoder_config = deepcopy(config)
+ encoder_config.is_decoder = False
+ encoder_config.use_cache = False
+ encoder_config.is_encoder_decoder = False
+ self.encoder = UdopStack(encoder_config, self.shared, self.patch_embed)
+
+ decoder_config = deepcopy(config)
+ decoder_config.is_decoder = True
+ decoder_config.is_encoder_decoder = False
+ decoder_config.num_layers = config.num_decoder_layers
+ self.decoder = UdopStack(decoder_config, self.shared)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.shared
+
+ def set_input_embeddings(self, new_embeddings):
+ self.shared = new_embeddings
+ self.encoder.set_input_embeddings(new_embeddings)
+ self.decoder.set_input_embeddings(new_embeddings)
+
+ def get_encoder(self):
+ return self.encoder
+
+ def get_decoder(self):
+ return self.decoder
+
+ @add_start_docstrings_to_model_forward(UDOP_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=Seq2SeqModelOutput, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ input_ids: Tensor = None,
+ attention_mask: Tensor = None,
+ bbox: Dict[str, Any] = None,
+ pixel_values: Optional[Tensor] = None,
+ visual_bbox: Dict[str, Any] = None,
+ decoder_input_ids: Optional[Tensor] = None,
+ decoder_attention_mask: Optional[Tensor] = None,
+ inputs_embeds: Optional[Tensor] = None,
+ encoder_outputs: Optional[Tensor] = None,
+ past_key_values: Optional[Tensor] = None,
+ head_mask: Optional[Tensor] = None,
+ decoder_inputs_embeds: Optional[Tensor] = None,
+ decoder_head_mask: Optional[Tensor] = None,
+ cross_attn_head_mask: Optional[Tensor] = None,
+ use_cache=True,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Tuple[Tensor, ...]:
+ r"""
+ Returns:
+
+ Example:
+
+ ```python
+ >>> from transformers import AutoProcessor, AutoModel
+ >>> from datasets import load_dataset
+ >>> import torch
+
+ >>> processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
+ >>> model = AutoModel.from_pretrained("microsoft/udop-large")
+
+ >>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
+ >>> example = dataset[0]
+ >>> image = example["image"]
+ >>> words = example["tokens"]
+ >>> boxes = example["bboxes"]
+ >>> inputs = processor(image, words, boxes=boxes, return_tensors="pt")
+
+ >>> decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
+
+ >>> # forward pass
+ >>> outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
+ >>> last_hidden_states = outputs.last_hidden_state
+ >>> list(last_hidden_states.shape)
+ [1, 1, 1024]
+ ```"""
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ # Encode if needed (training, first prediction pass)
+ if encoder_outputs is None:
+ encoder_outputs = self.encoder(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ bbox=bbox,
+ pixel_values=pixel_values,
+ visual_bbox=visual_bbox,
+ inputs_embeds=inputs_embeds,
+ head_mask=head_mask,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ hidden_states = encoder_outputs[0]
+ encoder_attention_mask = encoder_outputs.attention_mask if return_dict else encoder_outputs[1]
+
+ # Decode
+ decoder_outputs = self.decoder(
+ input_ids=decoder_input_ids,
+ attention_mask=decoder_attention_mask,
+ inputs_embeds=decoder_inputs_embeds,
+ past_key_values=past_key_values,
+ encoder_hidden_states=hidden_states,
+ encoder_attention_mask=encoder_attention_mask,
+ head_mask=decoder_head_mask,
+ cross_attn_head_mask=cross_attn_head_mask,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ if not return_dict:
+ # we filter out the attention mask
+ decoder_outputs = tuple(value for idx, value in enumerate(decoder_outputs) if idx != 1)
+ encoder_outputs = tuple(value for idx, value in enumerate(encoder_outputs) if idx != 1)
+ return decoder_outputs + encoder_outputs
+
+ return Seq2SeqModelOutput(
+ last_hidden_state=decoder_outputs.last_hidden_state,
+ past_key_values=decoder_outputs.past_key_values,
+ decoder_hidden_states=decoder_outputs.hidden_states,
+ decoder_attentions=decoder_outputs.attentions,
+ cross_attentions=decoder_outputs.cross_attentions,
+ encoder_last_hidden_state=encoder_outputs.last_hidden_state,
+ encoder_hidden_states=encoder_outputs.hidden_states,
+ encoder_attentions=encoder_outputs.attentions,
+ )
+
+
+@add_start_docstrings(
+ """The UDOP encoder-decoder Transformer with a language modeling head on top, enabling to generate text given document
+ images and an optional prompt.
+
+ This class is based on [`T5ForConditionalGeneration`], extended to deal with images and layout (2D) data.""",
+ UDOP_START_DOCSTRING,
+)
+class UdopForConditionalGeneration(UdopPreTrainedModel):
+ _tied_weights_keys = [
+ "encoder.embed_tokens.weight",
+ "decoder.embed_tokens.weight",
+ "encoder.embed_patches.proj.weight",
+ "encoder.embed_patches.proj.bias",
+ "encoder.relative_bias.biases.0.relative_attention_bias.weight",
+ "decoder.relative_bias.biases.0.relative_attention_bias.weight",
+ "lm_head.weight",
+ ]
+
+ def __init__(self, config):
+ super(UdopForConditionalGeneration, self).__init__(config)
+
+ # text and image embeddings
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
+ self.patch_embed = UdopPatchEmbeddings(config)
+
+ encoder_config = deepcopy(config)
+ encoder_config.is_decoder = False
+ encoder_config.use_cache = False
+ encoder_config.is_encoder_decoder = False
+ self.encoder = UdopStack(encoder_config, self.shared, self.patch_embed)
+
+ decoder_config = deepcopy(config)
+ decoder_config.is_decoder = True
+ decoder_config.is_encoder_decoder = False
+ decoder_config.num_layers = config.num_decoder_layers
+ self.decoder = UdopStack(decoder_config, self.shared)
+
+ # The weights of the language modeling head are shared with those of the encoder and decoder
+ self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.shared
+
+ def set_input_embeddings(self, new_embeddings):
+ self.shared = new_embeddings
+ self.encoder.set_input_embeddings(new_embeddings)
+ self.decoder.set_input_embeddings(new_embeddings)
+
+ def set_output_embeddings(self, new_embeddings):
+ self.lm_head = new_embeddings
+
+ def get_output_embeddings(self):
+ return self.lm_head
+
+ def get_encoder(self):
+ return self.encoder
+
+ def get_decoder(self):
+ return self.decoder
+
+ @add_start_docstrings_to_model_forward(UDOP_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ input_ids: Tensor = None,
+ attention_mask: Tensor = None,
+ bbox: Dict[str, Any] = None,
+ pixel_values: Optional[Tensor] = None,
+ visual_bbox: Dict[str, Any] = None,
+ decoder_input_ids: Optional[Tensor] = None,
+ decoder_attention_mask: Optional[Tensor] = None,
+ inputs_embeds: Optional[Tensor] = None,
+ encoder_outputs: Optional[Tensor] = None,
+ past_key_values: Optional[Tensor] = None,
+ head_mask: Optional[Tensor] = None,
+ decoder_inputs_embeds: Optional[Tensor] = None,
+ decoder_head_mask: Optional[Tensor] = None,
+ cross_attn_head_mask: Optional[Tensor] = None,
+ use_cache=True,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ labels: Optional[Tensor] = None,
+ ) -> Tuple[Tensor, ...]:
+ r"""
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Labels for computing the language modeling loss. Indices should be in `[-100, 0, ..., config.vocab_size -
+ 1]`. All labels set to `-100` are ignored (masked), the loss is only computed for labels in `[0, ...,
+ config.vocab_size]`.
+
+ Returns:
+
+ Examples:
+
+ ```python
+ >>> from transformers import AutoProcessor, UdopForConditionalGeneration
+ >>> from datasets import load_dataset
+
+ >>> # load model and processor
+ >>> processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
+ >>> model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")
+
+ >>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
+ >>> example = dataset[0]
+ >>> image = example["image"]
+ >>> words = example["tokens"]
+ >>> boxes = example["bboxes"]
+ >>> question = "Question answering. What is the date on the form?"
+ >>> encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
+
+ >>> # autoregressive generation
+ >>> predicted_ids = model.generate(**encoding)
+ >>> print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
+ 9/30/92
+ ```"""
+
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ if decoder_input_ids is None and labels is not None:
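+ # teacher forcing: shift the labels one token to the right to build the decoder inputs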
+ decoder_input_ids = self._shift_right(labels)
+
+ # Encode if needed (training, first prediction pass)
+ if encoder_outputs is None:
+ encoder_outputs = self.encoder(
+ input_ids=input_ids,
+ bbox=bbox,
+ visual_bbox=visual_bbox,
+ pixel_values=pixel_values,
+ attention_mask=attention_mask,
+ inputs_embeds=inputs_embeds,
+ head_mask=head_mask,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ hidden_states = encoder_outputs[0]
+ encoder_attention_mask = encoder_outputs.attention_mask if return_dict else encoder_outputs[1]
+
+ # Decode
+ decoder_outputs = self.decoder(
+ input_ids=decoder_input_ids,
+ attention_mask=decoder_attention_mask,
+ inputs_embeds=decoder_inputs_embeds,
+ past_key_values=past_key_values,
+ encoder_hidden_states=hidden_states,
+ encoder_attention_mask=encoder_attention_mask,
+ head_mask=decoder_head_mask,
+ cross_attn_head_mask=cross_attn_head_mask,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ sequence_output = decoder_outputs[0]
+
+ if self.config.tie_word_embeddings:
+ # Rescale output before projecting on vocab
+ # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
+ sequence_output = sequence_output * (self.config.d_model**-0.5)
+
+ lm_logits = self.lm_head(sequence_output)
+
+ loss = None
+ if labels is not None:
+ loss_fct = CrossEntropyLoss(ignore_index=-100)
+ loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
+
+ if not return_dict:
+ output = (lm_logits,) + decoder_outputs[2:] + (encoder_outputs[0],) + encoder_outputs[2:]
+ return ((loss,) + output) if loss is not None else output
+
+ return Seq2SeqLMOutput(
+ loss=loss,
+ logits=lm_logits,
+ past_key_values=decoder_outputs.past_key_values,
+ decoder_hidden_states=decoder_outputs.hidden_states,
+ decoder_attentions=decoder_outputs.attentions,
+ cross_attentions=decoder_outputs.cross_attentions,
+ encoder_last_hidden_state=encoder_outputs.last_hidden_state,
+ encoder_hidden_states=encoder_outputs.hidden_states,
+ encoder_attentions=encoder_outputs.attentions,
+ )
+
+ def prepare_inputs_for_generation(
+ self,
+ input_ids,
+ past_key_values=None,
+ attention_mask=None,
+ head_mask=None,
+ decoder_head_mask=None,
+ cross_attn_head_mask=None,
+ use_cache=None,
+ encoder_outputs=None,
+ **kwargs,
+ ):
+ # cut decoder_input_ids if past is used
+ if past_key_values is not None:
+ input_ids = input_ids[:, -1:]
+
+ return {
+ "decoder_input_ids": input_ids,
+ "past_key_values": past_key_values,
+ "encoder_outputs": encoder_outputs,
+ "attention_mask": attention_mask,
+ "head_mask": head_mask,
+ "decoder_head_mask": decoder_head_mask,
+ "cross_attn_head_mask": cross_attn_head_mask,
+ "use_cache": use_cache,
+ "bbox": kwargs.get("bbox", None),
+ "pixel_values": kwargs.get("pixel_values", None),
+ "visual_bbox": kwargs.get("visual_bbox", None),
+ }
+
+ # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration._reorder_cache
+ def _reorder_cache(self, past_key_values, beam_idx):
+ # if decoder past is not included in output
+ # speedy decoding is disabled and no need to reorder
+ if past_key_values is None:
+ logger.warning("You might want to consider setting `use_cache=True` to speed up decoding")
+ return past_key_values
+
+ reordered_decoder_past = ()
+ for layer_past_states in past_key_values:
+ # get the correct batch idx from layer past batch dim
+ # batch dim of `past` is at 2nd position
+ reordered_layer_past_states = ()
+ for layer_past_state in layer_past_states:
+ # need to set correct `past` for each of the four key / value states
+ reordered_layer_past_states = reordered_layer_past_states + (
+ layer_past_state.index_select(0, beam_idx.to(layer_past_state.device)),
+ )
+
+ if reordered_layer_past_states[0].shape != layer_past_states[0].shape:
+ raise ValueError(
+ f"reordered_layer_past_states[0] shape {reordered_layer_past_states[0].shape} and layer_past_states[0] shape {layer_past_states[0].shape} mismatched"
+ )
+ if len(reordered_layer_past_states) != len(layer_past_states):
+ raise ValueError(
+ f"length of reordered_layer_past_states {len(reordered_layer_past_states)} and length of layer_past_states {len(layer_past_states)} mismatched"
+ )
+
+ reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)
+ return reordered_decoder_past
+
+
+@add_start_docstrings(
+ "The bare UDOP Model transformer outputting encoder's raw hidden-states without any specific head on top.",
+ UDOP_START_DOCSTRING,
+)
+class UdopEncoderModel(UdopPreTrainedModel):
+ _tied_weights_keys = [
+ "encoder.embed_tokens.weight",
+ "encoder.embed_patches.proj.weight",
+ "encoder.embed_patches.proj.bias",
+ "encoder.relative_bias.biases.0.relative_attention_bias.weight",
+ ]
+
+ def __init__(self, config: UdopConfig):
+ super().__init__(config)
+
+ # text and image embeddings
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
+ self.patch_embed = UdopPatchEmbeddings(config)
+
+ encoder_config = deepcopy(config)
+ encoder_config.is_decoder = False
+ encoder_config.use_cache = False
+ encoder_config.is_encoder_decoder = False
+ self.encoder = UdopStack(encoder_config, self.shared, self.patch_embed)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.shared
+
+ def set_input_embeddings(self, new_embeddings):
+ self.shared = new_embeddings
+ self.encoder.set_input_embeddings(new_embeddings)
+
+ def get_encoder(self):
+ return self.encoder
+
+ def _prune_heads(self, heads_to_prune):
+ """
+ Prunes heads of the model. `heads_to_prune` should be a dict of {layer_num: list of heads to prune in this
+ layer}. See the base class `PreTrainedModel`.
+ """
+ for layer, heads in heads_to_prune.items():
+ self.encoder.block[layer].layer[0].SelfAttention.prune_heads(heads)
+
+ @add_start_docstrings_to_model_forward(UDOP_ENCODER_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=BaseModelOutputWithAttentionMask, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ input_ids: Tensor = None,
+ bbox: Dict[str, Any] = None,
+ attention_mask: Tensor = None,
+ pixel_values: Optional[Tensor] = None,
+ visual_bbox: Dict[str, Any] = None,
+ head_mask: Optional[Tensor] = None,
+ inputs_embeds: Optional[Tensor] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple[torch.FloatTensor], BaseModelOutputWithAttentionMask]:
+ r"""
+ Returns:
+
+ Example:
+
+ ```python
+ >>> from transformers import AutoProcessor, UdopEncoderModel
+ >>> from huggingface_hub import hf_hub_download
+ >>> from datasets import load_dataset
+
+ >>> processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
+ >>> model = UdopEncoderModel.from_pretrained("microsoft/udop-large")
+
+ >>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
+ >>> example = dataset[0]
+ >>> image = example["image"]
+ >>> words = example["tokens"]
+ >>> boxes = example["bboxes"]
+ >>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")
+
+ >>> outputs = model(**encoding)
+ >>> last_hidden_states = outputs.last_hidden_state
+ ```"""
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ encoder_outputs = self.encoder(
+ input_ids=input_ids,
+ bbox=bbox,
+ visual_bbox=visual_bbox,
+ pixel_values=pixel_values,
+ attention_mask=attention_mask,
+ inputs_embeds=inputs_embeds,
+ head_mask=head_mask,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ return encoder_outputs
diff --git a/src/transformers/models/udop/processing_udop.py b/src/transformers/models/udop/processing_udop.py
new file mode 100644
index 00000000000000..2902541d6f5b46
--- /dev/null
+++ b/src/transformers/models/udop/processing_udop.py
@@ -0,0 +1,204 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Processor class for UDOP.
+"""
+
+from typing import List, Optional, Union
+
+from ...image_utils import ImageInput
+from ...processing_utils import ProcessorMixin
+from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
+from ...utils import TensorType
+
+
+class UdopProcessor(ProcessorMixin):
+ r"""
+ Constructs a UDOP processor which combines a LayoutLMv3 image processor and a UDOP tokenizer into a single processor.
+
+ [`UdopProcessor`] offers all the functionalities you need to prepare data for the model.
+
+ It first uses [`LayoutLMv3ImageProcessor`] to resize, rescale and normalize document images, and optionally applies OCR
+ to get words and normalized bounding boxes. These are then provided to [`UdopTokenizer`] or [`UdopTokenizerFast`],
+ which turns the words and bounding boxes into token-level `input_ids`, `attention_mask`, `token_type_ids`, `bbox`.
+ Optionally, one can provide integer `word_labels`, which are turned into token-level `labels` for token
+ classification tasks (such as FUNSD, CORD).
+
+ Additionally, it also supports passing `text_target` and `text_pair_target` to the tokenizer, which can be used to
+ prepare labels for language modeling tasks.
+
+ Args:
+ image_processor (`LayoutLMv3ImageProcessor`):
+ An instance of [`LayoutLMv3ImageProcessor`]. The image processor is a required input.
+ tokenizer (`UdopTokenizer` or `UdopTokenizerFast`):
+ An instance of [`UdopTokenizer`] or [`UdopTokenizerFast`]. The tokenizer is a required input.
+ """
+
+ attributes = ["image_processor", "tokenizer"]
+ image_processor_class = "LayoutLMv3ImageProcessor"
+ tokenizer_class = ("UdopTokenizer", "UdopTokenizerFast")
+
+ def __init__(self, image_processor, tokenizer):
+ super().__init__(image_processor, tokenizer)
+
+ def __call__(
+ self,
+ images: Optional[ImageInput] = None,
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+ text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
+ boxes: Union[List[List[int]], List[List[List[int]]]] = None,
+ word_labels: Optional[Union[List[int], List[List[int]]]] = None,
+ text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+ text_pair_target: Optional[
+ Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
+ ] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = False,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ ) -> BatchEncoding:
+ """
+ This method first forwards the `images` argument to [`~LayoutLMv3ImageProcessor.__call__`]. In case
+ [`LayoutLMv3ImageProcessor`] was initialized with `apply_ocr` set to `True`, it passes the obtained words and
+ bounding boxes along with the additional arguments to [`~UdopTokenizer.__call__`] and returns the output,
+ together with the prepared `pixel_values`. In case [`LayoutLMv3ImageProcessor`] was initialized with `apply_ocr`
+ set to `False`, it passes the words (`text`/`text_pair`) and `boxes` specified by the user along with the
+ additional arguments to [`~UdopTokenizer.__call__`] and returns the output, together with the prepared
+ `pixel_values`.
+
+ Alternatively, one can pass `text_target` and `text_pair_target` to prepare the targets of UDOP.
+
+ Please refer to the docstring of the above two methods for more information.
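+
+ Example (a minimal sketch mirroring the model docstring examples; it assumes the `microsoft/udop-large` checkpoint
+ and the `nielsr/funsd-layoutlmv3` dataset, which provides pre-extracted words and boxes, so OCR is disabled):
+
+ ```python
+ >>> from transformers import UdopProcessor
+ >>> from datasets import load_dataset
+
+ >>> processor = UdopProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
+ >>> example = load_dataset("nielsr/funsd-layoutlmv3", split="train")[0]
+ >>> encoding = processor(example["image"], example["tokens"], boxes=example["bboxes"], return_tensors="pt")
+ ```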
+ """
+ # verify input
+ if self.image_processor.apply_ocr and (boxes is not None):
+ raise ValueError(
+ "You cannot provide bounding boxes if you initialized the image processor with apply_ocr set to True."
+ )
+
+ if self.image_processor.apply_ocr and (word_labels is not None):
+ raise ValueError(
+ "You cannot provide word labels if you initialized the image processor with apply_ocr set to True."
+ )
+
+ if return_overflowing_tokens is True and return_offsets_mapping is False:
+ raise ValueError("You cannot return overflowing tokens without returning the offsets mapping.")
+
+ if text_target is not None:
+ # use the processor to prepare the targets of UDOP
+ return self.tokenizer(
+ text_target=text_target,
+ text_pair_target=text_pair_target,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ return_tensors=return_tensors,
+ )
+
+ else:
+ # use the processor to prepare the inputs of UDOP
+ # first, apply the image processor
+ features = self.image_processor(images=images, return_tensors=return_tensors)
+
+ # second, apply the tokenizer
+ if text is not None and self.image_processor.apply_ocr and text_pair is None:
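+ # the user-provided text acts as a prompt; the words obtained by OCR become the second sequence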
+ if isinstance(text, str):
+ text = [text] # add batch dimension (as the image processor always adds a batch dimension)
+ text_pair = features["words"]
+
+ encoded_inputs = self.tokenizer(
+ text=text if text is not None else features["words"],
+ text_pair=text_pair if text_pair is not None else None,
+ boxes=boxes if boxes is not None else features["boxes"],
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ return_tensors=return_tensors,
+ )
+
+ # add pixel values
+ pixel_values = features.pop("pixel_values")
+ if return_overflowing_tokens is True:
+ pixel_values = self.get_overflowing_images(pixel_values, encoded_inputs["overflow_to_sample_mapping"])
+ encoded_inputs["pixel_values"] = pixel_values
+
+ return encoded_inputs
+
+ # Copied from transformers.models.layoutlmv3.processing_layoutlmv3.LayoutLMv3Processor.get_overflowing_images
+ def get_overflowing_images(self, images, overflow_to_sample_mapping):
+ # in case there's an overflow, ensure each `input_ids` sample is mapped to its corresponding image
+ images_with_overflow = []
+ for sample_idx in overflow_to_sample_mapping:
+ images_with_overflow.append(images[sample_idx])
+
+ if len(images_with_overflow) != len(overflow_to_sample_mapping):
+ raise ValueError(
+ "Expected length of images to be the same as the length of `overflow_to_sample_mapping`, but got"
+ f" {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
+ )
+
+ return images_with_overflow
+
+ # Copied from transformers.models.layoutlmv3.processing_layoutlmv3.LayoutLMv3Processor.batch_decode
+ def batch_decode(self, *args, **kwargs):
+ """
+ This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
+ refer to the docstring of this method for more information.
+ """
+ return self.tokenizer.batch_decode(*args, **kwargs)
+
+ # Copied from transformers.models.layoutlmv3.processing_layoutlmv3.LayoutLMv3Processor.decode
+ def decode(self, *args, **kwargs):
+ """
+ This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer
+ to the docstring of this method for more information.
+ """
+ return self.tokenizer.decode(*args, **kwargs)
+
+ @property
+ # Copied from transformers.models.layoutlmv3.processing_layoutlmv3.LayoutLMv3Processor.model_input_names
+ def model_input_names(self):
+ return ["input_ids", "bbox", "attention_mask", "pixel_values"]
diff --git a/src/transformers/models/udop/tokenization_udop.py b/src/transformers/models/udop/tokenization_udop.py
new file mode 100644
index 00000000000000..10e92db48cebba
--- /dev/null
+++ b/src/transformers/models/udop/tokenization_udop.py
@@ -0,0 +1,1483 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License
+""" Tokenization classes for UDOP model."""
+
+
+import os
+import re
+import warnings
+from shutil import copyfile
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+import sentencepiece as spm
+
+from ...tokenization_utils import PreTrainedTokenizer
+from ...tokenization_utils_base import (
+ AddedToken,
+ BatchEncoding,
+ EncodedInput,
+ PreTokenizedInput,
+ TextInput,
+ TextInputPair,
+ TruncationStrategy,
+)
+from ...utils import PaddingStrategy, TensorType, add_end_docstrings, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+SPIECE_UNDERLINE = "▁"
+
+
+UDOP_ENCODE_KWARGS_DOCSTRING = r"""
+ add_special_tokens (`bool`, *optional*, defaults to `True`):
+ Whether or not to encode the sequences with the special tokens relative to their model.
+ padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
+ Activates and controls padding. Accepts the following values:
+
+ - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+ sequence is provided).
+ - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
+ acceptable input length for the model if that argument is not provided.
+ - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
+ lengths).
+ truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
+ Activates and controls truncation. Accepts the following values:
+
+ - `True` or `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or
+ to the maximum acceptable input length for the model if that argument is not provided. This will
+ truncate token by token, removing a token from the longest sequence in the pair if a pair of
+ sequences (or a batch of pairs) is provided.
+ - `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will only
+ truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+ - `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will only
+ truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+ - `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths
+ greater than the model maximum admissible input size).
+ max_length (`int`, *optional*):
+ Controls the maximum length to use by one of the truncation/padding parameters.
+
+ If left unset or set to `None`, this will use the predefined model maximum length if a maximum length
+ is required by one of the truncation/padding parameters. If the model has no specific maximum input
+ length (like XLNet) truncation/padding to a maximum length will be deactivated.
+ stride (`int`, *optional*, defaults to 0):
+ If set to a number along with `max_length`, the overflowing tokens returned when
+ `return_overflowing_tokens=True` will contain some tokens from the end of the truncated sequence
+ returned to provide some overlap between truncated and overflowing sequences. The value of this
+ argument defines the number of overlapping tokens.
+ pad_to_multiple_of (`int`, *optional*):
+ If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
+ the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta).
+ return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
+ If set, will return tensors instead of list of python integers. Acceptable values are:
+
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
+ - `'np'`: Return Numpy `np.ndarray` objects.
+ return_token_type_ids (`bool`, *optional*):
+ Whether to return token type IDs. If left to the default, will return the token type IDs according to
+ the specific tokenizer's default, defined by the `return_outputs` attribute.
+
+ [What are token type IDs?](../glossary#token-type-ids)
+ return_attention_mask (`bool`, *optional*):
+ Whether to return the attention mask. If left to the default, will return the attention mask according
+ to the specific tokenizer's default, defined by the `return_outputs` attribute.
+
+ [What are attention masks?](../glossary#attention-mask)
+ return_overflowing_tokens (`bool`, *optional*, defaults to `False`):
+ Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
+ of pairs) is provided with `truncation_strategy = longest_first` or `True`, an error is raised instead
+ of returning overflowing tokens.
+ return_special_tokens_mask (`bool`, *optional*, defaults to `False`):
+ Whether or not to return special tokens mask information.
+ return_offsets_mapping (`bool`, *optional*, defaults to `False`):
+ Whether or not to return `(char_start, char_end)` for each token.
+
+ This is only available on fast tokenizers inheriting from [`PreTrainedTokenizerFast`], if using
+ Python's tokenizer, this method will raise `NotImplementedError`.
+ return_length (`bool`, *optional*, defaults to `False`):
+ Whether or not to return the lengths of the encoded inputs.
+ verbose (`bool`, *optional*, defaults to `True`):
+ Whether or not to print more information and warnings.
+ **kwargs: passed to the `self.tokenize()` method
+
+ Return:
+ [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
+
+ - **input_ids** -- List of token ids to be fed to a model.
+
+ [What are input IDs?](../glossary#input-ids)
+
+ - **bbox** -- List of bounding boxes to be fed to a model.
+
+ - **token_type_ids** -- List of token type ids to be fed to a model (when `return_token_type_ids=True` or
+ if *"token_type_ids"* is in `self.model_input_names`).
+
+ [What are token type IDs?](../glossary#token-type-ids)
+
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names`).
+
+ [What are attention masks?](../glossary#attention-mask)
+
+ - **labels** -- List of labels to be fed to a model. (when `word_labels` is specified).
+ - **overflowing_tokens** -- List of overflowing tokens sequences (when a `max_length` is specified and
+ `return_overflowing_tokens=True`).
+ - **num_truncated_tokens** -- Number of tokens truncated (when a `max_length` is specified and
+ `return_overflowing_tokens=True`).
+ - **special_tokens_mask** -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
+ regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
+ - **length** -- The length of the inputs (when `return_length=True`).
+"""
+
+VOCAB_FILES_NAMES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+ "vocab_file": {
+ "microsoft/udop-large": "https://huggingface.co/microsoft/udop-large/resolve/main/spiece.model",
+ },
+ "tokenizer_file": {
+ "microsoft/udop-large": "https://huggingface.co/microsoft/udop-large/resolve/main/tokenizer.json",
+ },
+}
+
+
+# TODO(PVP) - this should be removed in Transformers v5
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+ "microsoft/udop-large": 512,
+}
+
+
+class UdopTokenizer(PreTrainedTokenizer):
+ """
+ Adapted from [`LayoutXLMTokenizer`] and [`T5Tokenizer`]. Based on
+ [SentencePiece](https://github.com/google/sentencepiece).
+
+ This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
+ this superclass for more information regarding those methods.
+
+ Args:
+ vocab_file (`str`):
+ Path to the vocabulary file.
+
+ eos_token (`str`, *optional*, defaults to `"</s>"`):
+ The end of sequence token.
+
+ <Tip>
+
+ When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+ The token used is the `sep_token`.
+
+ </Tip>
+
+ unk_token (`str`, *optional*, defaults to `"<unk>"`):
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+ token instead.
+
+ sep_token (`str`, *optional*, defaults to `"</s>"`):
+ The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
+ sequence classification or for a text and a question for question answering. It is also used as the last
+ token of a sequence built with special tokens.
+
+ pad_token (`str`, *optional*, defaults to `"<pad>"`):
+ The token used for padding, for example when batching sequences of different lengths.
+ sep_token_box (`List[int]`, *optional*, defaults to `[1000, 1000, 1000, 1000]`):
+ The bounding box to use for the special [SEP] token.
+ pad_token_box (`List[int]`, *optional*, defaults to `[0, 0, 0, 0]`):
+ The bounding box to use for the special [PAD] token.
+ pad_token_label (`int`, *optional*, defaults to -100):
+ The label to use for padding tokens. Defaults to -100, which is the `ignore_index` of PyTorch's
+ CrossEntropyLoss.
+ only_label_first_subword (`bool`, *optional*, defaults to `True`):
+ Whether or not to only label the first subword, in case word labels are provided.
+ additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
+ Additional special tokens used by the tokenizer.
+
+ sp_model_kwargs (`dict`, *optional*):
+ Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
+ SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
+ to set:
+
+ - `enable_sampling`: Enable subword regularization.
+ - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
+
+ - `nbest_size = {0,1}`: No sampling is performed.
+ - `nbest_size > 1`: samples from the nbest_size results.
+ - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
+ using forward-filtering-and-backward-sampling algorithm.
+
+ - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
+ BPE-dropout.
+ legacy (`bool`, *optional*, defaults to `True`):
+ Whether or not the `legacy` behaviour of the tokenizer should be used. Legacy is before the merge of #24622
+ which includes fixes to properly handle tokens that appear after special tokens. A simple example:
+ - `legacy=True`:
+ ```python
+ >>> from transformers import T5Tokenizer
+
+ >>> tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=True)
+ >>> tokenizer.encode("Hello <extra_id_0>.")
+ [8774, 32099, 3, 5, 1]
+ ```
+ - `legacy=False`:
+ ```python
+ >>> from transformers import T5Tokenizer
+
+ >>> tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
+ >>> tokenizer.encode("Hello <extra_id_0>.")  # the extra space `[3]` is no longer here
+ [8774, 32099, 5, 1]
+ ```
+ Checkout the pull request and the issue [here](https://github.com/huggingface/transformers/pull/24565) for
+ more details.
+ add_prefix_space (`bool`, *optional*, defaults to `True`):
+ Whether or not to add an initial space to the input. This allows the leading word to be treated just like any
+ other word.
+
+
+ Attributes:
+ sp_model (`SentencePieceProcessor`):
+ The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
+ """
+
+ vocab_files_names = VOCAB_FILES_NAMES
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+ model_input_names = ["input_ids", "attention_mask"]
+
+ def __init__(
+ self,
+ vocab_file,
+ eos_token="</s>",
+ unk_token="<unk>",
+ sep_token="</s>",
+ pad_token="<pad>",
+ sep_token_box=[1000, 1000, 1000, 1000],
+ pad_token_box=[0, 0, 0, 0],
+ pad_token_label=-100,
+ only_label_first_subword=True,
+ additional_special_tokens=None,
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
+ legacy=True,
+ add_prefix_space=True,
+ **kwargs,
+ ) -> None:
+ eos_token = AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token
+ unk_token = AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token
+ sep_token = AddedToken(sep_token, special=True) if isinstance(sep_token, str) else sep_token
+ pad_token = AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token
+
+ self.legacy = legacy
+ self.add_prefix_space = add_prefix_space
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+
+ self.vocab_file = vocab_file
+
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+ self.sp_model.Load(vocab_file)
+
+ # additional properties
+ self.sep_token_box = sep_token_box
+ self.pad_token_box = pad_token_box
+ self.pad_token_label = pad_token_label
+ self.only_label_first_subword = only_label_first_subword
+
+ super().__init__(
+ eos_token=eos_token,
+ unk_token=unk_token,
+ sep_token=sep_token,
+ pad_token=pad_token,
+ sep_token_box=sep_token_box,
+ pad_token_box=pad_token_box,
+ pad_token_label=pad_token_label,
+ only_label_first_subword=only_label_first_subword,
+ additional_special_tokens=additional_special_tokens,
+ sp_model_kwargs=self.sp_model_kwargs,
+ legacy=legacy,
+ add_prefix_space=add_prefix_space,
+ **kwargs,
+ )
+
+ @property
+ def vocab_size(self):
+ return len(self.sp_model)
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_vocab
+ def get_vocab(self):
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+ vocab.update(self.added_tokens_encoder)
+ return vocab
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_special_tokens_mask
+ def get_special_tokens_mask(
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
+ ) -> List[int]:
+ """
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+ special tokens using the tokenizer `prepare_for_model` method.
+
+ Args:
+ token_ids_0 (`List[int]`):
+ List of IDs.
+ token_ids_1 (`List[int]`, *optional*):
+ Optional second list of IDs for sequence pairs.
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+ Whether or not the token list is already formatted with special tokens for the model.
+
+ Returns:
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+ """
+ if already_has_special_tokens:
+ return super().get_special_tokens_mask(
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
+ )
+
+ # normal case: some special tokens
+ if token_ids_1 is None:
+ return ([0] * len(token_ids_0)) + [1]
+ return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_sentinel_tokens
+ def get_sentinel_tokens(self):
+ return list(
+ set(filter(lambda x: bool(re.search(r"<extra_id_\d+>", x)) is not None, self.additional_special_tokens))
+ )
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_sentinel_token_ids
+ def get_sentinel_token_ids(self):
+ return [self.convert_tokens_to_ids(token) for token in self.get_sentinel_tokens()]
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer._add_eos_if_not_present
+ def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
+ """Do not add eos again if user already added it."""
+ if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
+ warnings.warn(
+ f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"
+ " eos tokens being added."
+ )
+ return token_ids
+ else:
+ return token_ids + [self.eos_token_id]
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.create_token_type_ids_from_sequences
+ def create_token_type_ids_from_sequences(
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+ ) -> List[int]:
+ """
+ Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make
+ use of token type ids, therefore a list of zeros is returned.
+
+ Args:
+ token_ids_0 (`List[int]`):
+ List of IDs.
+ token_ids_1 (`List[int]`, *optional*):
+ Optional second list of IDs for sequence pairs.
+
+ Returns:
+ `List[int]`: List of zeros.
+ """
+ eos = [self.eos_token_id]
+
+ if token_ids_1 is None:
+ return len(token_ids_0 + eos) * [0]
+ return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.build_inputs_with_special_tokens
+ def build_inputs_with_special_tokens(
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+ ) -> List[int]:
+ """
+ Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
+ adding special tokens. A sequence has the following format:
+
+ - single sequence: `X </s>`
+ - pair of sequences: `A </s> B </s>`
+
+ Args:
+ token_ids_0 (`List[int]`):
+ List of IDs to which the special tokens will be added.
+ token_ids_1 (`List[int]`, *optional*):
+ Optional second list of IDs for sequence pairs.
+
+ Returns:
+ `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
+ """
+ token_ids_0 = self._add_eos_if_not_present(token_ids_0)
+ if token_ids_1 is None:
+ return token_ids_0
+ else:
+ token_ids_1 = self._add_eos_if_not_present(token_ids_1)
+ return token_ids_0 + token_ids_1
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.__getstate__
+ def __getstate__(self):
+ state = self.__dict__.copy()
+ state["sp_model"] = None
+ return state
+
+ def __setstate__(self, d):
+ self.__dict__ = d
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+ self.sp_model.Load(self.vocab_file)
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
+ def tokenize(self, text: "TextInput", **kwargs) -> List[str]:
+ """
+ Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
+ first token is special.
+ """
+ if self.legacy or len(text) == 0:
+ return super().tokenize(text, **kwargs)
+
+ text = text.replace(SPIECE_UNDERLINE, " ")
+ if self.add_prefix_space:
+ text = SPIECE_UNDERLINE + text
+
+ tokens = super().tokenize(text, **kwargs)
+
+ if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
+ tokens = tokens[1:]
+ return tokens
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer._tokenize
+ def _tokenize(self, text, **kwargs):
+ """
+ Returns a tokenized string.
+
+ We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
+ SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
+ `['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
+ `unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
+ `self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
+ """
+ tokens = self.sp_model.encode(text, out_type=str)
+ if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
+ return tokens
+
+ # 1. Encode string + prefix ex: "<unk> Hey"
+ tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
+ # 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
+ return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens
+
+ def _convert_token_to_id(self, token):
+ """Converts a token (str) in an id using the vocab."""
+ return self.sp_model.piece_to_id(token)
+
+ def _convert_id_to_token(self, index):
+ """Converts an index (integer) in a token (str) using the vocab."""
+ return self.sp_model.IdToPiece(index)
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.convert_tokens_to_string
+ def convert_tokens_to_string(self, tokens):
+ """Converts a sequence of tokens (string) in a single string."""
+ # since we manually add the prefix space, we have to remove it when decoding
+ if tokens[0].startswith(SPIECE_UNDERLINE) and self.add_prefix_space:
+ tokens[0] = tokens[0][1:]
+
+ current_sub_tokens = []
+ out_string = ""
+ prev_is_special = False
+ for token in tokens:
+ # make sure that special tokens are not decoded using sentencepiece model
+ if token in self.all_special_tokens:
+ if not prev_is_special:
+ out_string += " "
+ out_string += self.sp_model.decode(current_sub_tokens) + token
+ prev_is_special = True
+ current_sub_tokens = []
+ else:
+ current_sub_tokens.append(token)
+ prev_is_special = False
+ out_string += self.sp_model.decode(current_sub_tokens)
+ return out_string.strip()
+
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.save_vocabulary
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+ if not os.path.isdir(save_directory):
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+ return
+ out_vocab_file = os.path.join(
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+ )
+
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
+ copyfile(self.vocab_file, out_vocab_file)
+ elif not os.path.isfile(self.vocab_file):
+ with open(out_vocab_file, "wb") as fi:
+ content_spiece_model = self.sp_model.serialized_model_proto()
+ fi.write(content_spiece_model)
+
+ return (out_vocab_file,)
+
+ @add_end_docstrings(UDOP_ENCODE_KWARGS_DOCSTRING)
+ def __call__(
+ self,
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+ text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
+ boxes: Union[List[List[int]], List[List[List[int]]]] = None,
+ word_labels: Optional[Union[List[int], List[List[int]]]] = None,
+ text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+ text_pair_target: Optional[
+ Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
+ ] = None,
+ **kwargs,
+ ) -> BatchEncoding:
+ if text is None and text_target is None:
+ raise ValueError("You need to specify either `text` or `text_target`.")
+ if text is not None:
+ # The context manager will send the inputs as normal texts and not text_target, but we shouldn't change the
+ # input mode in this case.
+ if not self._in_target_context_manager:
+ self._switch_to_input_mode()
+ encodings = self.call_boxes(text=text, text_pair=text_pair, boxes=boxes, word_labels=word_labels, **kwargs)
+ if text_target is not None:
+ self._switch_to_target_mode()
+ target_encodings = self._call_one(text=text_target, text_pair=text_pair_target, **kwargs)
+ # Leave back tokenizer in input mode
+ self._switch_to_input_mode()
+
+ if text_target is None:
+ return encodings
+ elif text is None:
+ return target_encodings
+ else:
+ encodings["labels"] = target_encodings["input_ids"]
+ return encodings
+
+ def call_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
+ text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
+ boxes: Union[List[List[int]], List[List[List[int]]]] = None,
+ word_labels: Optional[Union[List[int], List[List[int]]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of
+ sequences with word-level normalized bounding boxes and optional labels.
+
+ Args:
+ text (`str`, `List[str]`, `List[List[str]]`):
+ The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings
+ (words of a single example or questions of a batch of examples) or a list of list of strings (batch of
+ words).
+ text_pair (`List[str]`, `List[List[str]]`):
+ The sequence or batch of sequences to be encoded. Each sequence should be a list of strings
+ (pretokenized string).
+ boxes (`List[List[int]]`, `List[List[List[int]]]`):
+ Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
+ word_labels (`List[int]`, `List[List[int]]`, *optional*):
+ Word-level integer labels (for token classification tasks such as FUNSD, CORD).
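+
+ Example (a minimal sketch of the pre-tokenized usage, which `__call__` dispatches to this method; the
+ checkpoint name, words and boxes below are illustrative assumptions, not values prescribed by this method):
+
+ ```python
+ >>> from transformers import UdopTokenizer
+
+ >>> # assumes a UDOP checkpoint such as "microsoft/udop-large" is available on the Hub
+ >>> tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")
+ >>> words = ["hello", "world"]
+ >>> boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # word-level boxes, normalized to the 0-1000 range
+ >>> encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
+ >>> sorted(encoding.keys())  # doctest: +SKIP
+ ['attention_mask', 'bbox', 'input_ids']
+ ```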
+ """
+
+ # Input type checking for clearer error
+ def _is_valid_text_input(t):
+ if isinstance(t, str):
+ # Strings are fine
+ return True
+ elif isinstance(t, (list, tuple)):
+ # List are fine as long as they are...
+ if len(t) == 0:
+ # ... empty
+ return True
+ elif isinstance(t[0], str):
+ # ... list of strings
+ return True
+ elif isinstance(t[0], (list, tuple)):
+ # ... list with an empty list or with a list of strings
+ return len(t[0]) == 0 or isinstance(t[0][0], str)
+ else:
+ return False
+ else:
+ return False
+
+ if text_pair is not None:
+ # in case text + text_pair are provided, text = questions, text_pair = words
+ if not _is_valid_text_input(text):
+ raise ValueError("text input must of type `str` (single example) or `List[str]` (batch of examples). ")
+ if not isinstance(text_pair, (list, tuple)):
+ raise ValueError(
+ "words must of type `List[str]` (single pretokenized example), "
+ "or `List[List[str]]` (batch of pretokenized examples)."
+ )
+ else:
+ # in case only text is provided => must be words
+ if not isinstance(text, (list, tuple)):
+ raise ValueError(
+ "Words must of type `List[str]` (single pretokenized example), "
+ "or `List[List[str]]` (batch of pretokenized examples)."
+ )
+
+ if text_pair is not None:
+ is_batched = isinstance(text, (list, tuple))
+ else:
+ is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
+
+ words = text if text_pair is None else text_pair
+ if boxes is None:
+ raise ValueError("You must provide corresponding bounding boxes")
+ if is_batched:
+ if len(words) != len(boxes):
+ raise ValueError("You must provide words and boxes for an equal amount of examples")
+ for words_example, boxes_example in zip(words, boxes):
+ if len(words_example) != len(boxes_example):
+ raise ValueError("You must provide as many words as there are bounding boxes")
+ else:
+ if len(words) != len(boxes):
+ raise ValueError("You must provide as many words as there are bounding boxes")
+
+ if is_batched:
+ if text_pair is not None and len(text) != len(text_pair):
+ raise ValueError(
+ f"batch length of `text`: {len(text)} does not match batch length of `text_pair`:"
+ f" {len(text_pair)}."
+ )
+ batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
+ is_pair = bool(text_pair is not None)
+ return self.batch_encode_plus_boxes(
+ batch_text_or_text_pairs=batch_text_or_text_pairs,
+ is_pair=is_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+ else:
+ return self.encode_plus_boxes(
+ text=text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ def batch_encode_plus_boxes(
+ self,
+ batch_text_or_text_pairs: Union[
+ List[TextInput],
+ List[TextInputPair],
+ List[PreTokenizedInput],
+ ],
+ is_pair: bool = None,
+ boxes: Optional[List[List[List[int]]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ is_split_into_words: bool = False,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
+
+ Args:
+ batch_text_or_text_pairs (`List[str]`, `List[Tuple[str, str]]`, `List[List[str]]`, `List[Tuple[List[str], List[str]]]`, and for not-fast tokenizers, also `List[List[int]]`, `List[Tuple[List[int], List[int]]]`):
+ Batch of sequences or pair of sequences to be encoded. This can be a list of
+ string/string-sequences/int-sequences or a list of pair of string/string-sequences/int-sequence (see
+ details in `encode_plus`).
+ """
+
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ pad_to_multiple_of=pad_to_multiple_of,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ return self._batch_encode_plus_boxes(
+ batch_text_or_text_pairs=batch_text_or_text_pairs,
+ is_pair=is_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ is_split_into_words=is_split_into_words,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ def encode_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput, EncodedInput],
+ text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ **kwargs,
+ ) -> List[int]:
+ """
+ Converts a string to a sequence of ids (integers), using the tokenizer and vocabulary. Same as doing
+ `self.convert_tokens_to_ids(self.tokenize(text))`.
+
+ Args:
+ text (`str`, `List[str]` or `List[int]`):
+ The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
+ `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ text_pair (`str`, `List[str]` or `List[int]`, *optional*):
+ Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
+ the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ """
+ encoded_inputs = self.encode_plus_boxes(
+ text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ return_tensors=return_tensors,
+ **kwargs,
+ )
+
+ return encoded_inputs["input_ids"]
+
+ def encode_plus_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput],
+ text_pair: Optional[PreTokenizedInput] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ is_split_into_words: bool = False,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Tokenize and prepare for the model a sequence or a pair of sequences.
+
+ <Tip warning={true}>
+
+ This method is deprecated, `__call__` should be used instead.
+
+ </Tip>
+
+ Args:
+ text (`str`, `List[str]` or `List[int]` (the latter only for not-fast tokenizers)):
+ The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
+ `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ text_pair (`str`, `List[str]` or `List[int]`, *optional*):
+ Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
+ the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ """
+
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ pad_to_multiple_of=pad_to_multiple_of,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ return self._encode_plus_boxes(
+ text=text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ is_split_into_words=is_split_into_words,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ def _batch_encode_plus_boxes(
+ self,
+ batch_text_or_text_pairs: Union[
+ List[TextInput],
+ List[TextInputPair],
+ List[PreTokenizedInput],
+ ],
+ is_pair: bool = None,
+ boxes: Optional[List[List[List[int]]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ if return_offsets_mapping:
+ raise NotImplementedError(
+ "return_offset_mapping is not available when using Python tokenizers. "
+ "To use this feature, change your tokenizer to one deriving from "
+ "transformers.PreTrainedTokenizerFast."
+ )
+
+ batch_outputs = self._batch_prepare_for_model_boxes(
+ batch_text_or_text_pairs=batch_text_or_text_pairs,
+ is_pair=is_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_attention_mask=return_attention_mask,
+ return_token_type_ids=return_token_type_ids,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_length=return_length,
+ return_tensors=return_tensors,
+ verbose=verbose,
+ )
+
+ return BatchEncoding(batch_outputs)
+
+ @add_end_docstrings(UDOP_ENCODE_KWARGS_DOCSTRING)
+ def _batch_prepare_for_model_boxes(
+ self,
+ batch_text_or_text_pairs,
+ is_pair: bool = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[str] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ ) -> BatchEncoding:
+ """
+ Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It
+ adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
+ manages a moving window (with user defined stride) for overflowing tokens.
+
+ Args:
+ batch_text_or_text_pairs: list of texts or text pairs to prepare for the model, aligned with `boxes`
+ """
+
+ batch_outputs = {}
+ for idx, example in enumerate(zip(batch_text_or_text_pairs, boxes)):
+ batch_text_or_text_pair, boxes_example = example
+ outputs = self.prepare_for_model_boxes(
+ batch_text_or_text_pair[0] if is_pair else batch_text_or_text_pair,
+ batch_text_or_text_pair[1] if is_pair else None,
+ boxes_example,
+ word_labels=word_labels[idx] if word_labels is not None else None,
+ add_special_tokens=add_special_tokens,
+ padding=PaddingStrategy.DO_NOT_PAD.value, # we pad in batch afterward
+ truncation=truncation_strategy.value,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=None, # we pad in batch afterward
+ return_attention_mask=False, # we pad in batch afterward
+ return_token_type_ids=return_token_type_ids,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_length=return_length,
+ return_tensors=None, # We convert the whole batch to tensors at the end
+ prepend_batch_axis=False,
+ verbose=verbose,
+ )
+
+ for key, value in outputs.items():
+ if key not in batch_outputs:
+ batch_outputs[key] = []
+ batch_outputs[key].append(value)
+
+ batch_outputs = self.pad(
+ batch_outputs,
+ padding=padding_strategy.value,
+ max_length=max_length,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_attention_mask=return_attention_mask,
+ )
+
+ batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
+
+ return batch_outputs
+
+ def _encode_plus_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput],
+ text_pair: Optional[PreTokenizedInput] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[int]] = None,
+ add_special_tokens: bool = True,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ if return_offsets_mapping:
+ raise NotImplementedError(
+ "return_offset_mapping is not available when using Python tokenizers. "
+ "To use this feature, change your tokenizer to one deriving from "
+ "transformers.PreTrainedTokenizerFast. "
+ "More information on available tokenizers at "
+ "https://github.com/huggingface/transformers/pull/2674"
+ )
+
+ return self.prepare_for_model_boxes(
+ text=text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding_strategy.value,
+ truncation=truncation_strategy.value,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ prepend_batch_axis=True,
+ return_attention_mask=return_attention_mask,
+ return_token_type_ids=return_token_type_ids,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_length=return_length,
+ verbose=verbose,
+ )
+
+ @add_end_docstrings(UDOP_ENCODE_KWARGS_DOCSTRING)
+ def prepare_for_model_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput],
+ text_pair: Optional[PreTokenizedInput] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[int]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ prepend_batch_axis: bool = False,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Prepares a sequence or a pair of sequences so that it can be used by the model. It adds special tokens,
+ truncates sequences if overflowing while taking into account the special tokens and manages a moving window
+ (with user defined stride) for overflowing tokens.
+
+ Word-level `boxes` are turned into token-level `bbox`. If provided, word-level `word_labels` are turned into
+ token-level `labels`. The word label is used for the first token of the word, while remaining tokens are
+ labeled with -100, such that they will be ignored by the loss function.
+
+ Args:
+ text (`str`, `List[str]`, `List[List[str]]`):
+ The first sequence to be encoded. This can be a string, a list of strings or a list of list of strings.
+ text_pair (`List[str]` or `List[int]`, *optional*):
+ Optional second sequence to be encoded. This can be a list of strings (words of a single example) or a
+ list of list of strings (words of a batch of examples).
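+
+ Example (an illustrative sketch of the label propagation described above; the checkpoint name and the
+ sub-word split of "nicely" are assumptions, so the exact output may differ):
+
+ ```python
+ >>> from transformers import UdopTokenizer
+
+ >>> tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")  # assumed checkpoint name
+ >>> encoding = tokenizer(
+ ...     ["hello", "nicely"],
+ ...     boxes=[[1, 2, 3, 4], [5, 6, 7, 8]],
+ ...     word_labels=[0, 1],
+ ... )
+ >>> # if "nicely" is split into two sub-tokens, only the first one keeps the label 1;
+ >>> # the remaining sub-token and the final special token are labeled -100
+ >>> encoding["labels"]  # doctest: +SKIP
+ [0, 1, -100, -100]
+ ```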
+ """
+
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ pad_to_multiple_of=pad_to_multiple_of,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ tokens = []
+ pair_tokens = []
+ token_boxes = []
+ pair_token_boxes = []
+ labels = []
+
+ if text_pair is None:
+ if word_labels is None:
+ # CASE 1: document image classification (training + inference) + CASE 2: token classification (inference)
+ for word, box in zip(text, boxes):
+ if len(word) < 1: # skip empty words
+ continue
+ word_tokens = self.tokenize(word)
+ tokens.extend(word_tokens)
+ token_boxes.extend([box] * len(word_tokens))
+ else:
+ # CASE 2: token classification (training)
+ for word, box, label in zip(text, boxes, word_labels):
+ if len(word) < 1: # skip empty words
+ continue
+ word_tokens = self.tokenize(word)
+ tokens.extend(word_tokens)
+ token_boxes.extend([box] * len(word_tokens))
+ if self.only_label_first_subword:
+ # Use the real label id for the first token of the word, and padding ids for the remaining tokens
+ labels.extend([label] + [self.pad_token_label] * (len(word_tokens) - 1))
+ else:
+ labels.extend([label] * len(word_tokens))
+ else:
+ # CASE 3: document visual question answering (inference)
+ # text = question
+ # text_pair = words
+ tokens = self.tokenize(text)
+ token_boxes = [self.pad_token_box for _ in range(len(tokens))]
+
+ for word, box in zip(text_pair, boxes):
+ if len(word) < 1: # skip empty words
+ continue
+ word_tokens = self.tokenize(word)
+ pair_tokens.extend(word_tokens)
+ pair_token_boxes.extend([box] * len(word_tokens))
+
+ # Create ids + pair_ids
+ ids = self.convert_tokens_to_ids(tokens)
+ pair_ids = self.convert_tokens_to_ids(pair_tokens) if pair_tokens else None
+
+ # Compute the total size of the returned encodings
+ pair = bool(pair_ids is not None)
+ len_ids = len(ids)
+ len_pair_ids = len(pair_ids) if pair else 0
+ total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
+
+ # Truncation: Handle max sequence length
+ overflowing_tokens = []
+ overflowing_token_boxes = []
+ overflowing_labels = []
+ if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
+ (
+ ids,
+ token_boxes,
+ pair_ids,
+ pair_token_boxes,
+ labels,
+ overflowing_tokens,
+ overflowing_token_boxes,
+ overflowing_labels,
+ ) = self.truncate_sequences(
+ ids,
+ token_boxes,
+ pair_ids=pair_ids,
+ pair_token_boxes=pair_token_boxes,
+ labels=labels,
+ num_tokens_to_remove=total_len - max_length,
+ truncation_strategy=truncation_strategy,
+ stride=stride,
+ )
+
+ if return_token_type_ids and not add_special_tokens:
+ raise ValueError(
+ "Asking to return token_type_ids while setting add_special_tokens to False "
+ "results in an undefined behavior. Please set add_special_tokens to True or "
+ "set return_token_type_ids to None."
+ )
+
+ # Load from model defaults
+ if return_token_type_ids is None:
+ return_token_type_ids = "token_type_ids" in self.model_input_names
+ if return_attention_mask is None:
+ return_attention_mask = "attention_mask" in self.model_input_names
+
+ encoded_inputs = {}
+
+ if return_overflowing_tokens:
+ encoded_inputs["overflowing_tokens"] = overflowing_tokens
+ encoded_inputs["overflowing_token_boxes"] = overflowing_token_boxes
+ encoded_inputs["overflowing_labels"] = overflowing_labels
+ encoded_inputs["num_truncated_tokens"] = total_len - max_length
+
+ # Add special tokens
+ if add_special_tokens:
+ sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
+ token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
+ token_boxes = token_boxes + [self.sep_token_box]
+ if pair_token_boxes:
+ pair_token_boxes = pair_token_boxes + [self.sep_token_box]
+ if labels:
+ labels = labels + [self.pad_token_label]
+ else:
+ sequence = ids + pair_ids if pair else ids
+ token_type_ids = [0] * len(ids) + ([0] * len(pair_ids) if pair else [])
+
+ # Build output dictionary
+ encoded_inputs["input_ids"] = sequence
+ encoded_inputs["bbox"] = token_boxes + pair_token_boxes
+ if return_token_type_ids:
+ encoded_inputs["token_type_ids"] = token_type_ids
+ if return_special_tokens_mask:
+ if add_special_tokens:
+ encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
+ else:
+ encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
+
+ if labels:
+ encoded_inputs["labels"] = labels
+
+ # Check lengths
+ self._eventual_warn_about_too_long_sequence(encoded_inputs["input_ids"], max_length, verbose)
+
+ # Padding
+ if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
+ encoded_inputs = self.pad(
+ encoded_inputs,
+ max_length=max_length,
+ padding=padding_strategy.value,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_attention_mask=return_attention_mask,
+ )
+
+ if return_length:
+ encoded_inputs["length"] = len(encoded_inputs["input_ids"])
+
+ batch_outputs = BatchEncoding(
+ encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
+ )
+
+ return batch_outputs
+
+ # Copied from transformers.models.layoutxlm.tokenization_layoutxlm.LayoutXLMTokenizer.truncate_sequences
+ def truncate_sequences(
+ self,
+ ids: List[int],
+ token_boxes: List[List[int]],
+ pair_ids: Optional[List[int]] = None,
+ pair_token_boxes: Optional[List[List[int]]] = None,
+ labels: Optional[List[int]] = None,
+ num_tokens_to_remove: int = 0,
+ truncation_strategy: Union[str, TruncationStrategy] = "longest_first",
+ stride: int = 0,
+ ) -> Tuple[List[int], List[int], List[int]]:
+ """
+ Truncates a sequence pair in-place following the strategy.
+
+ Args:
+ ids (`List[int]`):
+ Tokenized input ids of the first sequence. Can be obtained from a string by chaining the `tokenize` and
+ `convert_tokens_to_ids` methods.
+ token_boxes (`List[List[int]]`):
+ Bounding boxes of the first sequence.
+ pair_ids (`List[int]`, *optional*):
+ Tokenized input ids of the second sequence. Can be obtained from a string by chaining the `tokenize`
+ and `convert_tokens_to_ids` methods.
+ pair_token_boxes (`List[List[int]]`, *optional*):
+ Bounding boxes of the second sequence.
+ labels (`List[int]`, *optional*):
+ Labels of the first sequence (for token classification tasks).
+ num_tokens_to_remove (`int`, *optional*, defaults to 0):
+ Number of tokens to remove using the truncation strategy.
+ truncation_strategy (`str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
+ The strategy to follow for truncation. Can be:
+
+ - `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will truncate
+ token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a
+ batch of pairs) is provided.
+ - `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will only
+ truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+ - `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will only
+ truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+ - `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths greater
+ than the model maximum admissible input size).
+ stride (`int`, *optional*, defaults to 0):
+ If set to a positive number, the overflowing tokens returned will contain some tokens from the main
+ sequence returned. The value of this argument defines the number of additional tokens.
+
+ Returns:
+ `Tuple[List[int], List[int], List[int]]`: The truncated `ids`, the truncated `pair_ids` and the list of
+ overflowing tokens.
+ """
+ if num_tokens_to_remove <= 0:
+ return ids, token_boxes, pair_ids, pair_token_boxes, labels, [], [], []
+
+ if not isinstance(truncation_strategy, TruncationStrategy):
+ truncation_strategy = TruncationStrategy(truncation_strategy)
+
+ overflowing_tokens = []
+ overflowing_token_boxes = []
+ overflowing_labels = []
+ if truncation_strategy == TruncationStrategy.LONGEST_FIRST:
+ for _ in range(num_tokens_to_remove):
+ if pair_ids is None or len(ids) > len(pair_ids):
+ if not overflowing_tokens:
+ window_len = min(len(ids), stride + 1)
+ else:
+ window_len = 1
+ overflowing_tokens.extend(ids[-window_len:])
+ overflowing_token_boxes.extend(token_boxes[-window_len:])
+ overflowing_labels.extend(labels[-window_len:])
+ ids = ids[:-1]
+ token_boxes = token_boxes[:-1]
+ labels = labels[:-1]
+ else:
+ if not overflowing_tokens:
+ window_len = min(len(pair_ids), stride + 1)
+ else:
+ window_len = 1
+ overflowing_tokens.extend(pair_ids[-window_len:])
+ overflowing_token_boxes.extend(pair_token_boxes[-window_len:])
+ pair_ids = pair_ids[:-1]
+ pair_token_boxes = pair_token_boxes[:-1]
+ elif truncation_strategy == TruncationStrategy.ONLY_FIRST:
+ if len(ids) > num_tokens_to_remove:
+ window_len = min(len(ids), stride + num_tokens_to_remove)
+ overflowing_tokens = ids[-window_len:]
+ overflowing_token_boxes = token_boxes[-window_len:]
+ overflowing_labels = labels[-window_len:]
+ ids = ids[:-num_tokens_to_remove]
+ token_boxes = token_boxes[:-num_tokens_to_remove]
+ labels = labels[:-num_tokens_to_remove]
+ else:
+ logger.error(
+ f"We need to remove {num_tokens_to_remove} to truncate the input "
+ f"but the first sequence has a length {len(ids)}. "
+ f"Please select another truncation strategy than {truncation_strategy}, "
+ "for instance 'longest_first' or 'only_second'."
+ )
+ elif truncation_strategy == TruncationStrategy.ONLY_SECOND and pair_ids is not None:
+ if len(pair_ids) > num_tokens_to_remove:
+ window_len = min(len(pair_ids), stride + num_tokens_to_remove)
+ overflowing_tokens = pair_ids[-window_len:]
+ overflowing_token_boxes = pair_token_boxes[-window_len:]
+ pair_ids = pair_ids[:-num_tokens_to_remove]
+ pair_token_boxes = pair_token_boxes[:-num_tokens_to_remove]
+ else:
+ logger.error(
+ f"We need to remove {num_tokens_to_remove} to truncate the input "
+ f"but the second sequence has a length {len(pair_ids)}. "
+ f"Please select another truncation strategy than {truncation_strategy}, "
+ "for instance 'longest_first' or 'only_first'."
+ )
+
+ return (
+ ids,
+ token_boxes,
+ pair_ids,
+ pair_token_boxes,
+ labels,
+ overflowing_tokens,
+ overflowing_token_boxes,
+ overflowing_labels,
+ )
+
+ # Copied from transformers.models.layoutxlm.tokenization_layoutxlm.LayoutXLMTokenizer._pad
+ def _pad(
+ self,
+ encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
+ max_length: Optional[int] = None,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ pad_to_multiple_of: Optional[int] = None,
+ return_attention_mask: Optional[bool] = None,
+ ) -> dict:
+ """
+ Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
+
+ Args:
+ encoded_inputs:
+ Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
+ max_length: maximum length of the returned list and optionally padding length (see below).
+ Will truncate by taking into account the special tokens.
+ padding_strategy: PaddingStrategy to use for padding.
+
+ - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
+ - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
+ - PaddingStrategy.DO_NOT_PAD: Do not pad
+ The tokenizer padding sides are defined in self.padding_side:
+
+ - 'left': pads on the left of the sequences
+ - 'right': pads on the right of the sequences
+ pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+ This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
+ `>= 7.5` (Volta).
+ return_attention_mask:
+ (optional) Set to False to avoid returning attention mask (default: set to model specifics)
+ """
+ # Load from model defaults
+ if return_attention_mask is None:
+ return_attention_mask = "attention_mask" in self.model_input_names
+
+ required_input = encoded_inputs[self.model_input_names[0]]
+
+ if padding_strategy == PaddingStrategy.LONGEST:
+ max_length = len(required_input)
+
+ if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
+ max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
+
+ needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length
+
+ # Initialize attention mask if not present.
+ if return_attention_mask and "attention_mask" not in encoded_inputs:
+ encoded_inputs["attention_mask"] = [1] * len(required_input)
+
+ if needs_to_be_padded:
+ difference = max_length - len(required_input)
+ if self.padding_side == "right":
+ if return_attention_mask:
+ encoded_inputs["attention_mask"] = encoded_inputs["attention_mask"] + [0] * difference
+ if "token_type_ids" in encoded_inputs:
+ encoded_inputs["token_type_ids"] = (
+ encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference
+ )
+ if "bbox" in encoded_inputs:
+ encoded_inputs["bbox"] = encoded_inputs["bbox"] + [self.pad_token_box] * difference
+ if "labels" in encoded_inputs:
+ encoded_inputs["labels"] = encoded_inputs["labels"] + [self.pad_token_label] * difference
+ if "special_tokens_mask" in encoded_inputs:
+ encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
+ encoded_inputs[self.model_input_names[0]] = required_input + [self.pad_token_id] * difference
+ elif self.padding_side == "left":
+ if return_attention_mask:
+ encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
+ if "token_type_ids" in encoded_inputs:
+ encoded_inputs["token_type_ids"] = [self.pad_token_type_id] * difference + encoded_inputs[
+ "token_type_ids"
+ ]
+ if "bbox" in encoded_inputs:
+ encoded_inputs["bbox"] = [self.pad_token_box] * difference + encoded_inputs["bbox"]
+ if "labels" in encoded_inputs:
+ encoded_inputs["labels"] = [self.pad_token_label] * difference + encoded_inputs["labels"]
+ if "special_tokens_mask" in encoded_inputs:
+ encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
+ encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input
+ else:
+ raise ValueError("Invalid padding strategy:" + str(self.padding_side))
+
+ return encoded_inputs
diff --git a/src/transformers/models/udop/tokenization_udop_fast.py b/src/transformers/models/udop/tokenization_udop_fast.py
new file mode 100644
index 00000000000000..ee0697595508a7
--- /dev/null
+++ b/src/transformers/models/udop/tokenization_udop_fast.py
@@ -0,0 +1,1012 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License
+""" Tokenization classes for UDOP model."""
+
+
+import os
+from shutil import copyfile
+from typing import Dict, List, Optional, Tuple, Union
+
+from ...tokenization_utils_base import (
+ BatchEncoding,
+ EncodedInput,
+ PreTokenizedInput,
+ TextInput,
+ TextInputPair,
+ TruncationStrategy,
+)
+from ...tokenization_utils_fast import PreTrainedTokenizerFast
+from ...utils import PaddingStrategy, TensorType, add_end_docstrings, is_sentencepiece_available, logging
+from ..udop.tokenization_udop import (
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES,
+ PRETRAINED_VOCAB_FILES_MAP,
+ VOCAB_FILES_NAMES,
+)
+
+
+if is_sentencepiece_available():
+ from .tokenization_udop import UdopTokenizer
+else:
+ UdopTokenizer = None
+
+
+logger = logging.get_logger(__name__)
+
+UDOP_ENCODE_KWARGS_DOCSTRING = r"""
+ add_special_tokens (`bool`, *optional*, defaults to `True`):
+ Whether or not to encode the sequences with the special tokens relative to their model.
+ padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
+ Activates and controls padding. Accepts the following values:
+
+ - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+ sequence is provided).
+ - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
+ acceptable input length for the model if that argument is not provided.
+ - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
+ lengths).
+ truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
+ Activates and controls truncation. Accepts the following values:
+
+ - `True` or `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or
+ to the maximum acceptable input length for the model if that argument is not provided. This will
+ truncate token by token, removing a token from the longest sequence in the pair if a pair of
+ sequences (or a batch of pairs) is provided.
+ - `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will only
+ truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+ - `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
+ maximum acceptable input length for the model if that argument is not provided. This will only
+ truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+ - `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths
+ greater than the model maximum admissible input size).
+ max_length (`int`, *optional*):
+ Controls the maximum length to use by one of the truncation/padding parameters.
+
+ If left unset or set to `None`, this will use the predefined model maximum length if a maximum length
+ is required by one of the truncation/padding parameters. If the model has no specific maximum input
+ length (like XLNet) truncation/padding to a maximum length will be deactivated.
+ stride (`int`, *optional*, defaults to 0):
+ If set to a number along with `max_length`, the overflowing tokens returned when
+ `return_overflowing_tokens=True` will contain some tokens from the end of the truncated sequence
+ returned to provide some overlap between truncated and overflowing sequences. The value of this
+ argument defines the number of overlapping tokens.
+ pad_to_multiple_of (`int`, *optional*):
+ If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
+ the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta).
+ return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
+ If set, will return tensors instead of list of python integers. Acceptable values are:
+
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
+ - `'np'`: Return Numpy `np.ndarray` objects.
+ return_token_type_ids (`bool`, *optional*):
+ Whether to return token type IDs. If left to the default, will return the token type IDs according to
+ the specific tokenizer's default, defined by the `return_outputs` attribute.
+
+ [What are token type IDs?](../glossary#token-type-ids)
+ return_attention_mask (`bool`, *optional*):
+ Whether to return the attention mask. If left to the default, will return the attention mask according
+ to the specific tokenizer's default, defined by the `return_outputs` attribute.
+
+ [What are attention masks?](../glossary#attention-mask)
+ return_overflowing_tokens (`bool`, *optional*, defaults to `False`):
+ Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
+ of pairs) is provided with `truncation_strategy = longest_first` or `True`, an error is raised instead
+ of returning overflowing tokens.
+ return_special_tokens_mask (`bool`, *optional*, defaults to `False`):
+ Whether or not to return special tokens mask information.
+ return_offsets_mapping (`bool`, *optional*, defaults to `False`):
+ Whether or not to return `(char_start, char_end)` for each token.
+
+ This is only available on fast tokenizers inheriting from [`PreTrainedTokenizerFast`], if using
+ Python's tokenizer, this method will raise `NotImplementedError`.
+ return_length (`bool`, *optional*, defaults to `False`):
+ Whether or not to return the lengths of the encoded inputs.
+ verbose (`bool`, *optional*, defaults to `True`):
+ Whether or not to print more information and warnings.
+ **kwargs: passed to the `self.tokenize()` method
+
+ Return:
+ [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
+
+ - **input_ids** -- List of token ids to be fed to a model.
+
+ [What are input IDs?](../glossary#input-ids)
+
+ - **bbox** -- List of bounding boxes to be fed to a model.
+
+ - **token_type_ids** -- List of token type ids to be fed to a model (when `return_token_type_ids=True` or
+ if *"token_type_ids"* is in `self.model_input_names`).
+
+ [What are token type IDs?](../glossary#token-type-ids)
+
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names`).
+
+ [What are attention masks?](../glossary#attention-mask)
+
+ - **labels** -- List of labels to be fed to a model. (when `word_labels` is specified).
+ - **overflowing_tokens** -- List of overflowing tokens sequences (when a `max_length` is specified and
+ `return_overflowing_tokens=True`).
+ - **num_truncated_tokens** -- Number of tokens truncated (when a `max_length` is specified and
+ `return_overflowing_tokens=True`).
+ - **special_tokens_mask** -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
+ regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
+ - **length** -- The length of the inputs (when `return_length=True`).
+"""
+
+
+class UdopTokenizerFast(PreTrainedTokenizerFast):
+ """
+ Construct a "fast" UDOP tokenizer (backed by HuggingFace's *tokenizers* library). Adapted from
+ [`LayoutXLMTokenizer`] and [`T5Tokenizer`]. Based on
+ [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
+
+ This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+ refer to this superclass for more information regarding those methods.
+
+ Args:
+ vocab_file (`str`, *optional*):
+ Path to the vocabulary file.
+
+ tokenizer_file (`str`, *optional*):
+ Path to the tokenizer file.
+ eos_token (`str`, *optional*, defaults to `"</s>"`):
+ The end of sequence token.
+
+ <Tip>
+
+ When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+ The token used is the `sep_token`.
+
+ </Tip>
+
+ sep_token (`str`, *optional*, defaults to `"</s>"`):
+ The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
+ sequence classification or for a text and a question for question answering. It is also used as the last
+ token of a sequence built with special tokens.
+ unk_token (`str`, *optional*, defaults to `"<unk>"`):
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+ token instead.
+ pad_token (`str`, *optional*, defaults to `"<pad>"`):
+ The token used for padding, for example when batching sequences of different lengths.
+ sep_token_box (`List[int]`, *optional*, defaults to `[1000, 1000, 1000, 1000]`):
+ The bounding box to use for the special [SEP] token.
+ pad_token_box (`List[int]`, *optional*, defaults to `[0, 0, 0, 0]`):
+ The bounding box to use for the special [PAD] token.
+ pad_token_label (`int`, *optional*, defaults to -100):
+ The label to use for padding tokens. Defaults to -100, which is the `ignore_index` of PyTorch's
+ CrossEntropyLoss.
+ only_label_first_subword (`bool`, *optional*, defaults to `True`):
+ Whether or not to only label the first subword, in case word labels are provided.
+ additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
+ Additional special tokens used by the tokenizer.
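+
+ Example (a minimal sketch, assuming a hypothetical UDOP checkpoint name; the document words, boxes and
+ target text are illustrative only):
+
+ ```python
+ >>> from transformers import UdopTokenizerFast
+
+ >>> tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")  # assumed checkpoint name
+ >>> words = ["Invoice", "total:", "49.99"]
+ >>> boxes = [[100, 50, 200, 80], [210, 50, 300, 80], [310, 50, 380, 80]]
+ >>> # encode the document words with their boxes, and a target text used as generation labels
+ >>> encoding = tokenizer(words, boxes=boxes, text_target="49.99", return_tensors="pt")
+ >>> sorted(encoding.keys())  # doctest: +SKIP
+ ['attention_mask', 'bbox', 'input_ids', 'labels']
+ ```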
+ """
+
+ vocab_files_names = VOCAB_FILES_NAMES
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+ model_input_names = ["input_ids", "attention_mask"]
+ slow_tokenizer_class = UdopTokenizer
+
+ def __init__(
+ self,
+ vocab_file=None,
+ tokenizer_file=None,
+ eos_token="",
+ sep_token="",
+ unk_token="",
+ pad_token="",
+ sep_token_box=[1000, 1000, 1000, 1000],
+ pad_token_box=[0, 0, 0, 0],
+ pad_token_label=-100,
+ only_label_first_subword=True,
+ additional_special_tokens=None,
+ **kwargs,
+ ):
+ super().__init__(
+ vocab_file,
+ tokenizer_file=tokenizer_file,
+ eos_token=eos_token,
+ sep_token=sep_token,
+ unk_token=unk_token,
+ pad_token=pad_token,
+ sep_token_box=sep_token_box,
+ pad_token_box=pad_token_box,
+ pad_token_label=pad_token_label,
+ only_label_first_subword=only_label_first_subword,
+ additional_special_tokens=additional_special_tokens,
+ **kwargs,
+ )
+
+ self.vocab_file = vocab_file
+
+ # additional properties
+ self.sep_token_box = sep_token_box
+ self.pad_token_box = pad_token_box
+ self.pad_token_label = pad_token_label
+ self.only_label_first_subword = only_label_first_subword
+
+ @property
+ def can_save_slow_tokenizer(self) -> bool:
+ return os.path.isfile(self.vocab_file) if self.vocab_file else False
+
+ @add_end_docstrings(UDOP_ENCODE_KWARGS_DOCSTRING)
+ def __call__(
+ self,
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+ text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
+ boxes: Union[List[List[int]], List[List[List[int]]]] = None,
+ word_labels: Optional[Union[List[int], List[List[int]]]] = None,
+ text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+ text_pair_target: Optional[
+ Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
+ ] = None,
+ **kwargs,
+ ) -> BatchEncoding:
+ if text is None and text_target is None:
+ raise ValueError("You need to specify either `text` or `text_target`.")
+ if text is not None:
+ # The context manager will send the inputs as normal texts and not text_target, but we shouldn't change the
+ # input mode in this case.
+ if not self._in_target_context_manager:
+ self._switch_to_input_mode()
+ encodings = self.call_boxes(text=text, text_pair=text_pair, boxes=boxes, word_labels=word_labels, **kwargs)
+ if text_target is not None:
+ self._switch_to_target_mode()
+ target_encodings = self._call_one(text=text_target, text_pair=text_pair_target, **kwargs)
+ # Leave back tokenizer in input mode
+ self._switch_to_input_mode()
+
+ if text_target is None:
+ return encodings
+ elif text is None:
+ return target_encodings
+ else:
+ encodings["labels"] = target_encodings["input_ids"]
+ return encodings
+
+ @add_end_docstrings(UDOP_ENCODE_KWARGS_DOCSTRING)
+ def call_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
+ text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
+ boxes: Union[List[List[int]], List[List[List[int]]]] = None,
+ word_labels: Optional[Union[List[int], List[List[int]]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of
+ sequences with word-level normalized bounding boxes and optional labels.
+
+ Args:
+ text (`str`, `List[str]`, `List[List[str]]`):
+ The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings
+ (words of a single example or questions of a batch of examples) or a list of list of strings (batch of
+ words).
+ text_pair (`List[str]`, `List[List[str]]`):
+ The sequence or batch of sequences to be encoded. Each sequence should be a list of strings
+ (pretokenized string).
+ boxes (`List[List[int]]`, `List[List[List[int]]]`):
+ Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
+ word_labels (`List[int]`, `List[List[int]]`, *optional*):
+ Word-level integer labels (for token classification tasks such as FUNSD, CORD).
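+
+        Example (illustrative usage sketch; the question, words and boxes are made up, and the checkpoint name
+        follows the integration tests in this patch):
+
+        ```python
+        >>> from transformers import UdopTokenizerFast
+
+        >>> tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+        >>> # question answering style input: `text` is the question, `text_pair` holds the document words
+        >>> question = "In which year is the report made?"
+        >>> words = ["REPORT", "AND", "ACCOUNTS", "2013"]
+        >>> boxes = [[100, 50, 200, 70], [210, 50, 240, 70], [250, 50, 340, 70], [350, 50, 400, 70]]
+        >>> encoding = tokenizer(question, words, boxes=boxes, return_tensors="pt")
+        >>> sorted(encoding.keys())
+        ['attention_mask', 'bbox', 'input_ids']
+        ```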
+ """
+
+ # Input type checking for clearer error
+ def _is_valid_text_input(t):
+ if isinstance(t, str):
+ # Strings are fine
+ return True
+ elif isinstance(t, (list, tuple)):
+ # List are fine as long as they are...
+ if len(t) == 0:
+ # ... empty
+ return True
+ elif isinstance(t[0], str):
+ # ... list of strings
+ return True
+ elif isinstance(t[0], (list, tuple)):
+ # ... list with an empty list or with a list of strings
+ return len(t[0]) == 0 or isinstance(t[0][0], str)
+ else:
+ return False
+ else:
+ return False
+
+ if text_pair is not None:
+ # in case text + text_pair are provided, text = questions, text_pair = words
+ if not _is_valid_text_input(text):
+                raise ValueError("text input must be of type `str` (single example) or `List[str]` (batch of examples). ")
+ if not isinstance(text_pair, (list, tuple)):
+ raise ValueError(
+                    "words must be of type `List[str]` (single pretokenized example), "
+ "or `List[List[str]]` (batch of pretokenized examples)."
+ )
+ else:
+ # in case only text is provided => must be words
+ if not isinstance(text, (list, tuple)):
+ raise ValueError(
+                    "Words must be of type `List[str]` (single pretokenized example), "
+ "or `List[List[str]]` (batch of pretokenized examples)."
+ )
+
+ if text_pair is not None:
+ is_batched = isinstance(text, (list, tuple))
+ else:
+ is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
+
+ words = text if text_pair is None else text_pair
+ if boxes is None:
+ raise ValueError("You must provide corresponding bounding boxes")
+ if is_batched:
+ if len(words) != len(boxes):
+ raise ValueError("You must provide words and boxes for an equal amount of examples")
+ for words_example, boxes_example in zip(words, boxes):
+ if len(words_example) != len(boxes_example):
+ raise ValueError("You must provide as many words as there are bounding boxes")
+ else:
+ if len(words) != len(boxes):
+ raise ValueError("You must provide as many words as there are bounding boxes")
+
+ if is_batched:
+ if text_pair is not None and len(text) != len(text_pair):
+ raise ValueError(
+ f"batch length of `text`: {len(text)} does not match batch length of `text_pair`:"
+ f" {len(text_pair)}."
+ )
+ batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
+ is_pair = bool(text_pair is not None)
+ return self.batch_encode_plus_boxes(
+ batch_text_or_text_pairs=batch_text_or_text_pairs,
+ is_pair=is_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+ else:
+ return self.encode_plus_boxes(
+ text=text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ # Copied from transformers.models.layoutxlm.tokenization_layoutxlm_fast.LayoutXLMTokenizerFast.tokenize
+ def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> List[str]:
+ batched_input = [(text, pair)] if pair else [text]
+ encodings = self._tokenizer.encode_batch(
+ batched_input, add_special_tokens=add_special_tokens, is_pretokenized=False, **kwargs
+ )
+
+ return encodings[0].tokens
+
+ def batch_encode_plus_boxes(
+ self,
+ batch_text_or_text_pairs: Union[
+ List[TextInput],
+ List[TextInputPair],
+ List[PreTokenizedInput],
+ ],
+ is_pair: bool = None,
+ boxes: Optional[List[List[List[int]]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ is_split_into_words: bool = False,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
+
+        <Tip warning={true}>
+
+        This method is deprecated, `__call__` should be used instead.
+
+        </Tip>
+
+ Args:
+ batch_text_or_text_pairs (`List[str]`, `List[Tuple[str, str]]`, `List[List[str]]`, `List[Tuple[List[str], List[str]]]`, and for not-fast tokenizers, also `List[List[int]]`, `List[Tuple[List[int], List[int]]]`):
+ Batch of sequences or pair of sequences to be encoded. This can be a list of
+ string/string-sequences/int-sequences or a list of pair of string/string-sequences/int-sequence (see
+ details in `encode_plus`).
+ """
+
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ pad_to_multiple_of=pad_to_multiple_of,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ return self._batch_encode_plus_boxes(
+ batch_text_or_text_pairs=batch_text_or_text_pairs,
+ is_pair=is_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ is_split_into_words=is_split_into_words,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ def _batch_encode_plus_boxes(
+ self,
+ batch_text_or_text_pairs: Union[
+ List[TextInput],
+ List[TextInputPair],
+ List[PreTokenizedInput],
+ ],
+ is_pair: bool = None,
+ boxes: Optional[List[List[List[int]]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[str] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ if not isinstance(batch_text_or_text_pairs, list):
+ raise TypeError(f"batch_text_or_text_pairs has to be a list (got {type(batch_text_or_text_pairs)})")
+
+ # Set the truncation and padding strategy and restore the initial configuration
+ self.set_truncation_and_padding(
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ )
+
+ if is_pair:
+ batch_text_or_text_pairs = [(text.split(), text_pair) for text, text_pair in batch_text_or_text_pairs]
+
+ encodings = self._tokenizer.encode_batch(
+ batch_text_or_text_pairs,
+ add_special_tokens=add_special_tokens,
+ is_pretokenized=True, # we set this to True as LayoutLMv2 always expects pretokenized inputs
+ )
+
+ # Convert encoding to dict
+ # `Tokens` has type: Tuple[
+ # List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]],
+ # List[EncodingFast]
+ # ]
+ # with nested dimensions corresponding to batch, overflows, sequence length
+ tokens_and_encodings = [
+ self._convert_encoding(
+ encoding=encoding,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=True
+ if word_labels is not None
+ else return_offsets_mapping, # we use offsets to create the labels
+ return_length=return_length,
+ verbose=verbose,
+ )
+ for encoding in encodings
+ ]
+
+ # Convert the output to have dict[list] from list[dict] and remove the additional overflows dimension
+ # From (variable) shape (batch, overflows, sequence length) to ~ (batch * overflows, sequence length)
+ # (we say ~ because the number of overflow varies with the example in the batch)
+ #
+ # To match each overflowing sample with the original sample in the batch
+ # we add an overflow_to_sample_mapping array (see below)
+ sanitized_tokens = {}
+ for key in tokens_and_encodings[0][0].keys():
+ stack = [e for item, _ in tokens_and_encodings for e in item[key]]
+ sanitized_tokens[key] = stack
+ sanitized_encodings = [e for _, item in tokens_and_encodings for e in item]
+
+ # If returning overflowing tokens, we need to return a mapping
+ # from the batch idx to the original sample
+ if return_overflowing_tokens:
+ overflow_to_sample_mapping = []
+ for i, (toks, _) in enumerate(tokens_and_encodings):
+ overflow_to_sample_mapping += [i] * len(toks["input_ids"])
+ sanitized_tokens["overflow_to_sample_mapping"] = overflow_to_sample_mapping
+
+ for input_ids in sanitized_tokens["input_ids"]:
+ self._eventual_warn_about_too_long_sequence(input_ids, max_length, verbose)
+
+ # create the token boxes
+ token_boxes = []
+ for batch_index in range(len(sanitized_tokens["input_ids"])):
+ if return_overflowing_tokens:
+ original_index = sanitized_tokens["overflow_to_sample_mapping"][batch_index]
+ else:
+ original_index = batch_index
+ token_boxes_example = []
+ for id, sequence_id, word_id in zip(
+ sanitized_tokens["input_ids"][batch_index],
+ sanitized_encodings[batch_index].sequence_ids,
+ sanitized_encodings[batch_index].word_ids,
+ ):
+ if word_id is not None:
+ if is_pair and sequence_id == 0:
+ token_boxes_example.append(self.pad_token_box)
+ else:
+ token_boxes_example.append(boxes[original_index][word_id])
+ else:
+ if id == self.sep_token_id:
+ token_boxes_example.append(self.sep_token_box)
+ elif id == self.pad_token_id:
+ token_boxes_example.append(self.pad_token_box)
+ else:
+ raise ValueError("Id not recognized")
+ token_boxes.append(token_boxes_example)
+
+ sanitized_tokens["bbox"] = token_boxes
+
+ # optionally, create the labels
+ if word_labels is not None:
+ labels = []
+ for batch_index in range(len(sanitized_tokens["input_ids"])):
+ if return_overflowing_tokens:
+ original_index = sanitized_tokens["overflow_to_sample_mapping"][batch_index]
+ else:
+ original_index = batch_index
+ labels_example = []
+ previous_token_empty = False
+ for id, offset, word_id in zip(
+ sanitized_tokens["input_ids"][batch_index],
+ sanitized_tokens["offset_mapping"][batch_index],
+ sanitized_encodings[batch_index].word_ids,
+ ):
+ if word_id is not None:
+ if self.only_label_first_subword:
+ if offset[0] == 0 and not previous_token_empty:
+ # Use the real label id for the first token of the word, and padding ids for the remaining tokens
+ labels_example.append(word_labels[original_index][word_id])
+ else:
+ labels_example.append(self.pad_token_label)
+ else:
+ labels_example.append(word_labels[original_index][word_id])
+ if self.decode(id) == "":
+ previous_token_empty = True
+ else:
+ previous_token_empty = False
+ else:
+ labels_example.append(self.pad_token_label)
+ labels.append(labels_example)
+
+ sanitized_tokens["labels"] = labels
+ # finally, remove offsets if the user didn't want them
+ if not return_offsets_mapping:
+ del sanitized_tokens["offset_mapping"]
+
+ return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
+
+ def _encode_plus_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput],
+ text_pair: Optional[PreTokenizedInput] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[int]] = None,
+ add_special_tokens: bool = True,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[bool] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ # make it a batched input
+ # 2 options:
+ # 1) only text, in case text must be a list of str
+ # 2) text + text_pair, in which case text = str and text_pair a list of str
+ batched_input = [(text, text_pair)] if text_pair else [text]
+ batched_boxes = [boxes]
+ batched_word_labels = [word_labels] if word_labels is not None else None
+ batched_output = self._batch_encode_plus_boxes(
+ batched_input,
+ is_pair=bool(text_pair is not None),
+ boxes=batched_boxes,
+ word_labels=batched_word_labels,
+ add_special_tokens=add_special_tokens,
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+        # If `return_tensors` is None, we can remove the leading batch axis
+ # Overflowing tokens are returned as a batch of output so we keep them in this case
+ if return_tensors is None and not return_overflowing_tokens:
+ batched_output = BatchEncoding(
+ {
+ key: value[0] if len(value) > 0 and isinstance(value[0], list) else value
+ for key, value in batched_output.items()
+ },
+ batched_output.encodings,
+ )
+
+ self._eventual_warn_about_too_long_sequence(batched_output["input_ids"], max_length, verbose)
+
+ return batched_output
+
+ def encode_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput, EncodedInput],
+ text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ **kwargs,
+ ) -> List[int]:
+ """
+        Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing
+        `self.convert_tokens_to_ids(self.tokenize(text))`.
+
+        Args:
+ text (`str`, `List[str]` or `List[int]`):
+ The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
+ `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ text_pair (`str`, `List[str]` or `List[int]`, *optional*):
+ Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
+ the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
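+
+        Example (illustrative sketch; the expected ids assume the `microsoft/udop-large` vocabulary and mirror
+        the "hello world" ids asserted in the processor tests of this patch):
+
+        ```python
+        >>> from transformers import UdopTokenizerFast
+
+        >>> tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+        >>> tokenizer.encode_boxes(["hello", "world"], boxes=[[1, 2, 3, 4], [5, 6, 7, 8]])
+        [21820, 296, 1]
+        ```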
+ """
+ encoded_inputs = self.encode_plus_boxes(
+ text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ stride=stride,
+ return_tensors=return_tensors,
+ **kwargs,
+ )
+
+ return encoded_inputs["input_ids"]
+
+ def encode_plus_boxes(
+ self,
+ text: Union[TextInput, PreTokenizedInput],
+ text_pair: Optional[PreTokenizedInput] = None,
+ boxes: Optional[List[List[int]]] = None,
+ word_labels: Optional[List[List[int]]] = None,
+ add_special_tokens: bool = True,
+ padding: Union[bool, str, PaddingStrategy] = False,
+ truncation: Union[bool, str, TruncationStrategy] = None,
+ max_length: Optional[int] = None,
+ stride: int = 0,
+ is_split_into_words: bool = False,
+ pad_to_multiple_of: Optional[int] = None,
+ return_tensors: Optional[Union[str, TensorType]] = None,
+ return_token_type_ids: Optional[bool] = None,
+ return_attention_mask: Optional[bool] = None,
+ return_overflowing_tokens: bool = False,
+ return_special_tokens_mask: bool = False,
+ return_offsets_mapping: bool = False,
+ return_length: bool = False,
+ verbose: bool = True,
+ **kwargs,
+ ) -> BatchEncoding:
+ """
+ Tokenize and prepare for the model a sequence or a pair of sequences.
+
+        <Tip warning={true}>
+
+        This method is deprecated, `__call__` should be used instead.
+
+        </Tip>
+
+ Args:
+ text (`str`, `List[str]` or `List[int]` (the latter only for not-fast tokenizers)):
+ The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
+ `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ text_pair (`str`, `List[str]` or `List[int]`, *optional*):
+ Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
+ the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
+ method).
+ """
+
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
+ padding=padding,
+ truncation=truncation,
+ max_length=max_length,
+ pad_to_multiple_of=pad_to_multiple_of,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ return self._encode_plus_boxes(
+ text=text,
+ text_pair=text_pair,
+ boxes=boxes,
+ word_labels=word_labels,
+ add_special_tokens=add_special_tokens,
+ padding_strategy=padding_strategy,
+ truncation_strategy=truncation_strategy,
+ max_length=max_length,
+ stride=stride,
+ is_split_into_words=is_split_into_words,
+ pad_to_multiple_of=pad_to_multiple_of,
+ return_tensors=return_tensors,
+ return_token_type_ids=return_token_type_ids,
+ return_attention_mask=return_attention_mask,
+ return_overflowing_tokens=return_overflowing_tokens,
+ return_special_tokens_mask=return_special_tokens_mask,
+ return_offsets_mapping=return_offsets_mapping,
+ return_length=return_length,
+ verbose=verbose,
+ **kwargs,
+ )
+
+ # Copied from transformers.models.layoutxlm.tokenization_layoutxlm_fast.LayoutXLMTokenizerFast._pad
+ def _pad(
+ self,
+ encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
+ max_length: Optional[int] = None,
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+ pad_to_multiple_of: Optional[int] = None,
+ return_attention_mask: Optional[bool] = None,
+ ) -> dict:
+ """
+ Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
+
+ Args:
+ encoded_inputs:
+ Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
+ max_length: maximum length of the returned list and optionally padding length (see below).
+ Will truncate by taking into account the special tokens.
+ padding_strategy: PaddingStrategy to use for padding.
+
+                - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
+ - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
+ - PaddingStrategy.DO_NOT_PAD: Do not pad
+ The tokenizer padding sides are defined in self.padding_side:
+
+ - 'left': pads on the left of the sequences
+ - 'right': pads on the right of the sequences
+ pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+ This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
+ `>= 7.5` (Volta).
+ return_attention_mask:
+ (optional) Set to False to avoid returning attention mask (default: set to model specifics)
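+
+        Example (illustrative sketch of the extra keys handled here; it assumes `padding_side == "right"` and a
+        pad token id of 0, as in the `microsoft/udop-large` checkpoint used in the tests of this patch):
+
+        ```python
+        >>> from transformers import UdopTokenizerFast
+        >>> from transformers.utils import PaddingStrategy
+
+        >>> tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+        >>> encoded = {"input_ids": [21820, 296, 1], "bbox": [[1, 2, 3, 4], [5, 6, 7, 8], [1000, 1000, 1000, 1000]]}
+        >>> padded = tokenizer._pad(encoded, max_length=5, padding_strategy=PaddingStrategy.MAX_LENGTH)
+        >>> padded["input_ids"]
+        [21820, 296, 1, 0, 0]
+        >>> padded["bbox"][-2:]  # padded with `pad_token_box`
+        [[0, 0, 0, 0], [0, 0, 0, 0]]
+        ```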
+ """
+ # Load from model defaults
+ if return_attention_mask is None:
+ return_attention_mask = "attention_mask" in self.model_input_names
+
+ required_input = encoded_inputs[self.model_input_names[0]]
+
+ if padding_strategy == PaddingStrategy.LONGEST:
+ max_length = len(required_input)
+
+ if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
+ max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
+
+ needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length
+
+ # Initialize attention mask if not present.
+ if return_attention_mask and "attention_mask" not in encoded_inputs:
+ encoded_inputs["attention_mask"] = [1] * len(required_input)
+
+ if needs_to_be_padded:
+ difference = max_length - len(required_input)
+ if self.padding_side == "right":
+ if return_attention_mask:
+ encoded_inputs["attention_mask"] = encoded_inputs["attention_mask"] + [0] * difference
+ if "token_type_ids" in encoded_inputs:
+ encoded_inputs["token_type_ids"] = (
+ encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference
+ )
+ if "bbox" in encoded_inputs:
+ encoded_inputs["bbox"] = encoded_inputs["bbox"] + [self.pad_token_box] * difference
+ if "labels" in encoded_inputs:
+ encoded_inputs["labels"] = encoded_inputs["labels"] + [self.pad_token_label] * difference
+ if "special_tokens_mask" in encoded_inputs:
+ encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
+ encoded_inputs[self.model_input_names[0]] = required_input + [self.pad_token_id] * difference
+ elif self.padding_side == "left":
+ if return_attention_mask:
+ encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
+ if "token_type_ids" in encoded_inputs:
+ encoded_inputs["token_type_ids"] = [self.pad_token_type_id] * difference + encoded_inputs[
+ "token_type_ids"
+ ]
+ if "bbox" in encoded_inputs:
+ encoded_inputs["bbox"] = [self.pad_token_box] * difference + encoded_inputs["bbox"]
+ if "labels" in encoded_inputs:
+ encoded_inputs["labels"] = [self.pad_token_label] * difference + encoded_inputs["labels"]
+ if "special_tokens_mask" in encoded_inputs:
+ encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
+ encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input
+ else:
+ raise ValueError("Invalid padding strategy:" + str(self.padding_side))
+
+ return encoded_inputs
+
+ def build_inputs_with_special_tokens(
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+ ) -> List[int]:
+ """
+        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
+        and adding special tokens. A UDOP sequence has the following format:
+
+        - single sequence: `X </s>`
+        - pair of sequences: `A </s> B </s>`
+
+ Args:
+ token_ids_0 (`List[int]`):
+ List of IDs to which the special tokens will be added.
+ token_ids_1 (`List[int]`, *optional*):
+ Optional second list of IDs for sequence pairs.
+
+ Returns:
+ `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
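+
+        Example (illustrative sketch; the ids assume the `microsoft/udop-large` vocabulary used in the tests of
+        this patch, where the `</s>` separator has id 1):
+
+        ```python
+        >>> from transformers import UdopTokenizerFast
+
+        >>> tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+        >>> tokenizer.build_inputs_with_special_tokens([21820, 296])
+        [21820, 296, 1]
+        >>> tokenizer.build_inputs_with_special_tokens([21820], [296])
+        [21820, 1, 296, 1]
+        ```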
+ """
+
+ if token_ids_1 is None:
+ return token_ids_0 + [self.sep_token_id]
+ sep = [self.sep_token_id]
+ return token_ids_0 + sep + token_ids_1 + sep
+
+ def create_token_type_ids_from_sequences(
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+ ) -> List[int]:
+ """
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task. UDOP does not
+        make use of token type ids, therefore a list of zeros is returned.
+
+ Args:
+ token_ids_0 (`List[int]`):
+ List of IDs.
+ token_ids_1 (`List[int]`, *optional*):
+ Optional second list of IDs for sequence pairs.
+
+ Returns:
+ `List[int]`: List of zeros.
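+
+        Example (illustrative sketch; the output depends only on the input lengths, and the checkpoint name
+        follows the tests of this patch):
+
+        ```python
+        >>> from transformers import UdopTokenizerFast
+
+        >>> tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+        >>> tokenizer.create_token_type_ids_from_sequences([21820, 296])
+        [0, 0, 0]
+        >>> tokenizer.create_token_type_ids_from_sequences([21820], [296])
+        [0, 0, 0, 0]
+        ```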
+
+ """
+
+ sep = [self.sep_token_id]
+
+ if token_ids_1 is None:
+ return len(token_ids_0 + sep) * [0]
+ return len(token_ids_0 + sep + token_ids_1 + sep) * [0]
+
+ # Copied from transformers.models.layoutxlm.tokenization_layoutxlm_fast.LayoutXLMTokenizerFast.save_vocabulary
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+ if not self.can_save_slow_tokenizer:
+ raise ValueError(
+ "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
+ "tokenizer."
+ )
+
+ if not os.path.isdir(save_directory):
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory.")
+ return
+ out_vocab_file = os.path.join(
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+ )
+
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
+ copyfile(self.vocab_file, out_vocab_file)
+
+ return (out_vocab_file,)
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index 5c635cf7af2c1c..8f7deb28327abc 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -8341,6 +8341,37 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
+UDOP_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class UdopEncoderModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class UdopForConditionalGeneration(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class UdopModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class UdopPreTrainedModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
class UMT5EncoderModel(metaclass=DummyObject):
_backends = ["torch"]
diff --git a/src/transformers/utils/dummy_sentencepiece_objects.py b/src/transformers/utils/dummy_sentencepiece_objects.py
index 5103626b263d35..33ee907a741f18 100644
--- a/src/transformers/utils/dummy_sentencepiece_objects.py
+++ b/src/transformers/utils/dummy_sentencepiece_objects.py
@@ -219,6 +219,13 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["sentencepiece"])
+class UdopTokenizer(metaclass=DummyObject):
+ _backends = ["sentencepiece"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["sentencepiece"])
+
+
class XGLMTokenizer(metaclass=DummyObject):
_backends = ["sentencepiece"]
diff --git a/src/transformers/utils/dummy_tokenizers_objects.py b/src/transformers/utils/dummy_tokenizers_objects.py
index 5d792a0bbacde6..42b4397622f31d 100644
--- a/src/transformers/utils/dummy_tokenizers_objects.py
+++ b/src/transformers/utils/dummy_tokenizers_objects.py
@@ -408,6 +408,13 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["tokenizers"])
+class UdopTokenizerFast(metaclass=DummyObject):
+ _backends = ["tokenizers"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["tokenizers"])
+
+
class WhisperTokenizerFast(metaclass=DummyObject):
_backends = ["tokenizers"]
diff --git a/tests/models/udop/__init__.py b/tests/models/udop/__init__.py
new file mode 100644
index 00000000000000..e69de29bb2d1d6
diff --git a/tests/models/udop/test_modeling_udop.py b/tests/models/udop/test_modeling_udop.py
new file mode 100644
index 00000000000000..3947da62cc6fe6
--- /dev/null
+++ b/tests/models/udop/test_modeling_udop.py
@@ -0,0 +1,567 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import copy
+import inspect
+import unittest
+
+from huggingface_hub import hf_hub_download
+
+from transformers import UdopConfig, is_torch_available, is_vision_available
+from transformers.testing_utils import (
+ require_sentencepiece,
+ require_tokenizers,
+ require_torch,
+ require_vision,
+ slow,
+ torch_device,
+)
+from transformers.utils import cached_property
+
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, ids_tensor
+from ...test_pipeline_mixin import PipelineTesterMixin
+
+
+if is_torch_available():
+ import torch
+
+ from transformers import UdopEncoderModel, UdopForConditionalGeneration, UdopModel, UdopProcessor
+ from transformers.models.udop.modeling_udop import UDOP_PRETRAINED_MODEL_ARCHIVE_LIST
+
+
+if is_vision_available():
+ from PIL import Image
+
+
+class UdopModelTester:
+ def __init__(
+ self,
+ parent,
+ vocab_size=99,
+ batch_size=13,
+ encoder_seq_length=7,
+ decoder_seq_length=9,
+ # For common tests
+ is_training=True,
+ use_attention_mask=True,
+ use_labels=True,
+ hidden_size=32,
+ num_hidden_layers=5,
+ num_attention_heads=4,
+ d_ff=37,
+ relative_attention_num_buckets=32,
+ dropout_rate=0.1,
+ initializer_factor=0.002,
+ eos_token_id=1,
+ pad_token_id=0,
+ scope=None,
+ decoder_layers=None,
+ range_bbox=1000,
+ decoder_start_token_id=0,
+ ):
+ self.parent = parent
+ self.batch_size = batch_size
+ self.encoder_seq_length = encoder_seq_length
+ self.decoder_seq_length = decoder_seq_length
+ # For common tests
+ self.seq_length = self.decoder_seq_length
+ self.is_training = is_training
+ self.use_attention_mask = use_attention_mask
+ self.use_labels = use_labels
+ self.vocab_size = vocab_size
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.d_ff = d_ff
+ self.relative_attention_num_buckets = relative_attention_num_buckets
+ self.dropout_rate = dropout_rate
+ self.initializer_factor = initializer_factor
+ self.eos_token_id = eos_token_id
+ self.pad_token_id = pad_token_id
+ self.scope = None
+ self.decoder_layers = decoder_layers
+ self.range_bbox = range_bbox
+ self.decoder_start_token_id = decoder_start_token_id
+
+ def prepare_config_and_inputs(self):
+ input_ids = ids_tensor([self.batch_size, self.encoder_seq_length], self.vocab_size)
+ bbox = ids_tensor([self.batch_size, self.encoder_seq_length, 4], self.range_bbox).float()
+ # Ensure that bbox is legal
+ for i in range(bbox.shape[0]):
+ for j in range(bbox.shape[1]):
+ if bbox[i, j, 3] < bbox[i, j, 1]:
+ t = bbox[i, j, 3]
+ bbox[i, j, 3] = bbox[i, j, 1]
+ bbox[i, j, 1] = t
+ if bbox[i, j, 2] < bbox[i, j, 0]:
+ t = bbox[i, j, 2]
+ bbox[i, j, 2] = bbox[i, j, 0]
+ bbox[i, j, 0] = t
+ decoder_input_ids = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size)
+
+ attention_mask = None
+ decoder_attention_mask = None
+ if self.use_attention_mask:
+ attention_mask = ids_tensor([self.batch_size, self.encoder_seq_length], vocab_size=2)
+ decoder_attention_mask = ids_tensor([self.batch_size, self.decoder_seq_length], vocab_size=2)
+
+ lm_labels = None
+ if self.use_labels:
+ lm_labels = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size)
+
+ config = self.get_config()
+
+ return (
+ config,
+ input_ids,
+ bbox,
+ decoder_input_ids,
+ attention_mask,
+ decoder_attention_mask,
+ lm_labels,
+ )
+
+ def get_config(self):
+ return UdopConfig(
+ vocab_size=self.vocab_size,
+ d_model=self.hidden_size,
+ d_ff=self.d_ff,
+ d_kv=self.hidden_size // self.num_attention_heads,
+ num_layers=self.num_hidden_layers,
+ num_decoder_layers=self.decoder_layers,
+ num_heads=self.num_attention_heads,
+ relative_attention_num_buckets=self.relative_attention_num_buckets,
+ dropout_rate=self.dropout_rate,
+ initializer_factor=self.initializer_factor,
+ eos_token_id=self.eos_token_id,
+ bos_token_id=self.pad_token_id,
+ pad_token_id=self.pad_token_id,
+ decoder_start_token_id=self.decoder_start_token_id,
+ )
+
+ def create_and_check_model(
+ self,
+ config,
+ input_ids,
+ bbox,
+ decoder_input_ids,
+ attention_mask,
+ decoder_attention_mask,
+ lm_labels,
+ ):
+ model = UdopModel(config=config)
+ model.to(torch_device)
+ model.eval()
+ result = model(
+ input_ids=input_ids,
+ bbox=bbox,
+ decoder_input_ids=decoder_input_ids,
+ attention_mask=attention_mask,
+ decoder_attention_mask=decoder_attention_mask,
+ )
+ result = model(input_ids=input_ids, bbox=bbox, decoder_input_ids=decoder_input_ids)
+ decoder_output = result.last_hidden_state
+ decoder_past = result.past_key_values
+ encoder_output = result.encoder_last_hidden_state
+
+ self.parent.assertEqual(encoder_output.size(), (self.batch_size, self.encoder_seq_length, self.hidden_size))
+ self.parent.assertEqual(decoder_output.size(), (self.batch_size, self.decoder_seq_length, self.hidden_size))
+ # There should be `num_layers` key value embeddings stored in decoder_past
+ self.parent.assertEqual(len(decoder_past), config.num_layers)
+ # There should be a self attn key, a self attn value, a cross attn key and a cross attn value stored in each decoder_past tuple
+ self.parent.assertEqual(len(decoder_past[0]), 4)
+
+ def create_and_check_with_lm_head(
+ self,
+ config,
+ input_ids,
+ bbox,
+ decoder_input_ids,
+ attention_mask,
+ decoder_attention_mask,
+ lm_labels,
+ ):
+ model = UdopForConditionalGeneration(config=config).to(torch_device).eval()
+ outputs = model(
+ input_ids=input_ids,
+ bbox=bbox,
+ decoder_input_ids=decoder_input_ids,
+ decoder_attention_mask=decoder_attention_mask,
+ labels=lm_labels,
+ )
+ self.parent.assertEqual(len(outputs), 4)
+ self.parent.assertEqual(outputs["logits"].size(), (self.batch_size, self.decoder_seq_length, self.vocab_size))
+ self.parent.assertEqual(outputs["loss"].size(), ())
+
+ def create_and_check_generate_with_past_key_values(
+ self,
+ config,
+ input_ids,
+ bbox,
+ decoder_input_ids,
+ attention_mask,
+ decoder_attention_mask,
+ lm_labels,
+ ):
+ model = UdopForConditionalGeneration(config=config).to(torch_device).eval()
+ torch.manual_seed(0)
+ output_without_past_cache = model.generate(
+ input_ids[:1], bbox=bbox[:1, :, :], num_beams=2, max_length=5, do_sample=True, use_cache=False
+ )
+ torch.manual_seed(0)
+ output_with_past_cache = model.generate(
+ input_ids[:1], bbox=bbox[:1, :, :], num_beams=2, max_length=5, do_sample=True
+ )
+ self.parent.assertTrue(torch.all(output_with_past_cache == output_without_past_cache))
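+
+    # Minimal fp16 forward check, mirroring the encoder-only tester further below; `test_model_fp16_forward`
+    # in `UdopModelTest` relies on a helper with this name.
+    def create_and_check_model_fp16_forward(
+        self,
+        config,
+        input_ids,
+        bbox,
+        decoder_input_ids,
+        attention_mask,
+        decoder_attention_mask,
+        lm_labels,
+    ):
+        model = UdopModel(config=config).to(torch_device).half().eval()
+        output = model(
+            input_ids,
+            bbox=bbox,
+            decoder_input_ids=decoder_input_ids,
+            attention_mask=attention_mask,
+            decoder_attention_mask=decoder_attention_mask,
+        )["last_hidden_state"]
+        self.parent.assertFalse(torch.isnan(output).any().item())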
+
+ def prepare_config_and_inputs_for_common(self):
+ config_and_inputs = self.prepare_config_and_inputs()
+ (
+ config,
+ input_ids,
+ bbox,
+ decoder_input_ids,
+ attention_mask,
+ decoder_attention_mask,
+ lm_labels,
+ ) = config_and_inputs
+
+ inputs_dict = {
+ "input_ids": input_ids,
+ "attention_mask": attention_mask,
+ "bbox": bbox,
+ "decoder_input_ids": decoder_input_ids,
+ "decoder_attention_mask": decoder_attention_mask,
+ "use_cache": False,
+ }
+ return config, inputs_dict
+
+
+@require_torch
+class UdopModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ all_model_classes = (
+ (
+ UdopModel,
+ UdopForConditionalGeneration,
+ )
+ if is_torch_available()
+ else ()
+ )
+ all_generative_model_classes = (UdopForConditionalGeneration,) if is_torch_available() else ()
+ pipeline_model_mapping = {"feature-extraction": UdopModel} if is_torch_available() else {}
+ fx_compatible = False
+ test_pruning = False
+ test_torchscript = False
+ test_head_masking = False
+ test_resize_embeddings = True
+ test_model_parallel = False
+ is_encoder_decoder = True
+ # The small UDOP model needs higher percentages for CPU/MP tests
+ model_split_percents = [0.8, 0.9]
+
+ def setUp(self):
+ self.model_tester = UdopModelTester(self)
+ self.config_tester = ConfigTester(self, config_class=UdopConfig, d_model=37)
+
+ def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
+ inputs_dict = copy.deepcopy(inputs_dict)
+ if model_class.__name__ == "UdopForConditionalGeneration":
+ if return_labels:
+ inputs_dict["labels"] = torch.zeros(
+ (self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
+ )
+
+ return inputs_dict
+
+ def test_config(self):
+ self.config_tester.run_common_tests()
+
+ def test_model(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_model(*config_and_inputs)
+
+ def test_with_lm_head(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_with_lm_head(*config_and_inputs)
+
+ def test_generate_with_past_key_values(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_generate_with_past_key_values(*config_and_inputs)
+
+    @unittest.skipIf(torch_device == "cpu", "Can't do half precision")
+ def test_model_fp16_forward(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_model_fp16_forward(*config_and_inputs)
+
+ @unittest.skip("Gradient checkpointing is not supported by this model")
+ def test_training_gradient_checkpointing(self):
+ pass
+
+ @unittest.skip(
+        reason="This architecture seems to not compute gradients properly when using GC, check: https://github.com/huggingface/transformers/pull/27124"
+ )
+ def test_training_gradient_checkpointing_use_reentrant(self):
+ pass
+
+ @unittest.skip(
+        reason="This architecture seems to not compute gradients properly when using GC, check: https://github.com/huggingface/transformers/pull/27124"
+ )
+ def test_training_gradient_checkpointing_use_reentrant_false(self):
+ pass
+
+ def test_forward_signature(self):
+ config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+ for model_class in self.all_model_classes:
+ model = model_class(config)
+ signature = inspect.signature(model.forward)
+ # signature.parameters is an OrderedDict => so arg_names order is deterministic
+ arg_names = sorted([*signature.parameters.keys()])
+
+ expected_arg_names = [
+ "attention_mask",
+ "bbox",
+ "cross_attn_head_mask",
+ "decoder_attention_mask",
+ "decoder_head_mask",
+ "decoder_input_ids",
+ "decoder_inputs_embeds",
+ "encoder_outputs",
+ "head_mask",
+ "input_ids",
+ "inputs_embeds",
+ ]
+ if model_class in self.all_generative_model_classes:
+ expected_arg_names.append(
+ "labels",
+ )
+ expected_arg_names = sorted(expected_arg_names)
+ self.assertListEqual(sorted(arg_names[: len(expected_arg_names)]), expected_arg_names)
+
+ @unittest.skip(
+ "Not currently compatible. Fails with - NotImplementedError: Cannot copy out of meta tensor; no data!"
+ )
+ def test_save_load_low_cpu_mem_usage(self):
+ pass
+
+ @slow
+ def test_model_from_pretrained(self):
+ for model_name in UDOP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
+ model = UdopForConditionalGeneration.from_pretrained(model_name)
+ self.assertIsNotNone(model)
+
+
+class UdopEncoderOnlyModelTester:
+ def __init__(
+ self,
+ parent,
+ vocab_size=99,
+ batch_size=13,
+ seq_length=7,
+ # For common tests
+ is_training=False,
+ use_attention_mask=True,
+ hidden_size=32,
+ num_hidden_layers=5,
+ decoder_layers=2,
+ num_attention_heads=4,
+ d_ff=37,
+ relative_attention_num_buckets=32,
+ dropout_rate=0.1,
+ initializer_factor=0.002,
+ eos_token_id=1,
+ pad_token_id=0,
+ scope=None,
+ range_bbox=1000,
+ ):
+ self.parent = parent
+ self.batch_size = batch_size
+ # For common tests
+ self.seq_length = seq_length
+ self.is_training = is_training
+ self.use_attention_mask = use_attention_mask
+ self.vocab_size = vocab_size
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.decoder_layers = decoder_layers
+ self.num_attention_heads = num_attention_heads
+ self.d_ff = d_ff
+ self.relative_attention_num_buckets = relative_attention_num_buckets
+ self.dropout_rate = dropout_rate
+ self.initializer_factor = initializer_factor
+ self.eos_token_id = eos_token_id
+ self.pad_token_id = pad_token_id
+ self.scope = None
+ self.range_bbox = range_bbox
+
+ def get_config(self):
+ return UdopConfig(
+ vocab_size=self.vocab_size,
+ d_model=self.hidden_size,
+ d_ff=self.d_ff,
+ d_kv=self.hidden_size // self.num_attention_heads,
+ num_layers=self.num_hidden_layers,
+ num_decoder_layers=self.decoder_layers,
+ num_heads=self.num_attention_heads,
+ relative_attention_num_buckets=self.relative_attention_num_buckets,
+ dropout_rate=self.dropout_rate,
+ initializer_factor=self.initializer_factor,
+ eos_token_id=self.eos_token_id,
+ bos_token_id=self.pad_token_id,
+ pad_token_id=self.pad_token_id,
+ is_encoder_decoder=False,
+ )
+
+ def prepare_config_and_inputs(self):
+ input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+ bbox = ids_tensor([self.batch_size, self.seq_length, 4], self.range_bbox).float()
+ # Ensure that bbox is legal
+ for i in range(bbox.shape[0]):
+ for j in range(bbox.shape[1]):
+ if bbox[i, j, 3] < bbox[i, j, 1]:
+ t = bbox[i, j, 3]
+ bbox[i, j, 3] = bbox[i, j, 1]
+ bbox[i, j, 1] = t
+ if bbox[i, j, 2] < bbox[i, j, 0]:
+ t = bbox[i, j, 2]
+ bbox[i, j, 2] = bbox[i, j, 0]
+ bbox[i, j, 0] = t
+
+ attention_mask = None
+ if self.use_attention_mask:
+ attention_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
+
+ config = self.get_config()
+
+ return (
+ config,
+ input_ids,
+ bbox,
+ attention_mask,
+ )
+
+ def prepare_config_and_inputs_for_common(self):
+ config_and_inputs = self.prepare_config_and_inputs()
+ (
+ config,
+ input_ids,
+ bbox,
+ attention_mask,
+ ) = config_and_inputs
+
+ inputs_dict = {
+ "input_ids": input_ids,
+ "bbox": bbox,
+ "attention_mask": attention_mask,
+ }
+ return config, inputs_dict
+
+ def create_and_check_model(
+ self,
+ config,
+ input_ids,
+ bbox,
+ attention_mask,
+ ):
+ model = UdopEncoderModel(config=config)
+ model.to(torch_device)
+ model.eval()
+ result = model(
+ input_ids=input_ids,
+ bbox=bbox,
+ attention_mask=attention_mask,
+ )
+ encoder_output = result.last_hidden_state
+
+ self.parent.assertEqual(encoder_output.size(), (self.batch_size, self.seq_length, self.hidden_size))
+
+    def create_and_check_model_fp16_forward(
+        self,
+        config,
+        input_ids,
+        bbox,
+        attention_mask,
+    ):
+        model = UdopEncoderModel(config=config).to(torch_device).half().eval()
+        output = model(input_ids, bbox=bbox, attention_mask=attention_mask)["last_hidden_state"]
+ self.parent.assertFalse(torch.isnan(output).any().item())
+
+
+@require_torch
+class UdopEncoderOnlyModelTest(ModelTesterMixin, unittest.TestCase):
+ all_model_classes = (UdopEncoderModel,) if is_torch_available() else ()
+ test_pruning = False
+ test_torchscript = False
+ test_head_masking = False
+ test_resize_embeddings = False
+ test_model_parallel = True
+ all_parallelizable_model_classes = (UdopEncoderModel,) if is_torch_available() else ()
+
+ def setUp(self):
+ self.model_tester = UdopEncoderOnlyModelTester(self)
+ self.config_tester = ConfigTester(self, config_class=UdopConfig, d_model=37)
+
+ def test_config(self):
+ self.config_tester.run_common_tests()
+
+ def test_model(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_model(*config_and_inputs)
+
+    @unittest.skipIf(torch_device == "cpu", "Can't do half precision")
+ def test_model_fp16_forward(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_model_fp16_forward(*config_and_inputs)
+
+ @unittest.skip(
+ "Not currently compatible. Fails with - NotImplementedError: Cannot copy out of meta tensor; no data!"
+ )
+ def test_save_load_low_cpu_mem_usage(self):
+ pass
+
+
+@require_torch
+@require_sentencepiece
+@require_tokenizers
+@require_vision
+@slow
+class UdopModelIntegrationTests(unittest.TestCase):
+ @cached_property
+ def image(self):
+ filepath = hf_hub_download(
+ repo_id="hf-internal-testing/fixtures_docvqa", filename="document_2.png", repo_type="dataset"
+ )
+ image = Image.open(filepath).convert("RGB")
+
+ return image
+
+ @cached_property
+ def processor(self):
+ return UdopProcessor.from_pretrained("microsoft/udop-large")
+
+ @cached_property
+ def model(self):
+ return UdopForConditionalGeneration.from_pretrained("microsoft/udop-large").to(torch_device)
+
+ def test_conditional_generation(self):
+ processor = self.processor
+ model = self.model
+
+ prompt = "Question answering. In which year is the report made?"
+ encoding = processor(images=self.image, text=prompt, return_tensors="pt")
+
+ predicted_ids = model.generate(**encoding)
+
+ predicted_text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
+        self.assertEqual(predicted_text, "2013")
diff --git a/tests/models/udop/test_processor_udop.py b/tests/models/udop/test_processor_udop.py
new file mode 100644
index 00000000000000..05855991b185ea
--- /dev/null
+++ b/tests/models/udop/test_processor_udop.py
@@ -0,0 +1,508 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+import shutil
+import tempfile
+import unittest
+from typing import List
+
+import numpy as np
+
+from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast
+from transformers.models.udop import UdopTokenizer, UdopTokenizerFast
+from transformers.testing_utils import (
+ require_pytesseract,
+ require_sentencepiece,
+ require_tokenizers,
+ require_torch,
+ slow,
+)
+from transformers.utils import FEATURE_EXTRACTOR_NAME, cached_property, is_pytesseract_available, is_torch_available
+
+
+if is_torch_available():
+ import torch
+
+
+if is_pytesseract_available():
+ from PIL import Image
+
+ from transformers import LayoutLMv3ImageProcessor, UdopProcessor
+
+
+@require_pytesseract
+@require_sentencepiece
+@require_tokenizers
+class UdopProcessorTest(unittest.TestCase):
+ tokenizer_class = UdopTokenizer
+ rust_tokenizer_class = UdopTokenizerFast
+ maxDiff = None
+
+ def setUp(self):
+ image_processor_map = {
+ "do_resize": True,
+ "size": 224,
+ "apply_ocr": True,
+ }
+
+ self.tmpdirname = tempfile.mkdtemp()
+ self.feature_extraction_file = os.path.join(self.tmpdirname, FEATURE_EXTRACTOR_NAME)
+ with open(self.feature_extraction_file, "w", encoding="utf-8") as fp:
+ fp.write(json.dumps(image_processor_map) + "\n")
+
+ self.tokenizer_pretrained_name = "microsoft/udop-large"
+
+ def get_tokenizer(self, **kwargs) -> PreTrainedTokenizer:
+ return self.tokenizer_class.from_pretrained(self.tokenizer_pretrained_name, **kwargs)
+
+ def get_rust_tokenizer(self, **kwargs) -> PreTrainedTokenizerFast:
+ return self.rust_tokenizer_class.from_pretrained(self.tokenizer_pretrained_name, **kwargs)
+
+ def get_tokenizers(self, **kwargs) -> List[PreTrainedTokenizerBase]:
+ return [self.get_tokenizer(**kwargs), self.get_rust_tokenizer(**kwargs)]
+
+ def get_image_processor(self, **kwargs):
+ return LayoutLMv3ImageProcessor.from_pretrained(self.tmpdirname, **kwargs)
+
+ def tearDown(self):
+ shutil.rmtree(self.tmpdirname)
+
+ def prepare_image_inputs(self):
+        """This function prepares a list of PIL images."""
+
+ image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
+
+ image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
+
+ return image_inputs
+
+ def test_save_load_pretrained_default(self):
+ image_processor = self.get_image_processor()
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ processor.save_pretrained(self.tmpdirname)
+ processor = UdopProcessor.from_pretrained(self.tmpdirname)
+
+ self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
+ self.assertIsInstance(processor.tokenizer, (UdopTokenizer, UdopTokenizerFast))
+
+ self.assertEqual(processor.image_processor.to_json_string(), image_processor.to_json_string())
+ self.assertIsInstance(processor.image_processor, LayoutLMv3ImageProcessor)
+
+ def test_save_load_pretrained_additional_features(self):
+ processor = UdopProcessor(image_processor=self.get_image_processor(), tokenizer=self.get_tokenizer())
+ processor.save_pretrained(self.tmpdirname)
+
+ # slow tokenizer
+ tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
+ image_processor_add_kwargs = self.get_image_processor(do_resize=False, size=30)
+
+ processor = UdopProcessor.from_pretrained(
+ self.tmpdirname,
+ use_fast=False,
+ bos_token="(BOS)",
+ eos_token="(EOS)",
+ do_resize=False,
+ size=30,
+ )
+
+ self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
+ self.assertIsInstance(processor.tokenizer, UdopTokenizer)
+
+ self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
+ self.assertIsInstance(processor.image_processor, LayoutLMv3ImageProcessor)
+
+ # fast tokenizer
+ tokenizer_add_kwargs = self.get_rust_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
+ image_processor_add_kwargs = self.get_image_processor(do_resize=False, size=30)
+
+ processor = UdopProcessor.from_pretrained(
+ self.tmpdirname, use_xlm=True, bos_token="(BOS)", eos_token="(EOS)", do_resize=False, size=30
+ )
+
+ self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
+ self.assertIsInstance(processor.tokenizer, UdopTokenizerFast)
+
+ self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
+ self.assertIsInstance(processor.image_processor, LayoutLMv3ImageProcessor)
+
+ def test_model_input_names(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = UdopProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+ input_str = "lower newer"
+ image_input = self.prepare_image_inputs()
+
+ inputs = processor(text=input_str, images=image_input)
+
+ self.assertListEqual(list(inputs.keys()), processor.model_input_names)
+
+ def test_text_target(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = UdopProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+ text = "hello world"
+ expected_decoding = "hello world"
+
+ encoding_processor = processor(text_target=text)
+ encoding_tokenizer = tokenizer(text_target=text)
+
+ self.assertListEqual(encoding_processor["input_ids"], [21820, 296, 1])
+ self.assertListEqual(encoding_processor["attention_mask"], [1, 1, 1])
+ self.assertDictEqual(dict(encoding_processor), dict(encoding_tokenizer))
+ self.assertEqual(tokenizer.decode(encoding_processor["input_ids"]), expected_decoding)
+
+ @slow
+ def test_overflowing_tokens(self):
+ # In the case of overflowing tokens, test that we still have 1-to-1 mapping between the images and input_ids (sequences that are too long are broken down into multiple sequences).
+
+ from datasets import load_dataset
+
+ # set up
+ datasets = load_dataset("nielsr/funsd")
+ processor = UdopProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
+
+ def preprocess_data(examples):
+ images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
+ words = examples["words"]
+ boxes = examples["bboxes"]
+ word_labels = examples["ner_tags"]
+ encoded_inputs = processor(
+ images,
+ words,
+ boxes=boxes,
+ word_labels=word_labels,
+ max_length=512,
+ padding="max_length",
+ truncation=True,
+ return_overflowing_tokens=True,
+ stride=50,
+ return_offsets_mapping=True,
+ return_tensors="pt",
+ )
+ return encoded_inputs
+
+ train_data = preprocess_data(datasets["train"])
+
+ self.assertEqual(len(train_data["pixel_values"]), len(train_data["input_ids"]))
+
+
+# different use cases tests
+@require_sentencepiece
+@require_torch
+@require_pytesseract
+class UdopProcessorIntegrationTests(unittest.TestCase):
+ @cached_property
+ def get_images(self):
+ # we verify our implementation on 2 document images from the DocVQA dataset
+ from datasets import load_dataset
+
+ ds = load_dataset("hf-internal-testing/fixtures_docvqa", split="test")
+
+ image_1 = Image.open(ds[0]["file"]).convert("RGB")
+ image_2 = Image.open(ds[1]["file"]).convert("RGB")
+
+ return image_1, image_2
+
+ @cached_property
+ def get_tokenizers(self):
+ slow_tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")
+ fast_tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+ return [slow_tokenizer, fast_tokenizer]
+
+ @slow
+ def test_processor_case_1(self):
+ # case 1: document image classification (training, inference) + token classification (inference), apply_ocr = True
+
+ image_processor = LayoutLMv3ImageProcessor()
+ tokenizers = self.get_tokenizers
+ images = self.get_images
+
+ for tokenizer in tokenizers:
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ # not batched
+ input_image_processor = image_processor(images[0], return_tensors="pt")
+ input_processor = processor(images[0], return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify pixel_values
+ self.assertTrue(
+ torch.allclose(input_image_processor["pixel_values"], input_processor["pixel_values"], atol=1e-2)
+ )
+
+ # verify input_ids
+ # this was obtained with Tesseract 4.1.1
+ # fmt: off
+ expected_decoding = "11:14 to 11:39 a.m 11:39 to 11:44 a.m. 11:44 a.m. to 12:25 p.m. 12:25 to 12:58 p.m. 12:58 to 4:00 p.m. 2:00 to 5:00 p.m. Coffee Break Coffee will be served for men and women in the lobby adjacent to exhibit area. Please move into exhibit area. (Exhibits Open) TRRF GENERAL SESSION (PART |) Presiding: Lee A. Waller TRRF Vice President “Introductory Remarks” Lee A. Waller, TRRF Vice Presi- dent Individual Interviews with TRRF Public Board Members and Sci- entific Advisory Council Mem- bers Conducted by TRRF Treasurer Philip G. Kuehn to get answers which the public refrigerated warehousing industry is looking for. Plus questions from the floor. Dr. Emil M. Mrak, University of Cal- ifornia, Chairman, TRRF Board; Sam R. Cecil, University of Georgia College of Agriculture; Dr. Stanley Charm, Tufts University School of Medicine; Dr. Robert H. Cotton, ITT Continental Baking Company; Dr. Owen Fennema, University of Wis- consin; Dr. Robert E. Hardenburg, USDA. Questions and Answers Exhibits Open Capt. Jack Stoney Room TRRF Scientific Advisory Council Meeting Ballroom Foyer" # noqa: E231
+ # fmt: on
+ decoding = processor.decode(input_processor.input_ids.squeeze().tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # batched
+ input_image_processor = image_processor(images, return_tensors="pt")
+ input_processor = processor(images, padding=True, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify pixel_values
+ self.assertTrue(
+ torch.allclose(input_image_processor["pixel_values"], input_processor["pixel_values"], atol=1e-2)
+ )
+
+ # verify input_ids
+ # this was obtained with Tesseract 4.1.1
+ # fmt: off
+ expected_decoding = "7 ITC Limited REPORT AND ACCOUNTS 2013 ITC’s Brands: An Asset for the Nation The consumer needs and aspirations they fulfil, the benefit they generate for millions across ITC’s value chains, the future-ready capabilities that support them, and the value that they create for the country, have made ITC’s brands national assets, adding to India’s competitiveness. It is ITC’s aspiration to be the No 1 FMCG player in the country, driven by its new FMCG businesses. A recent Nielsen report has highlighted that ITC's new FMCG businesses are the fastest growing among the top consumer goods companies operating in India. ITC takes justifiable pride that, along with generating economic value, these celebrated Indian brands also drive the creation of larger societal capital through the virtuous cycle of sustainable and inclusive growth. DI WILLS * ; LOVE DELIGHTFULLY SOFT SKIN? aia Ans Source: https://www.industrydocuments.ucsf.edu/docs/snbx0223" # noqa: E231
+ # fmt: on
+ decoding = processor.decode(input_processor.input_ids[1].tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ @slow
+ def test_processor_case_2(self):
+ # case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False
+
+ image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
+ tokenizers = self.get_tokenizers
+ images = self.get_images
+
+ for tokenizer in tokenizers:
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ # not batched
+ words = ["hello", "world"]
+ boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]
+ input_processor = processor(images[0], words, boxes=boxes, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = list(input_processor.keys())
+ for key in expected_keys:
+ self.assertIn(key, actual_keys)
+
+ # verify input_ids
+ expected_decoding = "hello world"
+ decoding = processor.decode(input_processor.input_ids.squeeze().tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # batched
+ words = [["hello", "world"], ["my", "name", "is", "niels"]]
+ boxes = [[[1, 2, 3, 4], [5, 6, 7, 8]], [[3, 2, 5, 1], [6, 7, 4, 2], [3, 9, 2, 4], [1, 1, 2, 3]]]
+ input_processor = processor(images, words, boxes=boxes, padding=True, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ expected_decoding = "hello world"
+ decoding = processor.decode(input_processor.input_ids[0].tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # verify bbox
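+            # subword tokens inherit the bounding box of the word they belong to, and the final
+            # special token gets the box [1000, 1000, 1000, 1000]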
+ expected_bbox = [
+ [3, 2, 5, 1],
+ [6, 7, 4, 2],
+ [3, 9, 2, 4],
+ [1, 1, 2, 3],
+ [1, 1, 2, 3],
+ [1, 1, 2, 3],
+ [1000, 1000, 1000, 1000],
+ ]
+ self.assertListEqual(input_processor.bbox[1].tolist(), expected_bbox)
+
+ @slow
+ def test_processor_case_3(self):
+ # case 3: token classification (training), apply_ocr=False
+
+ image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
+ tokenizers = self.get_tokenizers
+ images = self.get_images
+
+ for tokenizer in tokenizers:
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ # not batched
+ words = ["weirdly", "world"]
+ boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]
+ word_labels = [1, 2]
+ input_processor = processor(images[0], words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "labels", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ expected_decoding = "weirdly world"
+ decoding = processor.decode(input_processor.input_ids.squeeze().tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # verify labels
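+            # only the first subword token of each word keeps its label; the remaining subword tokens
+            # and the special tokens are set to -100 so they are ignored by the loss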
+ expected_labels = [1, -100, 2, -100]
+ self.assertListEqual(input_processor.labels.squeeze().tolist(), expected_labels)
+
+ # batched
+ words = [["hello", "world"], ["my", "name", "is", "niels"]]
+ boxes = [[[1, 2, 3, 4], [5, 6, 7, 8]], [[3, 2, 5, 1], [6, 7, 4, 2], [3, 9, 2, 4], [1, 1, 2, 3]]]
+ word_labels = [[1, 2], [6, 3, 10, 2]]
+ input_processor = processor(
+ images, words, boxes=boxes, word_labels=word_labels, padding=True, return_tensors="pt"
+ )
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "labels", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ expected_decoding = "my name is niels"
+ decoding = processor.decode(input_processor.input_ids[1].tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # verify bbox
+ expected_bbox = [
+ [3, 2, 5, 1],
+ [6, 7, 4, 2],
+ [3, 9, 2, 4],
+ [1, 1, 2, 3],
+ [1, 1, 2, 3],
+ [1, 1, 2, 3],
+ [1000, 1000, 1000, 1000],
+ ]
+ self.assertListEqual(input_processor.bbox[1].tolist(), expected_bbox)
+
+ # verify labels
+ expected_labels = [6, 3, 10, 2, -100, -100, -100]
+ self.assertListEqual(input_processor.labels[1].tolist(), expected_labels)
+
+ @slow
+ def test_processor_case_4(self):
+ # case 4: visual question answering (inference), apply_ocr=True
+
+ image_processor = LayoutLMv3ImageProcessor()
+ tokenizers = self.get_tokenizers
+ images = self.get_images
+
+ for tokenizer in tokenizers:
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ # not batched
+ question = "What's his name?"
+ input_processor = processor(images[0], question, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ # this was obtained with Tesseract 4.1.1
+ # fmt: off
+ expected_decoding = "What's his name? 11:14 to 11:39 a.m 11:39 to 11:44 a.m. 11:44 a.m. to 12:25 p.m. 12:25 to 12:58 p.m. 12:58 to 4:00 p.m. 2:00 to 5:00 p.m. Coffee Break Coffee will be served for men and women in the lobby adjacent to exhibit area. Please move into exhibit area. (Exhibits Open) TRRF GENERAL SESSION (PART |) Presiding: Lee A. Waller TRRF Vice President “Introductory Remarks” Lee A. Waller, TRRF Vice Presi- dent Individual Interviews with TRRF Public Board Members and Sci- entific Advisory Council Mem- bers Conducted by TRRF Treasurer Philip G. Kuehn to get answers which the public refrigerated warehousing industry is looking for. Plus questions from the floor. Dr. Emil M. Mrak, University of Cal- ifornia, Chairman, TRRF Board; Sam R. Cecil, University of Georgia College of Agriculture; Dr. Stanley Charm, Tufts University School of Medicine; Dr. Robert H. Cotton, ITT Continental Baking Company; Dr. Owen Fennema, University of Wis- consin; Dr. Robert E. Hardenburg, USDA. Questions and Answers Exhibits Open Capt. Jack Stoney Room TRRF Scientific Advisory Council Meeting Ballroom Foyer" # noqa: E231
+ # fmt: on
+ decoding = processor.decode(input_processor.input_ids.squeeze().tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # batched
+ questions = ["How old is he?", "what's the time"]
+ input_processor = processor(
+ images, questions, padding="max_length", max_length=20, truncation=True, return_tensors="pt"
+ )
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ # this was obtained with Tesseract 4.1.1
+ expected_decoding = "what's the time 7 ITC Limited REPORT AND ACCOUNTS 2013 I"
+ decoding = processor.decode(input_processor.input_ids[1].tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # verify bbox
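+            # in the expected boxes, the question tokens get [0, 0, 0, 0], special tokens get
+            # [1000, 1000, 1000, 1000], and the document tokens keep their OCR-detected boxes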
+ # fmt: off
+ expected_bbox = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1000, 1000, 1000, 1000], [0, 45, 67, 80], [72, 56, 109, 67], [72, 56, 109, 67], [116, 56, 189, 67], [198, 59, 253, 66], [198, 59, 253, 66], [257, 59, 285, 66], [289, 59, 365, 66], [289, 59, 365, 66], [289, 59, 365, 66], [289, 59, 365, 66], [372, 59, 407, 66], [74, 136, 161, 158], [1000, 1000, 1000, 1000]] # noqa: E231
+ # fmt: on
+ self.assertListEqual(input_processor.bbox[1].tolist(), expected_bbox)
+
+ @slow
+ def test_processor_case_5(self):
+ # case 5: visual question answering (inference), apply_ocr=False
+
+ image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
+ tokenizers = self.get_tokenizers
+ images = self.get_images
+
+ for tokenizer in tokenizers:
+ processor = UdopProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ # not batched
+ question = "What's his name?"
+ words = ["hello", "world"]
+ boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]
+ input_processor = processor(images[0], question, words, boxes, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ expected_decoding = "What's his name? hello world"
+ decoding = processor.decode(input_processor.input_ids.squeeze().tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # batched
+ questions = ["How old is he?", "what's the time"]
+ words = [["hello", "world"], ["my", "name", "is", "niels"]]
+ boxes = [[[1, 2, 3, 4], [5, 6, 7, 8]], [[3, 2, 5, 1], [6, 7, 4, 2], [3, 9, 2, 4], [1, 1, 2, 3]]]
+ input_processor = processor(images, questions, words, boxes, padding=True, return_tensors="pt")
+
+ # verify keys
+ expected_keys = ["attention_mask", "bbox", "input_ids", "pixel_values"]
+ actual_keys = sorted(input_processor.keys())
+ self.assertListEqual(actual_keys, expected_keys)
+
+ # verify input_ids
+ expected_decoding = "How old is he? hello world"
+ decoding = processor.decode(input_processor.input_ids[0].tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ expected_decoding = "what's the time my name is niels"
+ decoding = processor.decode(input_processor.input_ids[1].tolist())
+ self.assertSequenceEqual(decoding, expected_decoding)
+
+ # verify bbox
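+            # only the tail of the second example is checked: the last document boxes followed by
+            # the [1000, 1000, 1000, 1000] box of the final special token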
+ expected_bbox = [[3, 9, 2, 4], [1, 1, 2, 3], [1, 1, 2, 3], [1, 1, 2, 3], [1000, 1000, 1000, 1000]]
+ self.assertListEqual(input_processor.bbox[1].tolist()[-5:], expected_bbox)
diff --git a/tests/models/udop/test_tokenization_udop.py b/tests/models/udop/test_tokenization_udop.py
new file mode 100644
index 00000000000000..e9d41c5b77a872
--- /dev/null
+++ b/tests/models/udop/test_tokenization_udop.py
@@ -0,0 +1,1886 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+import shutil
+import tempfile
+import unittest
+from typing import List
+
+from transformers import (
+ AddedToken,
+ SpecialTokensMixin,
+ UdopTokenizerFast,
+ is_tf_available,
+ is_torch_available,
+ logging,
+)
+from transformers.models.udop.tokenization_udop import UdopTokenizer
+from transformers.testing_utils import (
+ get_tests_dir,
+ is_pt_tf_cross_test,
+ require_pandas,
+ require_sentencepiece,
+ require_tokenizers,
+ require_torch,
+ slow,
+)
+
+from ...test_tokenization_common import (
+ SMALL_TRAINING_CORPUS,
+ TokenizerTesterMixin,
+ filter_non_english,
+ merge_model_tokenizer_mappings,
+)
+
+
+logger = logging.get_logger(__name__)
+SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")
+
+
+@require_sentencepiece
+@require_tokenizers
+@require_pandas
+class UdopTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
+ tokenizer_class = UdopTokenizer
+ rust_tokenizer_class = UdopTokenizerFast
+ test_rust_tokenizer = True
+ from_pretrained_filter = filter_non_english
+ test_seq2seq = False
+ test_sentencepiece = True
+
+ def get_words_and_boxes(self):
+ words = ["a", "weirdly", "test", "hello"]
+ boxes = [[423, 237, 440, 251], [427, 272, 441, 287], [419, 115, 437, 129], [961, 885, 992, 912]]
+
+ return words, boxes
+
+ def get_words_and_boxes_batch(self):
+ words = [["a", "weirdly", "test"], ["hello", "my", "name", "is", "bob"]]
+ boxes = [
+ [[423, 237, 440, 251], [427, 272, 441, 287], [419, 115, 437, 129]],
+ [[961, 885, 992, 912], [256, 38, 330, 58], [256, 38, 330, 58], [336, 42, 353, 57], [34, 42, 66, 69]],
+ ]
+
+ return words, boxes
+
+ def get_question_words_and_boxes(self):
+ question = "what's his name?"
+ words = ["a", "weirdly", "test"]
+ boxes = [[423, 237, 440, 251], [427, 272, 441, 287], [419, 115, 437, 129]]
+
+ return question, words, boxes
+
+ def get_question_words_and_boxes_batch(self):
+ questions = ["what's his name?", "how is he called?"]
+ words = [["a", "weirdly", "test"], ["what", "a", "laif", "gastn"]]
+ boxes = [
+ [[423, 237, 440, 251], [427, 272, 441, 287], [419, 115, 437, 129]],
+ [[256, 38, 330, 58], [256, 38, 330, 58], [336, 42, 353, 57], [34, 42, 66, 69]],
+ ]
+
+ return questions, words, boxes
+
+ def setUp(self):
+ super().setUp()
+
+ # We have a SentencePiece fixture for testing
+ tokenizer = UdopTokenizer(SAMPLE_VOCAB, keep_accents=True)
+ tokenizer.save_pretrained(self.tmpdirname)
+
+ def get_input_output_texts(self, tokenizer):
+ input_text = "UNwant\u00E9d,running"
+ output_text = "unwanted, running"
+ return input_text, output_text
+
+    # override test in `test_tokenization_common.py` because of the required input format of the `__call__` method of
+    # this tokenizer
+ def test_save_sentencepiece_tokenizer(self) -> None:
+ if not self.test_sentencepiece or not self.test_slow_tokenizer:
+ return
+ # We want to verify that we will be able to save the tokenizer even if the original files that were used to
+ # build the tokenizer have been deleted in the meantime.
+ words, boxes = self.get_words_and_boxes()
+
+ tokenizer_slow_1 = self.get_tokenizer()
+ encoding_tokenizer_slow_1 = tokenizer_slow_1(
+ words,
+ boxes=boxes,
+ )
+
+ tmpdirname_1 = tempfile.mkdtemp()
+ tmpdirname_2 = tempfile.mkdtemp()
+
+ tokenizer_slow_1.save_pretrained(tmpdirname_1)
+ tokenizer_slow_2 = self.tokenizer_class.from_pretrained(tmpdirname_1)
+ encoding_tokenizer_slow_2 = tokenizer_slow_2(
+ words,
+ boxes=boxes,
+ )
+
+ shutil.rmtree(tmpdirname_1)
+ tokenizer_slow_2.save_pretrained(tmpdirname_2)
+
+ tokenizer_slow_3 = self.tokenizer_class.from_pretrained(tmpdirname_2)
+ encoding_tokenizer_slow_3 = tokenizer_slow_3(
+ words,
+ boxes=boxes,
+ )
+ shutil.rmtree(tmpdirname_2)
+
+ self.assertEqual(encoding_tokenizer_slow_1, encoding_tokenizer_slow_2)
+ self.assertEqual(encoding_tokenizer_slow_1, encoding_tokenizer_slow_3)
+
+ @slow
+ def test_sequence_builders(self):
+ tokenizer = self.tokenizer_class.from_pretrained("microsoft/udop-large")
+
+ question, words, boxes = self.get_question_words_and_boxes()
+
+ text = tokenizer.encode_boxes(
+ question.split(),
+ boxes=[tokenizer.pad_token_box for _ in range(len(question.split()))],
+ add_special_tokens=False,
+ )
+ text_2 = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+
+ encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
+
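+        # pairs are built T5-style: sequence A, </s> (id 1), sequence B, </s>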
+ assert encoded_pair == text + [1] + text_2 + [1]
+
+ def test_add_special_tokens(self):
+ tokenizers: List[UdopTokenizer] = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ special_token = "[SPECIAL_TOKEN]"
+ special_token_box = [1000, 1000, 1000, 1000]
+
+ tokenizer.add_special_tokens({"cls_token": special_token})
+ encoded_special_token = tokenizer.encode_boxes(
+ [special_token], boxes=[special_token_box], add_special_tokens=False
+ )
+ self.assertEqual(len(encoded_special_token), 1)
+
+ decoded = tokenizer.decode(encoded_special_token, skip_special_tokens=True)
+ self.assertTrue(special_token not in decoded)
+
+ def test_add_tokens_tokenizer(self):
+ tokenizers: List[UdopTokenizer] = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ vocab_size = tokenizer.vocab_size
+ all_size = len(tokenizer)
+
+ self.assertNotEqual(vocab_size, 0)
+
+ # We usually have added tokens from the start in tests because our vocab fixtures are
+ # smaller than the original vocabs - let's not assert this
+ # self.assertEqual(vocab_size, all_size)
+
+ new_toks = ["aaaaa", "bbbbbb", "cccccccccdddddddd"]
+ added_toks = tokenizer.add_tokens(new_toks)
+ vocab_size_2 = tokenizer.vocab_size
+ all_size_2 = len(tokenizer)
+
+ self.assertNotEqual(vocab_size_2, 0)
+ self.assertEqual(vocab_size, vocab_size_2)
+ self.assertEqual(added_toks, len(new_toks))
+ self.assertEqual(all_size_2, all_size + len(new_toks))
+
+ words = "aaaaa bbbbbb low cccccccccdddddddd l".split()
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(len(words))]
+
+ tokens = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+
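+                # tokens added on top of the vocabulary get ids beyond the original sentencepiece vocab,
+                # so the encodings of "aaaaa" and "cccccccccdddddddd" must map above vocab_size - 1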
+ self.assertGreaterEqual(len(tokens), 4)
+ self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
+ self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
+
+ new_toks_2 = {"eos_token": ">>>>|||<||<<|<<", "pad_token": "<<<<<|||>|>>>>|>"}
+ added_toks_2 = tokenizer.add_special_tokens(new_toks_2)
+ vocab_size_3 = tokenizer.vocab_size
+ all_size_3 = len(tokenizer)
+
+ self.assertNotEqual(vocab_size_3, 0)
+ self.assertEqual(vocab_size, vocab_size_3)
+ self.assertEqual(added_toks_2, len(new_toks_2))
+ self.assertEqual(all_size_3, all_size_2 + len(new_toks_2))
+
+ words = ">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l".split()
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(len(words))]
+
+ tokens = tokenizer.encode_boxes(
+ words,
+ boxes=boxes,
+ add_special_tokens=False,
+ )
+
+ self.assertGreaterEqual(len(tokens), 6)
+ self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
+ self.assertGreater(tokens[0], tokens[1])
+ self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
+ self.assertGreater(tokens[-2], tokens[-3])
+ self.assertEqual(tokens[0], tokenizer.eos_token_id)
+ self.assertEqual(tokens[-2], tokenizer.pad_token_id)
+
+ @require_tokenizers
+ def test_encode_decode_with_spaces(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+
+ new_toks = [AddedToken("[ABC]", normalized=False), AddedToken("[DEF]", normalized=False)]
+ tokenizer.add_tokens(new_toks)
+ input = "[ABC][DEF][ABC][DEF]"
+ if self.space_between_special_tokens:
+ output = "[ABC] [DEF] [ABC] [DEF]"
+ else:
+ output = input
+ encoded = tokenizer.encode_boxes(input.split(), boxes=boxes, add_special_tokens=False)
+ decoded = tokenizer.decode(encoded, spaces_between_special_tokens=self.space_between_special_tokens)
+ self.assertIn(decoded, [output, output.lower()])
+
+ def test_encode_plus_with_padding(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+
+                # check correct behaviour if no pad_token_id exists and add it if needed
+ self._check_no_pad_token_padding(tokenizer, words)
+
+ padding_size = 10
+ padding_idx = tokenizer.pad_token_id
+
+ encoded_sequence = tokenizer.encode_plus_boxes(words, boxes=boxes, return_special_tokens_mask=True)
+ input_ids = encoded_sequence["input_ids"]
+ special_tokens_mask = encoded_sequence["special_tokens_mask"]
+ sequence_length = len(input_ids)
+
+ # Test 'longest' and 'no_padding' don't do anything
+ tokenizer.padding_side = "right"
+
+ not_padded_sequence = tokenizer.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ padding=False,
+ return_special_tokens_mask=True,
+ )
+ not_padded_input_ids = not_padded_sequence["input_ids"]
+
+ not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"]
+ not_padded_sequence_length = len(not_padded_input_ids)
+
+ self.assertTrue(sequence_length == not_padded_sequence_length)
+ self.assertTrue(input_ids == not_padded_input_ids)
+ self.assertTrue(special_tokens_mask == not_padded_special_tokens_mask)
+
+ not_padded_sequence = tokenizer.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ padding=False,
+ return_special_tokens_mask=True,
+ )
+ not_padded_input_ids = not_padded_sequence["input_ids"]
+
+ not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"]
+ not_padded_sequence_length = len(not_padded_input_ids)
+
+ self.assertTrue(sequence_length == not_padded_sequence_length)
+ self.assertTrue(input_ids == not_padded_input_ids)
+ self.assertTrue(special_tokens_mask == not_padded_special_tokens_mask)
+
+ # Test right padding
+ tokenizer.padding_side = "right"
+
+ right_padded_sequence = tokenizer.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=sequence_length + padding_size,
+ padding="max_length",
+ return_special_tokens_mask=True,
+ )
+ right_padded_input_ids = right_padded_sequence["input_ids"]
+
+ right_padded_special_tokens_mask = right_padded_sequence["special_tokens_mask"]
+ right_padded_sequence_length = len(right_padded_input_ids)
+
+ self.assertTrue(sequence_length + padding_size == right_padded_sequence_length)
+ self.assertTrue(input_ids + [padding_idx] * padding_size == right_padded_input_ids)
+ self.assertTrue(special_tokens_mask + [1] * padding_size == right_padded_special_tokens_mask)
+
+ # Test left padding
+ tokenizer.padding_side = "left"
+ left_padded_sequence = tokenizer.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=sequence_length + padding_size,
+ padding="max_length",
+ return_special_tokens_mask=True,
+ )
+ left_padded_input_ids = left_padded_sequence["input_ids"]
+ left_padded_special_tokens_mask = left_padded_sequence["special_tokens_mask"]
+ left_padded_sequence_length = len(left_padded_input_ids)
+
+ self.assertTrue(sequence_length + padding_size == left_padded_sequence_length)
+ self.assertTrue([padding_idx] * padding_size + input_ids == left_padded_input_ids)
+ self.assertTrue([1] * padding_size + special_tokens_mask == left_padded_special_tokens_mask)
+
+ if "token_type_ids" in tokenizer.model_input_names:
+ token_type_ids = encoded_sequence["token_type_ids"]
+ left_padded_token_type_ids = left_padded_sequence["token_type_ids"]
+ right_padded_token_type_ids = right_padded_sequence["token_type_ids"]
+
+ assert token_type_ids + [0] * padding_size == right_padded_token_type_ids
+ assert [0] * padding_size + token_type_ids == left_padded_token_type_ids
+
+ if "attention_mask" in tokenizer.model_input_names:
+ attention_mask = encoded_sequence["attention_mask"]
+ right_padded_attention_mask = right_padded_sequence["attention_mask"]
+ left_padded_attention_mask = left_padded_sequence["attention_mask"]
+
+ self.assertTrue(attention_mask + [0] * padding_size == right_padded_attention_mask)
+ self.assertTrue([0] * padding_size + attention_mask == left_padded_attention_mask)
+
+ def test_internal_consistency(self):
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+
+ tokens = []
+ for word in words:
+ tokens.extend(tokenizer.tokenize(word))
+ ids = tokenizer.convert_tokens_to_ids(tokens)
+ ids_2 = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ self.assertListEqual(ids, ids_2)
+
+ tokens_2 = tokenizer.convert_ids_to_tokens(ids)
+ self.assertNotEqual(len(tokens_2), 0)
+ text_2 = tokenizer.decode(ids)
+ self.assertIsInstance(text_2, str)
+
+ output_text = "a weirdly test hello"
+ self.assertEqual(text_2, output_text)
+
+ def test_mask_output(self):
+ tokenizers = self.get_tokenizers(fast=False, do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+
+ if (
+ tokenizer.build_inputs_with_special_tokens.__qualname__.split(".")[0] != "PreTrainedTokenizer"
+ and "token_type_ids" in tokenizer.model_input_names
+ ):
+ information = tokenizer.encode_plus_boxes(words, boxes=boxes, add_special_tokens=True)
+ sequences, mask = information["input_ids"], information["token_type_ids"]
+ self.assertEqual(len(sequences), len(mask))
+
+ def test_number_of_added_tokens(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ # test 1: single sequence
+ words, boxes = self.get_words_and_boxes()
+
+ sequences = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ attached_sequences = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=True)
+
+ # Method is implemented (e.g. not GPT-2)
+ if len(attached_sequences) != 2:
+ self.assertEqual(
+ tokenizer.num_special_tokens_to_add(pair=False), len(attached_sequences) - len(sequences)
+ )
+
+ # test 2: two sequences
+ question, words, boxes = self.get_question_words_and_boxes()
+
+ sequences = tokenizer.encode_boxes(question, words, boxes=boxes, add_special_tokens=False)
+ attached_sequences = tokenizer.encode_boxes(question, words, boxes=boxes, add_special_tokens=True)
+
+ # Method is implemented (e.g. not GPT-2)
+ if len(attached_sequences) != 2:
+ self.assertEqual(
+ tokenizer.num_special_tokens_to_add(pair=True), len(attached_sequences) - len(sequences)
+ )
+
+ def test_padding_to_max_length(self):
+ """We keep this test for backward compatibility but it should be removed when `pad_to_max_length` will be deprecated"""
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+ padding_size = 10
+
+                # check correct behaviour if no pad_token_id exists and add it if needed
+ self._check_no_pad_token_padding(tokenizer, words)
+
+ padding_idx = tokenizer.pad_token_id
+
+ # Check that it correctly pads when a maximum length is specified along with the padding flag set to True
+ tokenizer.padding_side = "right"
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes)
+ sequence_length = len(encoded_sequence)
+                # FIXME: the next line should use `padding="max_length"` instead of `pad_to_max_length` to avoid a warning
+ padded_sequence = tokenizer.encode_boxes(
+ words, boxes=boxes, max_length=sequence_length + padding_size, pad_to_max_length=True
+ )
+ padded_sequence_length = len(padded_sequence)
+ assert sequence_length + padding_size == padded_sequence_length
+ assert encoded_sequence + [padding_idx] * padding_size == padded_sequence
+
+ # Check that nothing is done when a maximum length is not specified
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes)
+ sequence_length = len(encoded_sequence)
+
+ tokenizer.padding_side = "right"
+ padded_sequence_right = tokenizer.encode_boxes(words, boxes=boxes, pad_to_max_length=True)
+ padded_sequence_right_length = len(padded_sequence_right)
+ assert sequence_length == padded_sequence_right_length
+ assert encoded_sequence == padded_sequence_right
+
+ def test_padding(self, max_length=50):
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+ self.assertEqual(tokenizer_p.pad_token_id, tokenizer_r.pad_token_id)
+ pad_token_id = tokenizer_p.pad_token_id
+
+ # Encode - Simple input
+ words, boxes = self.get_words_and_boxes()
+ input_r = tokenizer_r.encode_boxes(words, boxes=boxes, max_length=max_length, pad_to_max_length=True)
+ input_p = tokenizer_p.encode_boxes(words, boxes=boxes, max_length=max_length, pad_to_max_length=True)
+ self.assert_padded_input_match(input_r, input_p, max_length, pad_token_id)
+ input_r = tokenizer_r.encode_boxes(words, boxes=boxes, max_length=max_length, padding="max_length")
+ input_p = tokenizer_p.encode_boxes(words, boxes=boxes, max_length=max_length, padding="max_length")
+ self.assert_padded_input_match(input_r, input_p, max_length, pad_token_id)
+
+ input_r = tokenizer_r.encode_boxes(words, boxes=boxes, padding="longest")
+ input_p = tokenizer_p.encode_boxes(words, boxes=boxes, padding=True)
+ self.assert_padded_input_match(input_r, input_p, len(input_r), pad_token_id)
+
+ # Encode - Pair input
+ question, words, boxes = self.get_question_words_and_boxes()
+ input_r = tokenizer_r.encode_boxes(
+ question, words, boxes=boxes, max_length=max_length, pad_to_max_length=True
+ )
+ input_p = tokenizer_p.encode_boxes(
+ question, words, boxes=boxes, max_length=max_length, pad_to_max_length=True
+ )
+ self.assert_padded_input_match(input_r, input_p, max_length, pad_token_id)
+ input_r = tokenizer_r.encode_boxes(
+ question, words, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ input_p = tokenizer_p.encode_boxes(
+ question, words, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ self.assert_padded_input_match(input_r, input_p, max_length, pad_token_id)
+ input_r = tokenizer_r.encode_boxes(question, words, boxes=boxes, padding=True)
+ input_p = tokenizer_p.encode_boxes(question, words, boxes=boxes, padding="longest")
+ self.assert_padded_input_match(input_r, input_p, len(input_r), pad_token_id)
+
+ # Encode_plus - Simple input
+ words, boxes = self.get_words_and_boxes()
+ input_r = tokenizer_r.encode_plus_boxes(
+ words, boxes=boxes, max_length=max_length, pad_to_max_length=True
+ )
+ input_p = tokenizer_p.encode_plus_boxes(
+ words, boxes=boxes, max_length=max_length, pad_to_max_length=True
+ )
+ self.assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length, pad_token_id)
+ self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
+ input_r = tokenizer_r.encode_plus_boxes(
+ words, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ input_p = tokenizer_p.encode_plus_boxes(
+ words, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ self.assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length, pad_token_id)
+ self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
+
+ input_r = tokenizer_r.encode_plus_boxes(words, boxes=boxes, padding="longest")
+ input_p = tokenizer_p.encode_plus_boxes(words, boxes=boxes, padding=True)
+ self.assert_padded_input_match(
+ input_r["input_ids"], input_p["input_ids"], len(input_r["input_ids"]), pad_token_id
+ )
+
+ self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
+
+ # Encode_plus - Pair input
+ question, words, boxes = self.get_question_words_and_boxes()
+ input_r = tokenizer_r.encode_plus_boxes(
+ question, words, boxes=boxes, max_length=max_length, pad_to_max_length=True
+ )
+ input_p = tokenizer_p.encode_plus_boxes(
+ question, words, boxes=boxes, max_length=max_length, pad_to_max_length=True
+ )
+ self.assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length, pad_token_id)
+ self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
+ input_r = tokenizer_r.encode_plus_boxes(
+ question, words, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ input_p = tokenizer_p.encode_plus_boxes(
+ question, words, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ self.assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length, pad_token_id)
+ self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
+ input_r = tokenizer_r.encode_plus_boxes(question, words, boxes=boxes, padding="longest")
+ input_p = tokenizer_p.encode_plus_boxes(question, words, boxes=boxes, padding=True)
+ self.assert_padded_input_match(
+ input_r["input_ids"], input_p["input_ids"], len(input_r["input_ids"]), pad_token_id
+ )
+ self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
+
+ # Batch_encode_plus - Simple input
+ words, boxes = self.get_words_and_boxes_batch()
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=max_length,
+ pad_to_max_length=True,
+ )
+ input_p = tokenizer_p.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=max_length,
+ pad_to_max_length=True,
+ )
+ self.assert_batch_padded_input_match(input_r, input_p, max_length, pad_token_id)
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=max_length,
+ padding="max_length",
+ )
+ input_p = tokenizer_p.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=max_length,
+ padding="max_length",
+ )
+ self.assert_batch_padded_input_match(input_r, input_p, max_length, pad_token_id)
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=max_length,
+ padding="longest",
+ )
+ input_p = tokenizer_p.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=max_length,
+ padding=True,
+ )
+ self.assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]), pad_token_id)
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(words, boxes=boxes, padding="longest")
+ input_p = tokenizer_p.batch_encode_plus_boxes(words, boxes=boxes, padding=True)
+ self.assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]), pad_token_id)
+
+ # Batch_encode_plus - Pair input
+ questions, words, boxes = self.get_question_words_and_boxes_batch()
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ list(zip(questions, words)),
+ is_pair=True,
+ boxes=boxes,
+ max_length=max_length,
+ truncation=True,
+ padding="max_length",
+ )
+ input_p = tokenizer_p.batch_encode_plus_boxes(
+ list(zip(questions, words)),
+ is_pair=True,
+ boxes=boxes,
+ max_length=max_length,
+ truncation=True,
+ padding="max_length",
+ )
+ self.assert_batch_padded_input_match(input_r, input_p, max_length, pad_token_id)
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ list(zip(questions, words)),
+ is_pair=True,
+ boxes=boxes,
+ padding=True,
+ )
+ input_p = tokenizer_p.batch_encode_plus_boxes(
+ list(zip(questions, words)),
+ is_pair=True,
+ boxes=boxes,
+ padding="longest",
+ )
+ self.assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]), pad_token_id)
+
+ # Using pad on single examples after tokenization
+ words, boxes = self.get_words_and_boxes()
+ input_r = tokenizer_r.encode_plus_boxes(words, boxes=boxes)
+ input_r = tokenizer_r.pad(input_r)
+
+ input_p = tokenizer_r.encode_plus_boxes(words, boxes=boxes)
+ input_p = tokenizer_r.pad(input_p)
+
+ self.assert_padded_input_match(
+ input_r["input_ids"], input_p["input_ids"], len(input_r["input_ids"]), pad_token_id
+ )
+
+ # Using pad on single examples after tokenization
+ input_r = tokenizer_r.encode_plus_boxes(words, boxes=boxes)
+ input_r = tokenizer_r.pad(input_r, max_length=max_length, padding="max_length")
+
+ input_p = tokenizer_r.encode_plus_boxes(words, boxes=boxes)
+ input_p = tokenizer_r.pad(input_p, max_length=max_length, padding="max_length")
+
+ self.assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length, pad_token_id)
+
+ # Using pad after tokenization
+ words, boxes = self.get_words_and_boxes_batch()
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ )
+ input_r = tokenizer_r.pad(input_r)
+
+ input_p = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ )
+ input_p = tokenizer_r.pad(input_p)
+
+ self.assert_batch_padded_input_match(input_r, input_p, len(input_r["input_ids"][0]), pad_token_id)
+
+ # Using pad after tokenization
+ words, boxes = self.get_words_and_boxes_batch()
+ input_r = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ )
+ input_r = tokenizer_r.pad(input_r, max_length=max_length, padding="max_length")
+
+ input_p = tokenizer_r.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ )
+ input_p = tokenizer_r.pad(input_p, max_length=max_length, padding="max_length")
+
+ self.assert_batch_padded_input_match(input_r, input_p, max_length, pad_token_id)
+
+ def test_padding_warning_message_fast_tokenizer(self):
+ if not self.test_rust_tokenizer:
+ return
+
+ words, boxes = self.get_words_and_boxes_batch()
+
+ tokenizer_fast = self.get_rust_tokenizer()
+
+ encoding_fast = tokenizer_fast(
+ words,
+ boxes=boxes,
+ )
+
+ with self.assertLogs("transformers", level="WARNING") as cm:
+ tokenizer_fast.pad(encoding_fast)
+ self.assertEqual(len(cm.records), 1)
+ self.assertIn(
+ "Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to"
+ " encode the text followed by a call to the `pad` method to get a padded encoding.",
+ cm.records[0].message,
+ )
+
+ if not self.test_slow_tokenizer:
+ return
+
+ tokenizer_slow = self.get_tokenizer()
+
+ encoding_slow = tokenizer_slow(
+ words,
+ boxes=boxes,
+ )
+
+ with self.assertLogs(level="WARNING") as cm:
+ # We want to assert there are no warnings, but the 'assertLogs' method does not support that.
+ # Therefore, we are adding a dummy warning, and then we will assert it is the only warning.
+ logger.warning("Dummy warning")
+ tokenizer_slow.pad(encoding_slow)
+ self.assertEqual(len(cm.records), 1)
+ self.assertIn(
+ "Dummy warning",
+ cm.records[0].message,
+ )
+
+ def test_call(self):
+ # Tests that all call wrap to encode_plus and batch_encode_plus
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ # Test not batched
+ words, boxes = self.get_words_and_boxes()
+ encoded_sequences_1 = tokenizer.encode_plus_boxes(words, boxes=boxes)
+ encoded_sequences_2 = tokenizer(words, boxes=boxes)
+ self.assertEqual(encoded_sequences_1, encoded_sequences_2)
+
+ # Test not batched pairs
+ question, words, boxes = self.get_question_words_and_boxes()
+ encoded_sequences_1 = tokenizer.encode_plus_boxes(words, boxes=boxes)
+ encoded_sequences_2 = tokenizer(words, boxes=boxes)
+ self.assertEqual(encoded_sequences_1, encoded_sequences_2)
+
+ # Test batched
+ words, boxes = self.get_words_and_boxes_batch()
+ encoded_sequences_1 = tokenizer.batch_encode_plus_boxes(words, is_pair=False, boxes=boxes)
+ encoded_sequences_2 = tokenizer(words, boxes=boxes)
+ self.assertEqual(encoded_sequences_1, encoded_sequences_2)
+
+ def test_batch_encode_plus_batch_sequence_length(self):
+ # Tests that all encoded values have the correct size
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes_batch()
+
+ encoded_sequences = [
+ tokenizer.encode_plus_boxes(words_example, boxes=boxes_example)
+ for words_example, boxes_example in zip(words, boxes)
+ ]
+ encoded_sequences_batch = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, padding=False
+ )
+ self.assertListEqual(
+ encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch)
+ )
+
+ maximum_length = len(
+ max([encoded_sequence["input_ids"] for encoded_sequence in encoded_sequences], key=len)
+ )
+
+                # check correct behaviour if no pad_token_id exists and add it if needed
+ self._check_no_pad_token_padding(tokenizer, words)
+
+ encoded_sequences_padded = [
+ tokenizer.encode_plus_boxes(
+ words_example, boxes=boxes_example, max_length=maximum_length, padding="max_length"
+ )
+ for words_example, boxes_example in zip(words, boxes)
+ ]
+
+ encoded_sequences_batch_padded = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, padding=True
+ )
+ self.assertListEqual(
+ encoded_sequences_padded,
+ self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch_padded),
+ )
+
+                # check that 'longest' is insensitive to a max length
+ encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, padding=True
+ )
+ encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, max_length=maximum_length + 10, padding="longest"
+ )
+ for key in encoded_sequences_batch_padded_1.keys():
+ self.assertListEqual(
+ encoded_sequences_batch_padded_1[key],
+ encoded_sequences_batch_padded_2[key],
+ )
+
+                # check that 'no_padding' is insensitive to a max length
+ encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, padding=False
+ )
+ encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, max_length=maximum_length + 10, padding=False
+ )
+ for key in encoded_sequences_batch_padded_1.keys():
+ self.assertListEqual(
+ encoded_sequences_batch_padded_1[key],
+ encoded_sequences_batch_padded_2[key],
+ )
+
+ @unittest.skip("batch_encode_plus does not handle overflowing tokens.")
+ def test_batch_encode_plus_overflowing_tokens(self):
+ pass
+
+ def test_batch_encode_plus_padding(self):
+ # Test that padded sequences are equivalent between batch_encode_plus and encode_plus
+
+ # Right padding tests
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes_batch()
+
+ max_length = 100
+
+                # check correct behaviour if no pad_token_id exists and add it if needed
+ self._check_no_pad_token_padding(tokenizer, words)
+
+ encoded_sequences = [
+ tokenizer.encode_plus_boxes(
+ words_example, boxes=boxes_example, max_length=max_length, padding="max_length"
+ )
+ for words_example, boxes_example in zip(words, boxes)
+ ]
+ encoded_sequences_batch = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ self.assertListEqual(
+ encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch)
+ )
+
+ # Left padding tests
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ tokenizer.padding_side = "left"
+ words, boxes = self.get_words_and_boxes_batch()
+
+ max_length = 100
+
+                # check correct behaviour if no pad_token_id exists and add it if needed
+ self._check_no_pad_token_padding(tokenizer, words)
+
+ encoded_sequences = [
+ tokenizer.encode_plus_boxes(
+ words_example, boxes=boxes_example, max_length=max_length, padding="max_length"
+ )
+ for words_example, boxes_example in zip(words, boxes)
+ ]
+ encoded_sequences_batch = tokenizer.batch_encode_plus_boxes(
+ words, is_pair=False, boxes=boxes, max_length=max_length, padding="max_length"
+ )
+ self.assertListEqual(
+ encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch)
+ )
+
+ def test_padding_to_multiple_of(self):
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ if tokenizer.pad_token is None:
+ self.skipTest("No padding token.")
+ else:
+ words, boxes = self.get_words_and_boxes()
+
+ normal_tokens = tokenizer(words, boxes=boxes, padding=True, pad_to_multiple_of=8)
+
+ for key, value in normal_tokens.items():
+ self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
+
+ normal_tokens = tokenizer(words, boxes=boxes, pad_to_multiple_of=8)
+ for key, value in normal_tokens.items():
+ self.assertNotEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
+
+ # Should also work with truncation
+ normal_tokens = tokenizer(words, boxes=boxes, padding=True, truncation=True, pad_to_multiple_of=8)
+ for key, value in normal_tokens.items():
+ self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
+
+ # truncation to something which is not a multiple of pad_to_multiple_of raises an error
+ self.assertRaises(
+ ValueError,
+ tokenizer.__call__,
+ words,
+ boxes=boxes,
+ padding=True,
+ truncation=True,
+ max_length=12,
+ pad_to_multiple_of=8,
+ )
+
+ def test_tokenizer_slow_store_full_signature(self):
+ signature = inspect.signature(self.tokenizer_class.__init__)
+ tokenizer = self.get_tokenizer()
+
+ for parameter_name, parameter in signature.parameters.items():
+ if parameter.default != inspect.Parameter.empty:
+ self.assertIn(parameter_name, tokenizer.init_kwargs)
+
+ def test_build_inputs_with_special_tokens(self):
+ if not self.test_slow_tokenizer:
+ # as we don't have a slow version, we can't compare the outputs between slow and fast versions
+ return
+
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+ # Input tokens id
+ words, boxes = self.get_words_and_boxes()
+ input_simple = tokenizer_p.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ input_pair = tokenizer_p.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+
+ # Generate output
+ output_r = tokenizer_r.build_inputs_with_special_tokens(input_simple)
+ output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple)
+ self.assertEqual(output_p, output_r)
+
+ # Generate pair output
+ output_r = tokenizer_r.build_inputs_with_special_tokens(input_simple, input_pair)
+ output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple, input_pair)
+ self.assertEqual(output_p, output_r)
+
+ def test_special_tokens_mask_input_pairs(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ encoded_sequence_dict = tokenizer.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ add_special_tokens=True,
+ return_special_tokens_mask=True,
+ # add_prefix_space=False,
+ )
+ encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+ special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
+ self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+
+ filtered_sequence = [
+ (x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)
+ ]
+ filtered_sequence = [x for x in filtered_sequence if x is not None]
+ self.assertEqual(encoded_sequence, filtered_sequence)
+
+ def test_special_tokens_mask(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+ # Testing single inputs
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ encoded_sequence_dict = tokenizer.encode_plus_boxes(
+ words, boxes=boxes, add_special_tokens=True, return_special_tokens_mask=True
+ )
+ encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+ special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
+ self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+
+ filtered_sequence = [x for i, x in enumerate(encoded_sequence_w_special) if not special_tokens_mask[i]]
+ self.assertEqual(encoded_sequence, filtered_sequence)
+
+ def test_save_and_load_tokenizer(self):
+ # safety check on max_len default value so we are sure the test works
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ self.assertNotEqual(tokenizer.model_max_length, 42)
+
+ # Now let's start the test
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ # Isolate this from the other tests because we save additional tokens/etc
+ words, boxes = self.get_words_and_boxes()
+ tmpdirname = tempfile.mkdtemp()
+
+ before_tokens = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ before_vocab = tokenizer.get_vocab()
+ tokenizer.save_pretrained(tmpdirname)
+
+ after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
+ after_tokens = after_tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ after_vocab = after_tokenizer.get_vocab()
+ self.assertListEqual(before_tokens, after_tokens)
+ self.assertDictEqual(before_vocab, after_vocab)
+
+ shutil.rmtree(tmpdirname)
+
+ @unittest.skip("Not implemented")
+ def test_right_and_left_truncation(self):
+ pass
+
+ def test_right_and_left_padding(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+ sequence = "Sequence"
+ padding_size = 10
+
+                # check correct behaviour if no pad_token_id exists and add it if needed
+ self._check_no_pad_token_padding(tokenizer, sequence)
+
+ padding_idx = tokenizer.pad_token_id
+
+ # RIGHT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True
+ tokenizer.padding_side = "right"
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes)
+ sequence_length = len(encoded_sequence)
+ padded_sequence = tokenizer.encode_boxes(
+ words, boxes=boxes, max_length=sequence_length + padding_size, padding="max_length"
+ )
+ padded_sequence_length = len(padded_sequence)
+ assert sequence_length + padding_size == padded_sequence_length
+ assert encoded_sequence + [padding_idx] * padding_size == padded_sequence
+
+ # LEFT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True
+ tokenizer.padding_side = "left"
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes)
+ sequence_length = len(encoded_sequence)
+ padded_sequence = tokenizer.encode_boxes(
+ words, boxes=boxes, max_length=sequence_length + padding_size, padding="max_length"
+ )
+ padded_sequence_length = len(padded_sequence)
+ assert sequence_length + padding_size == padded_sequence_length
+ assert [padding_idx] * padding_size + encoded_sequence == padded_sequence
+
+ # RIGHT & LEFT PADDING - Check that nothing is done for 'longest' and 'no_padding'
+ encoded_sequence = tokenizer.encode_boxes(words, boxes=boxes)
+ sequence_length = len(encoded_sequence)
+
+ tokenizer.padding_side = "right"
+ padded_sequence_right = tokenizer.encode_boxes(words, boxes=boxes, padding=True)
+ padded_sequence_right_length = len(padded_sequence_right)
+ assert sequence_length == padded_sequence_right_length
+ assert encoded_sequence == padded_sequence_right
+
+ tokenizer.padding_side = "left"
+ padded_sequence_left = tokenizer.encode_boxes(words, boxes=boxes, padding="longest")
+ padded_sequence_left_length = len(padded_sequence_left)
+ assert sequence_length == padded_sequence_left_length
+ assert encoded_sequence == padded_sequence_left
+
+ tokenizer.padding_side = "right"
+ padded_sequence_right = tokenizer.encode_boxes(words, boxes=boxes)
+ padded_sequence_right_length = len(padded_sequence_right)
+ assert sequence_length == padded_sequence_right_length
+ assert encoded_sequence == padded_sequence_right
+
+ tokenizer.padding_side = "left"
+ padded_sequence_left = tokenizer.encode_boxes(words, boxes=boxes, padding=False)
+ padded_sequence_left_length = len(padded_sequence_left)
+ assert sequence_length == padded_sequence_left_length
+ assert encoded_sequence == padded_sequence_left
+
+ def test_token_type_ids(self):
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ # test 1: single sequence
+ words, boxes = self.get_words_and_boxes()
+
+ output = tokenizer(words, boxes=boxes, return_token_type_ids=True)
+
+ # Assert that the token type IDs have the same length as the input IDs
+ self.assertEqual(len(output["token_type_ids"]), len(output["input_ids"]))
+
+ # Assert that the token type IDs have the same length as the attention mask
+ self.assertEqual(len(output["token_type_ids"]), len(output["attention_mask"]))
+
+ self.assertIn(0, output["token_type_ids"])
+ self.assertNotIn(1, output["token_type_ids"])
+
+ # test 2: two sequences (question + words)
+ question, words, boxes = self.get_question_words_and_boxes()
+
+ output = tokenizer(question, words, boxes, return_token_type_ids=True)
+
+ # Assert that the token type IDs have the same length as the input IDs
+ self.assertEqual(len(output["token_type_ids"]), len(output["input_ids"]))
+
+ # Assert that the token type IDs have the same length as the attention mask
+ self.assertEqual(len(output["token_type_ids"]), len(output["attention_mask"]))
+
+ self.assertIn(0, output["token_type_ids"])
+ self.assertNotIn(1, output["token_type_ids"])
+
+ def test_offsets_mapping(self):
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+ text = ["a", "wonderful", "test"]
+ boxes = [[1, 8, 12, 20] for _ in range(len(text))]
+
+ # No pair
+ tokens_with_offsets = tokenizer_r.encode_plus_boxes(
+ text,
+ boxes=boxes,
+ return_special_tokens_mask=True,
+ return_offsets_mapping=True,
+ add_special_tokens=True,
+ )
+ added_tokens = tokenizer_r.num_special_tokens_to_add(False)
+ offsets = tokens_with_offsets["offset_mapping"]
+
+ # Assert there is the same number of tokens and offsets
+ self.assertEqual(len(offsets), len(tokens_with_offsets["input_ids"]))
+
+                # Assert there are exactly `added_tokens` special tokens
+ self.assertEqual(sum(tokens_with_offsets["special_tokens_mask"]), added_tokens)
+
+ # Pairs
+ text = "what's his name"
+ pair = ["a", "wonderful", "test"]
+ boxes = [[1, 8, 12, 20] for _ in range(len(pair))]
+ tokens_with_offsets = tokenizer_r.encode_plus_boxes(
+ text,
+ pair,
+ boxes=boxes,
+ return_special_tokens_mask=True,
+ return_offsets_mapping=True,
+ add_special_tokens=True,
+ )
+ added_tokens = tokenizer_r.num_special_tokens_to_add(True)
+ offsets = tokens_with_offsets["offset_mapping"]
+
+ # Assert there is the same number of tokens and offsets
+ self.assertEqual(len(offsets), len(tokens_with_offsets["input_ids"]))
+
+                # Assert there are exactly `added_tokens` special tokens
+ self.assertEqual(sum(tokens_with_offsets["special_tokens_mask"]), added_tokens)
+
+ @require_torch
+ @slow
+ def test_torch_encode_plus_sent_to_model(self):
+ import torch
+
+ from transformers import MODEL_MAPPING, TOKENIZER_MAPPING
+
+ MODEL_TOKENIZER_MAPPING = merge_model_tokenizer_mappings(MODEL_MAPPING, TOKENIZER_MAPPING)
+
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ if tokenizer.__class__ not in MODEL_TOKENIZER_MAPPING:
+ return
+
+ config_class, model_class = MODEL_TOKENIZER_MAPPING[tokenizer.__class__]
+ config = config_class()
+
+ if config.is_encoder_decoder or config.pad_token_id is None:
+ return
+
+ model = model_class(config)
+
+ # Make sure the model contains at least the full vocabulary size in its embedding matrix
+ is_using_common_embeddings = hasattr(model.get_input_embeddings(), "weight")
+ assert (
+ (model.get_input_embeddings().weight.shape[0] >= len(tokenizer))
+ if is_using_common_embeddings
+ else True
+ )
+
+ # Build sequence
+ words, boxes = self.get_words_and_boxes()
+ encoded_sequence = tokenizer.encode_plus_boxes(words, boxes=boxes, return_tensors="pt")
+ batch_encoded_sequence = tokenizer.batch_encode_plus_boxes(
+ [words, words], [boxes, boxes], return_tensors="pt"
+ )
+ # This should not fail
+
+ with torch.no_grad(): # saves some time
+ model(**encoded_sequence)
+ model(**batch_encoded_sequence)
+
+ def test_rust_and_python_full_tokenizers(self):
+ if not self.test_rust_tokenizer:
+ return
+
+ if not self.test_slow_tokenizer:
+ # as we don't have a slow version, we can't compare the outputs between slow and fast versions
+ return
+
+ tokenizer = self.get_tokenizer()
+ rust_tokenizer = self.get_rust_tokenizer()
+
+ words, boxes = self.get_words_and_boxes()
+
+ ids = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ rust_ids = rust_tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ self.assertListEqual(ids, rust_ids)
+
+ ids = tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=True)
+ rust_ids = rust_tokenizer.encode_boxes(words, boxes=boxes, add_special_tokens=True)
+ self.assertListEqual(ids, rust_ids)
+
+ def test_tokenization_python_rust_equals(self):
+ if not self.test_slow_tokenizer:
+ # as we don't have a slow version, we can't compare the outputs between slow and fast versions
+ return
+
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+ words, boxes = self.get_words_and_boxes()
+
+ # Ensure basic input match
+ input_p = tokenizer_p.encode_plus_boxes(words, boxes=boxes)
+ input_r = tokenizer_r.encode_plus_boxes(words, boxes=boxes)
+
+ for key in filter(
+ lambda x: x in ["input_ids", "token_type_ids", "attention_mask", "bbox"], input_p.keys()
+ ):
+ self.assertSequenceEqual(input_p[key], input_r[key])
+
+ input_pairs_p = tokenizer_p.encode_plus_boxes(words, boxes=boxes)
+ input_pairs_r = tokenizer_r.encode_plus_boxes(words, boxes=boxes)
+
+ for key in filter(
+ lambda x: x in ["input_ids", "token_type_ids", "attention_mask", "bbox"], input_p.keys()
+ ):
+ self.assertSequenceEqual(input_pairs_p[key], input_pairs_r[key])
+
+ words = ["hello" for _ in range(1000)]
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(1000)]
+
+ # Ensure truncation match
+ input_p = tokenizer_p.encode_plus_boxes(words, boxes=boxes, max_length=512, truncation=True)
+ input_r = tokenizer_r.encode_plus_boxes(words, boxes=boxes, max_length=512, truncation=True)
+
+ for key in filter(
+ lambda x: x in ["input_ids", "token_type_ids", "attention_mask", "bbox"], input_p.keys()
+ ):
+ self.assertSequenceEqual(input_p[key], input_r[key])
+
+ # Ensure truncation with stride match
+ input_p = tokenizer_p.encode_plus_boxes(
+ words, boxes=boxes, max_length=512, truncation=True, stride=3, return_overflowing_tokens=True
+ )
+ input_r = tokenizer_r.encode_plus_boxes(
+ words, boxes=boxes, max_length=512, truncation=True, stride=3, return_overflowing_tokens=True
+ )
+
+ for key in filter(
+ lambda x: x in ["input_ids", "token_type_ids", "attention_mask", "bbox"], input_p.keys()
+ ):
+ self.assertSequenceEqual(input_p[key], input_r[key][0])
+
+ def test_embeded_special_tokens(self):
+ if not self.test_slow_tokenizer:
+ # as we don't have a slow version, we can't compare the outputs between slow and fast versions
+ return
+
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ words, boxes = self.get_words_and_boxes()
+ tokens_r = tokenizer_r.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ add_special_tokens=True,
+ )
+ tokens_p = tokenizer_p.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ add_special_tokens=True,
+ )
+
+ for key in tokens_p.keys():
+ self.assertEqual(tokens_r[key], tokens_p[key])
+
+ if "token_type_ids" in tokens_r:
+ self.assertEqual(sum(tokens_r["token_type_ids"]), sum(tokens_p["token_type_ids"]))
+
+ tokens_r = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
+ tokens_p = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
+ self.assertSequenceEqual(tokens_r, tokens_p)
+
+ def test_compare_add_special_tokens(self):
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+ simple_num_special_tokens_to_add = tokenizer_r.num_special_tokens_to_add(pair=False)
+
+ words, boxes = self.get_words_and_boxes()
+ # tokenize()
+ no_special_tokens = tokenizer_r.tokenize(" ".join(words), add_special_tokens=False)
+ with_special_tokens = tokenizer_r.tokenize(" ".join(words), add_special_tokens=True)
+ self.assertEqual(len(no_special_tokens), len(with_special_tokens) - simple_num_special_tokens_to_add)
+
+ # encode()
+ no_special_tokens = tokenizer_r.encode_boxes(words, boxes=boxes, add_special_tokens=False)
+ with_special_tokens = tokenizer_r.encode_boxes(words, boxes=boxes, add_special_tokens=True)
+ self.assertEqual(len(no_special_tokens), len(with_special_tokens) - simple_num_special_tokens_to_add)
+
+ # encode_plus()
+ no_special_tokens = tokenizer_r.encode_plus_boxes(words, boxes=boxes, add_special_tokens=False)
+ with_special_tokens = tokenizer_r.encode_plus_boxes(words, boxes=boxes, add_special_tokens=True)
+ for key in no_special_tokens.keys():
+ self.assertEqual(
+ len(no_special_tokens[key]),
+ len(with_special_tokens[key]) - simple_num_special_tokens_to_add,
+ )
+
+ # # batch_encode_plus
+ words, boxes = self.get_words_and_boxes_batch()
+
+ no_special_tokens = tokenizer_r.batch_encode_plus_boxes(words, boxes=boxes, add_special_tokens=False)
+ with_special_tokens = tokenizer_r.batch_encode_plus_boxes(words, boxes=boxes, add_special_tokens=True)
+ for key in no_special_tokens.keys():
+ for i_no, i_with in zip(no_special_tokens[key], with_special_tokens[key]):
+ self.assertEqual(len(i_no), len(i_with) - simple_num_special_tokens_to_add)
+
+ @slow
+ def test_udop_truncation_integration_test(self):
+ words, boxes = self.get_words_and_boxes()
+
+ tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large", model_max_length=512)
+
+ for i in range(12, 512):
+ new_encoded_inputs = tokenizer.encode_boxes(words, boxes=boxes, max_length=i, truncation=True)
+
+ # Ensure that the input IDs are less than the max length defined.
+ self.assertLessEqual(len(new_encoded_inputs), i)
+
+ tokenizer.model_max_length = 20
+ new_encoded_inputs = tokenizer.encode_boxes(words, boxes=boxes, truncation=True)
+ dropped_encoded_inputs = tokenizer.encode_boxes(words, boxes=boxes, truncation=True)
+
+ # Ensure that the input IDs are still truncated when no max_length is specified
+ self.assertListEqual(new_encoded_inputs, dropped_encoded_inputs)
+ self.assertLessEqual(len(new_encoded_inputs), 20)
+
+ @is_pt_tf_cross_test
+ def test_batch_encode_plus_tensors(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes_batch()
+
+ # A tensor cannot be built from sequences which are not the same size
+ self.assertRaises(
+ ValueError, tokenizer.batch_encode_plus_boxes, words, boxes=boxes, return_tensors="pt"
+ )
+ self.assertRaises(
+ ValueError, tokenizer.batch_encode_plus_boxes, words, boxes=boxes, return_tensors="tf"
+ )
+
+ if tokenizer.pad_token_id is None:
+ self.assertRaises(
+ ValueError,
+ tokenizer.batch_encode_plus_boxes,
+ words,
+ boxes=boxes,
+ padding=True,
+ return_tensors="pt",
+ )
+ self.assertRaises(
+ ValueError,
+ tokenizer.batch_encode_plus_boxes,
+ words,
+ boxes=boxes,
+ padding="longest",
+ return_tensors="tf",
+ )
+ else:
+ pytorch_tensor = tokenizer.batch_encode_plus_boxes(
+ words, boxes=boxes, padding=True, return_tensors="pt"
+ )
+ tensorflow_tensor = tokenizer.batch_encode_plus_boxes(
+ words, boxes=boxes, padding="longest", return_tensors="tf"
+ )
+ encoded_sequences = tokenizer.batch_encode_plus_boxes(words, boxes=boxes, padding=True)
+
+ for key in encoded_sequences.keys():
+ pytorch_value = pytorch_tensor[key].tolist()
+ tensorflow_value = tensorflow_tensor[key].numpy().tolist()
+ encoded_value = encoded_sequences[key]
+
+ self.assertEqual(pytorch_value, tensorflow_value, encoded_value)
+
+ def test_sequence_ids(self):
+ tokenizers = self.get_tokenizers()
+ for tokenizer in tokenizers:
+ if not tokenizer.is_fast:
+ continue
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ seq_0 = "Test this method."
+ seq_1 = ["With", "these", "inputs."]
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(len(seq_1))]
+
+ # We want sequence 0 and sequence 1 to be tagged
+ # respectively with 0 and 1 token_ids
+ # (regardless of whether the model uses token type ids)
+ # We use this assumption in the QA pipeline among other places
+ output = tokenizer(seq_0.split(), boxes=boxes)
+ self.assertIn(0, output.sequence_ids())
+
+ output = tokenizer(seq_0, seq_1, boxes=boxes)
+ self.assertIn(0, output.sequence_ids())
+ self.assertIn(1, output.sequence_ids())
+
+ if tokenizer.num_special_tokens_to_add(pair=True):
+ self.assertIn(None, output.sequence_ids())
+
+ def test_special_tokens_initialization(self):
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ added_tokens = [AddedToken("<special>", lstrip=True)]
+
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(
+ pretrained_name, additional_special_tokens=added_tokens, **kwargs
+ )
+ words = "Hey this is a token".split()
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(len(words))]
+ r_output = tokenizer_r.encode_boxes(words, boxes=boxes)
+
+ special_token_id = tokenizer_r.encode_boxes(
+ ["<special>"], boxes=[1000, 1000, 1000, 1000], add_special_tokens=False
+ )[0]
+
+ self.assertTrue(special_token_id in r_output)
+
+ if self.test_slow_tokenizer:
+ tokenizer_cr = self.rust_tokenizer_class.from_pretrained(
+ pretrained_name, additional_special_tokens=added_tokens, **kwargs, from_slow=True
+ )
+ tokenizer_p = self.tokenizer_class.from_pretrained(
+ pretrained_name, additional_special_tokens=added_tokens, **kwargs
+ )
+
+ words = "Hey this is a token".split()
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(len(words))]
+
+ p_output = tokenizer_p.encode_boxes(words, boxes=boxes)
+ cr_output = tokenizer_cr.encode_boxes(words, boxes=boxes)
+
+ self.assertEqual(p_output, r_output)
+ self.assertEqual(cr_output, r_output)
+ self.assertTrue(special_token_id in p_output)
+ self.assertTrue(special_token_id in cr_output)
+
+ def test_training_new_tokenizer(self):
+ # This feature only exists for fast tokenizers
+ if not self.test_rust_tokenizer:
+ return
+
+ tokenizer = self.get_rust_tokenizer()
+ new_tokenizer = tokenizer.train_new_from_iterator(SMALL_TRAINING_CORPUS, 100)
+
+ # Test we can use the new tokenizer with something not seen during training
+ text = [["this", "is", "the"], ["how", "are", "you"]]
+ boxes = [[[1, 2, 3, 4], [5, 6, 7, 8], [1, 3, 4, 8]], [[5, 6, 7, 8], [4, 5, 6, 7], [3, 9, 2, 7]]]
+ inputs = new_tokenizer(text, boxes=boxes)
+ self.assertEqual(len(inputs["input_ids"]), 2)
+ decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
+ expected_result = "this is the"
+
+ if tokenizer.backend_tokenizer.normalizer is not None:
+ expected_result = tokenizer.backend_tokenizer.normalizer.normalize_str(expected_result)
+ self.assertEqual(expected_result, decoded_input)
+
+ # We check that the parameters of the tokenizer remained the same
+ # Check we have the same number of added_tokens for both pair and non-pair inputs.
+ self.assertEqual(tokenizer.num_special_tokens_to_add(False), new_tokenizer.num_special_tokens_to_add(False))
+ self.assertEqual(tokenizer.num_special_tokens_to_add(True), new_tokenizer.num_special_tokens_to_add(True))
+
+ # Check we have the correct max_length for both pair and non-pair inputs.
+ self.assertEqual(tokenizer.max_len_single_sentence, new_tokenizer.max_len_single_sentence)
+ self.assertEqual(tokenizer.max_len_sentences_pair, new_tokenizer.max_len_sentences_pair)
+
+ # Assert the set of special tokens match as we didn't ask to change them
+ self.assertSequenceEqual(
+ tokenizer.all_special_tokens_extended,
+ new_tokenizer.all_special_tokens_extended,
+ )
+
+ self.assertDictEqual(tokenizer.special_tokens_map, new_tokenizer.special_tokens_map)
+
+ def test_training_new_tokenizer_with_special_tokens_change(self):
+ # This feature only exists for fast tokenizers
+ if not self.test_rust_tokenizer:
+ return
+
+ tokenizer = self.get_rust_tokenizer()
+ # Test with a special tokens map
+ class_signature = inspect.signature(tokenizer.__class__)
+ if "cls_token" in class_signature.parameters:
+ new_tokenizer = tokenizer.train_new_from_iterator(
+ SMALL_TRAINING_CORPUS, 100, special_tokens_map={tokenizer.cls_token: "<cls>"}
+ )
+ cls_id = new_tokenizer.get_vocab()["<cls>"]
+ self.assertEqual(new_tokenizer.cls_token, "<cls>")
+ self.assertEqual(new_tokenizer.cls_token_id, cls_id)
+
+ # Create a new mapping from the special tokens defined in the original tokenizer
+ special_tokens_list = SpecialTokensMixin.SPECIAL_TOKENS_ATTRIBUTES.copy()
+ special_tokens_list.remove("additional_special_tokens")
+ special_tokens_map = {}
+ for token in special_tokens_list:
+ # Get the private one to avoid unnecessary warnings.
+ if getattr(tokenizer, f"_{token}") is not None:
+ special_token = getattr(tokenizer, token)
+ special_tokens_map[special_token] = f"{special_token}a"
+
+ # Train new tokenizer
+ new_tokenizer = tokenizer.train_new_from_iterator(
+ SMALL_TRAINING_CORPUS, 100, special_tokens_map=special_tokens_map
+ )
+
+ # Check the changes
+ for token in special_tokens_list:
+ # Get the private one to avoid unnecessary warnings.
+ if getattr(tokenizer, f"_{token}") is None:
+ continue
+ special_token = getattr(tokenizer, token)
+ if special_token in special_tokens_map:
+ new_special_token = getattr(new_tokenizer, token)
+ self.assertEqual(special_tokens_map[special_token], new_special_token)
+
+ new_id = new_tokenizer.get_vocab()[new_special_token]
+ self.assertEqual(getattr(new_tokenizer, f"{token}_id"), new_id)
+
+ # Check if the AddedToken / string format has been kept
+ for special_token in tokenizer.all_special_tokens_extended:
+ if isinstance(special_token, AddedToken) and special_token.content not in special_tokens_map:
+ # The special token must appear identically in the list of the new tokenizer.
+ self.assertTrue(
+ special_token in new_tokenizer.all_special_tokens_extended,
+ f"'{special_token}' should be in {new_tokenizer.all_special_tokens_extended}",
+ )
+ elif isinstance(special_token, AddedToken):
+ # The special token must appear in the list of the new tokenizer as an object of type AddedToken with
+ # the same parameters as the old AddedToken except the content that the user has requested to change.
+ special_token_str = special_token.content
+ new_special_token_str = special_tokens_map[special_token_str]
+
+ find = False
+ for candidate in new_tokenizer.all_special_tokens_extended:
+ if (
+ isinstance(candidate, AddedToken)
+ and candidate.content == new_special_token_str
+ and candidate.lstrip == special_token.lstrip
+ and candidate.rstrip == special_token.rstrip
+ and candidate.normalized == special_token.normalized
+ and candidate.single_word == special_token.single_word
+ ):
+ find = True
+ break
+ self.assertTrue(
+ find,
+ f"'{new_special_token_str}' doesn't appear in the list "
+ f"'{new_tokenizer.all_special_tokens_extended}' as an AddedToken with the same parameters as "
+ f"'{special_token}' in the list {tokenizer.all_special_tokens_extended}",
+ )
+ elif special_token not in special_tokens_map:
+ # The special token must appear identically in the list of the new tokenizer.
+ self.assertTrue(
+ special_token in new_tokenizer.all_special_tokens_extended,
+ f"'{special_token}' should be in {new_tokenizer.all_special_tokens_extended}",
+ )
+
+ else:
+ # The special token must appear in the list of the new tokenizer as an object of type string.
+ self.assertTrue(special_tokens_map[special_token] in new_tokenizer.all_special_tokens_extended)
+
+ # Test we can use the new tokenizer with something not seen during training
+ words = [["this", "is"], ["hello", "🤗"]]
+ boxes = [[[1, 2, 3, 4], [5, 6, 7, 8]], [[1, 2, 3, 4], [5, 6, 7, 8]]]
+ inputs = new_tokenizer(words, boxes=boxes)
+ self.assertEqual(len(inputs["input_ids"]), 2)
+ decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
+ expected_result = "this is"
+
+ if tokenizer.backend_tokenizer.normalizer is not None:
+ expected_result = tokenizer.backend_tokenizer.normalizer.normalize_str(expected_result)
+ self.assertEqual(expected_result, decoded_input)
+
+ def test_prepare_for_model(self):
+ tokenizers = self.get_tokenizers(do_lower_case=False)
+ for tokenizer in tokenizers:
+ # only test prepare_for_model for the slow tokenizer
+ if tokenizer.__class__.__name__ == "UdopTokenizerFast":
+ continue
+ with self.subTest(f"{tokenizer.__class__.__name__}"):
+ words, boxes = self.get_words_and_boxes()
+ prepared_input_dict = tokenizer.prepare_for_model_boxes(words, boxes=boxes, add_special_tokens=True)
+
+ input_dict = tokenizer.encode_plus_boxes(words, boxes=boxes, add_special_tokens=True)
+
+ self.assertEqual(input_dict, prepared_input_dict)
+
+ def test_padding_different_model_input_name(self):
+ if not self.test_slow_tokenizer:
+ # as we don't have a slow version, we can't compare the outputs between slow and fast versions
+ return
+
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+ tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ tokenizer_p = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+ self.assertEqual(tokenizer_p.pad_token_id, tokenizer_r.pad_token_id)
+ pad_token_id = tokenizer_p.pad_token_id
+
+ words, boxes = self.get_words_and_boxes_batch()
+
+ input_r = tokenizer_r.batch_encode_plus_boxes(words, boxes=boxes)
+ input_p = tokenizer_r.batch_encode_plus_boxes(words, boxes=boxes)
+
+ # rename encoded batch to "inputs"
+ input_r["inputs"] = input_r[tokenizer_r.model_input_names[0]]
+ del input_r[tokenizer_r.model_input_names[0]]
+
+ input_p["inputs"] = input_p[tokenizer_p.model_input_names[0]]
+ del input_p[tokenizer_p.model_input_names[0]]
+
+ # Renaming `input_ids` to `inputs`
+ tokenizer_r.model_input_names = ["inputs"] + tokenizer_r.model_input_names[1:]
+ tokenizer_p.model_input_names = ["inputs"] + tokenizer_p.model_input_names[1:]
+
+ input_r = tokenizer_r.pad(input_r, padding="longest")
+ input_p = tokenizer_r.pad(input_p, padding="longest")
+
+ max_length = len(input_p["inputs"][0])
+ self.assert_batch_padded_input_match(
+ input_r, input_p, max_length, pad_token_id, model_main_input_name="inputs"
+ )
+
+ def test_batch_encode_dynamic_overflowing(self):
+ """
+ When calling batch_encode with multiple sequences, it can return a different number of
+ overflowing encodings for each sequence:
+ [
+ Sequence 1: [Encoding 1, Encoding 2],
+ Sequence 2: [Encoding 1],
+ Sequence 3: [Encoding 1, Encoding 2, ... Encoding N]
+ ]
+ This needs to be padded so that it can be represented as a tensor
+ """
+ for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+ tokenizer = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+
+ with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name}, {tokenizer.__class__.__name__})"):
+ if is_torch_available():
+ returned_tensor = "pt"
+ elif is_tf_available():
+ returned_tensor = "tf"
+ else:
+ returned_tensor = "jax"
+
+ # Single example
+ words, boxes = self.get_words_and_boxes()
+ tokens = tokenizer.encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=6,
+ padding=True,
+ truncation=True,
+ return_tensors=returned_tensor,
+ return_overflowing_tokens=True,
+ )
+
+ for key in filter(lambda x: "overflow_to_sample_mapping" not in x, tokens.keys()):
+ if key != "bbox":
+ self.assertEqual(len(tokens[key].shape), 2)
+ else:
+ self.assertEqual(len(tokens[key].shape), 3)
+
+ # Batch of examples
+ # For these 2 examples, 3 training examples will be created
+ words, boxes = self.get_words_and_boxes_batch()
+ tokens = tokenizer.batch_encode_plus_boxes(
+ words,
+ boxes=boxes,
+ max_length=6,
+ padding=True,
+ truncation="only_first",
+ return_tensors=returned_tensor,
+ return_overflowing_tokens=True,
+ )
+
+ for key in filter(lambda x: "overflow_to_sample_mapping" not in x, tokens.keys()):
+ if key != "bbox":
+ self.assertEqual(len(tokens[key].shape), 2)
+ self.assertEqual(tokens[key].shape[-1], 6)
+ else:
+ self.assertEqual(len(tokens[key].shape), 3)
+ self.assertEqual(tokens[key].shape[-1], 4)
+
+ @unittest.skip("TO DO: overwrite this very extensive test.")
+ def test_alignement_methods(self):
+ pass
+
+ @unittest.skip("UDOP tokenizer requires boxes besides sequences.")
+ def test_maximum_encoding_length_pair_input(self):
+ pass
+
+ @unittest.skip("UDOP tokenizer requires boxes besides sequences.")
+ def test_maximum_encoding_length_single_input(self):
+ pass
+
+ @unittest.skip("UDOP tokenizer requires boxes besides sequences.")
+ def test_pretokenized_inputs(self):
+ pass
+
+ @unittest.skip("UDOP tokenizer always expects pretokenized inputs.")
+ def test_compare_pretokenized_inputs(self):
+ pass
+
+ @unittest.skip("UDOP fast tokenizer does not support prepare_for_model")
+ def test_compare_prepare_for_model(self):
+ pass
+
+ @slow
+ def test_only_label_first_subword(self):
+ words = ["hello", "niels"]
+ boxes = [[1000, 1000, 1000, 1000] for _ in range(len(words))]
+ word_labels = [0, 1]
+
+ # test slow tokenizer
+ tokenizer_p = UdopTokenizer.from_pretrained("microsoft/udop-large")
+ encoding = tokenizer_p(words, boxes=boxes, word_labels=word_labels)
+ self.assertListEqual(encoding.labels, [0, 1, -100, -100, -100])
+
+ tokenizer_p = UdopTokenizer.from_pretrained("microsoft/udop-large", only_label_first_subword=False)
+ encoding = tokenizer_p(words, boxes=boxes, word_labels=word_labels)
+ self.assertListEqual(encoding.labels, [0, 1, 1, 1, -100])
+
+ # test fast tokenizer
+ tokenizer_r = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+ encoding = tokenizer_r(words, boxes=boxes, word_labels=word_labels)
+ self.assertListEqual(encoding.labels, [0, 1, -100, -100, -100])
+
+ tokenizer_r = UdopTokenizerFast.from_pretrained("microsoft/udop-large", only_label_first_subword=False)
+ encoding = tokenizer_r(words, boxes=boxes, word_labels=word_labels)
+ self.assertListEqual(encoding.labels, [0, 1, 1, 1, -100])
+
+ @slow
+ def test_udop_integration_test(self):
+ tokenizer_p = UdopTokenizer.from_pretrained("microsoft/udop-large")
+ tokenizer_r = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+
+ # There are 3 cases:
+ # CASE 1: document image classification (training + inference), document image token classification (inference),
+ # in which case only words and normalized bounding boxes are provided to the tokenizer
+ # CASE 2: document image token classification (training),
+ # in which case one also provides word labels to the tokenizer
+ # CASE 3: document image visual question answering (inference),
+ # in which case one also provides a question to the tokenizer
+
+ # We need to test all 3 cases both on batched and non-batched inputs.
+
+ # CASE 1: not batched
+ words, boxes = self.get_words_and_boxes()
+
+ # fmt: off
+ expected_results = {'input_ids': [3, 9, 10088, 120, 794, 21820, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'bbox': [[423, 237, 440, 251], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [961, 885, 992, 912], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} # noqa: E231
+ # fmt: on
+
+ encoding_p = tokenizer_p(words, boxes=boxes, padding="max_length", max_length=20)
+ encoding_r = tokenizer_r(words, boxes=boxes, padding="max_length", max_length=20)
+ self.assertDictEqual(dict(encoding_p), expected_results)
+ self.assertDictEqual(dict(encoding_r), expected_results)
+
+ # CASE 1: batched
+ words, boxes = self.get_words_and_boxes_batch()
+
+ # fmt: off
+ expected_results = {'input_ids': [[3, 9, 10088, 120, 794, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [21820, 82, 564, 19, 3, 17396, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'bbox': [[[423, 237, 440, 251], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[961, 885, 992, 912], [256, 38, 330, 58], [256, 38, 330, 58], [336, 42, 353, 57], [34, 42, 66, 69], [34, 42, 66, 69], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]} # noqa: E231
+ # fmt: on
+
+ encoding_p = tokenizer_p(words, boxes=boxes, padding="max_length", max_length=20)
+ encoding_r = tokenizer_r(words, boxes=boxes, padding="max_length", max_length=20)
+ self.assertDictEqual(dict(encoding_p), expected_results)
+ self.assertDictEqual(dict(encoding_r), expected_results)
+
+ # CASE 2: not batched
+ words, boxes = self.get_words_and_boxes()
+ word_labels = [1, 2, 3, 4]
+
+ # fmt: off
+ expected_results = {'input_ids': [3, 9, 10088, 120, 794, 21820, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'bbox': [[423, 237, 440, 251], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [961, 885, 992, 912], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'labels': [1, -100, 2, -100, 3, 4, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]} # noqa: E231
+ # fmt: on
+
+ encoding_p = tokenizer_p(words, boxes=boxes, word_labels=word_labels, padding="max_length", max_length=20)
+ encoding_r = tokenizer_r(words, boxes=boxes, word_labels=word_labels, padding="max_length", max_length=20)
+
+ for key in expected_results:
+ self.assertListEqual(encoding_p[key], encoding_r[key])
+
+ self.assertDictEqual(dict(encoding_p), expected_results)
+ self.assertDictEqual(dict(encoding_r), expected_results)
+
+ # CASE 2: batched
+ words, boxes = self.get_words_and_boxes_batch()
+ word_labels = [[1, 2, 3], [2, 46, 17, 22, 3]]
+
+ # fmt: off
+ expected_results = {'input_ids': [[3, 9, 10088, 120, 794, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [21820, 82, 564, 19, 3, 17396, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'bbox': [[[423, 237, 440, 251], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[961, 885, 992, 912], [256, 38, 330, 58], [256, 38, 330, 58], [336, 42, 353, 57], [34, 42, 66, 69], [34, 42, 66, 69], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]], 'labels': [[1, -100, 2, -100, 3, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100], [2, 46, 17, 22, 3, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]} # noqa: E231
+ # fmt: on
+
+ encoding_p = tokenizer_p(words, boxes=boxes, word_labels=word_labels, padding="max_length", max_length=20)
+ encoding_r = tokenizer_r(words, boxes=boxes, word_labels=word_labels, padding="max_length", max_length=20)
+ self.assertDictEqual(dict(encoding_p), expected_results)
+ self.assertDictEqual(dict(encoding_r), expected_results)
+
+ # CASE 3: not batched
+ question, words, boxes = self.get_question_words_and_boxes()
+
+ # fmt: off
+ expected_results = {'input_ids': [125, 31, 7, 112, 564, 58, 1, 3, 9, 10088, 120, 794, 1, 0, 0, 0, 0, 0, 0, 0], 'bbox': [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1000, 1000, 1000, 1000], [423, 237, 440, 251], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]} # noqa: E231
+ # fmt: on
+
+ encoding_p = tokenizer_p(question, words, boxes, padding="max_length", max_length=20)
+ encoding_r = tokenizer_r(question, words, boxes, padding="max_length", max_length=20)
+ self.assertDictEqual(dict(encoding_p), expected_results)
+ self.assertDictEqual(dict(encoding_r), expected_results)
+
+ # CASE 3: batched
+ questions, words, boxes = self.get_question_words_and_boxes_batch()
+
+ # fmt: off
+ expected_results = {'input_ids': [[125, 31, 7, 112, 564, 58, 1, 3, 9, 10088, 120, 794, 1, 0, 0, 0, 0, 0, 0, 0], [149, 19, 3, 88, 718, 58, 1, 125, 3, 9, 50, 99, 1807, 17, 29, 1, 0, 0, 0, 0]], 'bbox': [[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1000, 1000, 1000, 1000], [423, 237, 440, 251], [423, 237, 440, 251], [427, 272, 441, 287], [427, 272, 441, 287], [419, 115, 437, 129], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1000, 1000, 1000, 1000], [256, 38, 330, 58], [256, 38, 330, 58], [256, 38, 330, 58], [336, 42, 353, 57], [336, 42, 353, 57], [34, 42, 66, 69], [34, 42, 66, 69], [34, 42, 66, 69], [1000, 1000, 1000, 1000], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]} # noqa: E231
+ # fmt: on
+
+ encoding_p = tokenizer_p(questions, words, boxes, padding="max_length", max_length=20)
+ encoding_r = tokenizer_r(questions, words, boxes, padding="max_length", max_length=20)
+ self.assertDictEqual(dict(encoding_p), expected_results)
+ self.assertDictEqual(dict(encoding_r), expected_results)
+
+ @unittest.skip("Doesn't support another framework than PyTorch")
+ def test_np_encode_plus_sent_to_model(self):
+ pass
+
+ @unittest.skip("Doesn't use SentencePiece")
+ def test_sentencepiece_tokenize_and_convert_tokens_to_string(self):
+ pass
+
+ @unittest.skip("Doesn't use SentencePiece")
+ def test_sentencepiece_tokenize_and_decode(self):
+ pass
+
+ def test_text_target(self):
+ tokenizer_p = UdopTokenizer.from_pretrained("microsoft/udop-large")
+ tokenizer_r = UdopTokenizerFast.from_pretrained("microsoft/udop-large")
+
+ text = "hello world"
+ expected_decoding = "hello world"
+
+ # should raise an error if we don't provide it using the `text_target` argument
+ with self.assertRaises(ValueError):
+ tokenizer_p(text)
+
+ encoding_p = tokenizer_p(text_target=text)
+ encoding_r = tokenizer_r(text_target=text)
+
+ self.assertListEqual(encoding_p["input_ids"], [21820, 296, 1])
+ self.assertListEqual(encoding_p["attention_mask"], [1, 1, 1])
+ self.assertDictEqual(dict(encoding_p), dict(encoding_r))
+ self.assertEqual(tokenizer_p.decode(encoding_p["input_ids"]), expected_decoding)
diff --git a/utils/check_config_attributes.py b/utils/check_config_attributes.py
index da4a1210357daf..fae3ed8da0b4ef 100644
--- a/utils/check_config_attributes.py
+++ b/utils/check_config_attributes.py
@@ -84,6 +84,8 @@
"ClapAudioConfig": ["num_classes"],
# Not used, but providing useful information to users
"SpeechT5HifiGanConfig": ["sampling_rate"],
+ # used internally in the configuration class file
+ "UdopConfig": ["feed_forward_proj"],
# Actually used in the config or generation config, in that case necessary for the sub-components generation
"SeamlessM4TConfig": [
"max_new_tokens",
diff --git a/utils/check_repo.py b/utils/check_repo.py
index 7cc06c6781164c..44c99194f309a2 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -61,6 +61,7 @@
PRIVATE_MODELS = [
"AltRobertaModel",
"DPRSpanPredictor",
+ "UdopStack",
"LongT5Stack",
"RealmBertModel",
"T5Stack",
@@ -304,6 +305,7 @@
"SeamlessM4TCodeHifiGan",
"SeamlessM4TForSpeechToSpeech", # no auto class for speech-to-speech
"TvpForVideoGrounding",
+ "UdopForConditionalGeneration",
"SeamlessM4Tv2NARTextToUnitModel",
"SeamlessM4Tv2NARTextToUnitForConditionalGeneration",
"SeamlessM4Tv2CodeHifiGan",
From e9476832942a19cf99354776ef112babc83c139a Mon Sep 17 00:00:00 2001
From: njackman-2344 <110741503+njackman-2344@users.noreply.github.com>
Date: Mon, 4 Mar 2024 13:57:51 -0800
Subject: [PATCH 077/549] [Docs] Spanish Translation -Torchscript md & Trainer
md (#29310)
* torchscript and trainer md es translation
* corrected md es files and even corrected spelling in en md
* made es corrections to trainer.md
* deleted entrenamiento... title on yml
* placed entrenamiento in right place
---
docs/source/en/trainer.md | 2 +-
docs/source/es/_toctree.yml | 4 +
docs/source/es/torchscript.md | 167 ++++++++++++++
docs/source/es/trainer.md | 409 ++++++++++++++++++++++++++++++++++
4 files changed, 581 insertions(+), 1 deletion(-)
create mode 100644 docs/source/es/torchscript.md
create mode 100644 docs/source/es/trainer.md
diff --git a/docs/source/en/trainer.md b/docs/source/en/trainer.md
index 22ef9a0c160e9c..65bfa4176dd2a9 100644
--- a/docs/source/en/trainer.md
+++ b/docs/source/en/trainer.md
@@ -104,7 +104,7 @@ trainer.train(resume_from_checkpoint="your-model/checkpoint-1000")
You can save your checkpoints (the optimizer state is not saved by default) to the Hub by setting `push_to_hub=True` in [`TrainingArguments`] to commit and push them. Other options for deciding how your checkpoints are saved are set up in the [`hub_strategy`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.hub_strategy) parameter:
* `hub_strategy="checkpoint"` pushes the latest checkpoint to a subfolder named "last-checkpoint" from which you can resume training
-* `hug_strategy="all_checkpoints"` pushes all checkpoints to the directory defined in `output_dir` (you'll see one checkpoint per folder in your model repository)
+* `hub_strategy="all_checkpoints"` pushes all checkpoints to the directory defined in `output_dir` (you'll see one checkpoint per folder in your model repository)
When you resume training from a checkpoint, the [`Trainer`] tries to keep the Python, NumPy, and PyTorch RNG states the same as they were when the checkpoint was saved. But because PyTorch has various non-deterministic default settings, the RNG states aren't guaranteed to be the same. If you want to enable full determinism, take a look at the [Controlling sources of randomness](https://pytorch.org/docs/stable/notes/randomness#controlling-sources-of-randomness) guide to learn what you can enable to make your training fully deterministic. Keep in mind though that by making certain settings deterministic, training may be slower.
diff --git a/docs/source/es/_toctree.yml b/docs/source/es/_toctree.yml
index 69334ba267e42e..80e371d308dec2 100644
--- a/docs/source/es/_toctree.yml
+++ b/docs/source/es/_toctree.yml
@@ -56,12 +56,16 @@
title: Compartir modelos personalizados
- local: run_scripts
title: Entrenamiento con scripts
+ - local: trainer
+ title: Entrenador
- local: sagemaker
title: Ejecutar el entrenamiento en Amazon SageMaker
- local: converting_tensorflow_models
title: Convertir checkpoints de TensorFlow
- local: serialization
title: Exportar a ONNX
+ - local: torchscript
+ title: Exportar a TorchScript
- local: community
title: Los recursos de la comunidad
title: Guías para desarrolladores
diff --git a/docs/source/es/torchscript.md b/docs/source/es/torchscript.md
new file mode 100644
index 00000000000000..93873fadcae800
--- /dev/null
+++ b/docs/source/es/torchscript.md
@@ -0,0 +1,167 @@
+
+
+# Exportar a TorchScript
+
+
+Este es el comienzo de nuestros experimentos con TorchScript y todavía estamos explorando sus capacidades con modelos de tamaño de entrada variable. Es un tema de interés para nosotros y profundizaremos en nuestro análisis en las próximas versiones, con más ejemplos de código, una implementación más flexible y comparativas de rendimiento entre el código basado en Python y el TorchScript compilado.
+
+
+
+De acuerdo con la documentación de TorchScript:
+
+> "TorchScript es una manera de crear modelos serializables y optimizables a partir del código PyTorch."
+
+Hay dos módulos de PyTorch, [JIT y TRACE](https://pytorch.org/docs/stable/jit.html), que permiten a los desarrolladores exportar sus modelos para ser reusados en otros programas, como los programas de C++ orientados a la eficiencia.
+
+Proporcionamos una interfaz que te permite exportar los modelos de 🤗 Transformers a TorchScript para que puedan ser reutilizados en un entorno diferente al de los programas Python basados en PyTorch. Aquí explicamos cómo exportar y usar nuestros modelos utilizando TorchScript.
+
+Exportar un modelo requiere de dos cosas:
+
+- La instanciación del modelo con la bandera TorchScript.
+- Un paso hacia adelante con entradas ficticias.
+
+Estas necesidades implican varias cosas de las que los desarrolladores deben tener cuidado, como se detalla a continuación.
+
+## Bandera TorchScript y pesos atados
+
+La bandera `torchscript` es necesaria porque la mayoría de los modelos de lenguaje de 🤗Transformers tienen pesos atados entre su `capa de incrustación` (`Embedding`) y su `capa de decodificación` (`Decoding`). TorchScript no te permite exportar modelos que tienen pesos atados, por lo que es necesario desatar y clonar los pesos de antemano.
+
+Los modelos instanciados con la bandera `torchscript` tienen su `capa de incrustación` (`Embedding`) y su `capa de decodificación` (`Decoding`) separadas, lo que significa que no deben ser entrenados más adelante. Entrenar desincronizaría las dos capas, lo que llevaría a resultados inesperados.
+
+Esto no es así para los modelos que no tienen una cabeza de modelo de lenguaje, ya que esos modelos no tienen pesos atados. Estos modelos pueden ser exportados de manera segura sin la bandera `torchscript`.
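+
+Como comprobación ilustrativa (un boceto mínimo; `BertForMaskedLM` y el punto de control `bert-base-uncased` son solo suposiciones de ejemplo), con la bandera `torchscript` las incrustaciones de entrada y de salida dejan de compartir la misma memoria:
+
+```python
+from transformers import BertForMaskedLM
+
+# With the torchscript flag, tied weights are cloned instead of shared
+traceable_model = BertForMaskedLM.from_pretrained("bert-base-uncased", torchscript=True)
+regular_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
+
+
+def embeddings_share_memory(model):
+    return model.get_input_embeddings().weight.data_ptr() == model.get_output_embeddings().weight.data_ptr()
+
+
+print(embeddings_share_memory(regular_model))  # True: tied weights share the same storage
+print(embeddings_share_memory(traceable_model))  # False: cloned weights, safe to export
+```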
+
+## Entradas ficticias y longitudes estándar
+
+Las entradas ficticias se utilizan para un paso del modelo hacia adelante. Mientras los valores de las entradas se propagan a través de las capas, PyTorch realiza un seguimiento de las diferentes operaciones ejecutadas en cada tensor. Estas operaciones registradas se utilizan luego para crear *la traza* del modelo.
+La traza se crea en relación con las dimensiones de las entradas. Por lo tanto, está limitada por las dimensiones de la entrada ficticia y no funcionará para ninguna otra longitud de secuencia o tamaño de lote. Cuando se intenta con un tamaño diferente, se genera el siguiente error:
+
+```
+`El tamaño expandido del tensor (3) debe coincidir con el tamaño existente (7) en la dimensión no singleton 2`.
+```
+
+Recomendamos trazar el modelo con un tamaño de entrada ficticio al menos tan grande como la entrada más grande con la que se alimentará al modelo durante la inferencia. El relleno puede ayudar a completar los valores faltantes. Sin embargo, dado que el modelo se traza con un tamaño de entrada más grande, las dimensiones de la matriz también serán grandes, lo que resultará en más cálculos.
+
+Ten cuidado con el número total de operaciones realizadas en cada entrada y sigue de cerca el rendimiento al exportar modelos con longitudes de secuencia variables.
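+
+Por ejemplo, un boceto mínimo (la longitud máxima de 128 es solo una suposición ilustrativa) que rellena la entrada ficticia hasta una longitud fija antes de crear la traza, de modo que cualquier entrada de inferencia pueda rellenarse a esa misma longitud:
+
+```python
+import torch
+from transformers import BertModel, BertTokenizer
+
+tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
+model.eval()
+
+# Pad the dummy input to the largest length expected at inference time
+encoded = tokenizer(
+    "Who was Jim Henson ?",
+    padding="max_length",
+    max_length=128,
+    return_tensors="pt",
+)
+
+traced_model = torch.jit.trace(model, [encoded["input_ids"], encoded["attention_mask"]])
+```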
+
+## Usando TorchScript en Python
+
+Esta sección demuestra cómo guardar y cargar modelos, así como cómo usar la traza para la inferencia.
+
+### Guardando un modelo
+
+Para exportar un `BertModel` con TorchScript, instancia `BertModel` a partir de la clase `BertConfig` y luego guárdalo en disco bajo el nombre de archivo `traced_bert.pt`:
+
+```python
+from transformers import BertModel, BertTokenizer, BertConfig
+import torch
+
+enc = BertTokenizer.from_pretrained("bert-base-uncased")
+
+# Tokenizing input text
+text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+tokenized_text = enc.tokenize(text)
+
+# Masking one of the input tokens
+masked_index = 8
+tokenized_text[masked_index] = "[MASK]"
+indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
+segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
+
+# Creating a dummy input
+tokens_tensor = torch.tensor([indexed_tokens])
+segments_tensors = torch.tensor([segments_ids])
+dummy_input = [tokens_tensor, segments_tensors]
+
+# Initializing the model with the torchscript flag
+# Flag set to True even though it is not necessary as this model does not have an LM Head.
+config = BertConfig(
+ vocab_size_or_config_json_file=32000,
+ hidden_size=768,
+ num_hidden_layers=12,
+ num_attention_heads=12,
+ intermediate_size=3072,
+ torchscript=True,
+)
+
+# Instantiating the model
+model = BertModel(config)
+
+# The model needs to be in evaluation mode
+model.eval()
+
+# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
+model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
+
+# Creating the trace
+traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
+torch.jit.save(traced_model, "traced_bert.pt")
+```
+### Cargando un modelo
+
+Ahora puedes cargar el `BertModel` guardado anteriormente, `traced_bert.pt`, desde el disco y usarlo en la entrada ficticia (`dummy_input`) previamente inicializada:
+
+```python
+loaded_model = torch.jit.load("traced_bert.pt")
+loaded_model.eval()
+
+all_encoder_layers, pooled_output = loaded_model(*dummy_input)
+```
+
+## Usando un modelo trazado para inferencia
+
+Usa el modelo trazado para la inferencia a través de su método dunder `__call__`:
+
+```python
+traced_model(tokens_tensor, segments_tensors)
+```
+## Despliega modelos TorchScript de Hugging Face en AWS con el Neuron SDK
+
+AWS introdujo la familia de instancias [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) para inferencia de aprendizaje automático de alto rendimiento y bajo costo en la nube. Las instancias Inf1 están alimentadas por el chip AWS Inferentia, un acelerador de hardware personalizado que se especializa en cargas de trabajo de inferencia de aprendizaje profundo. [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) es el SDK para Inferentia que admite el trazado y la optimización de modelos de transformers para implementación en Inf1. El SDK Neuron proporciona:
+
+1. Una API fácil de usar con un solo cambio de línea de código para trazar y optimizar un modelo TorchScript para inferencia en la nube.
+
+2. Optimizaciones de rendimiento listas para usar [para mejorar el rendimiento y el costo](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/).
+
+3. Soporte para modelos de transformers de Hugging Face construidos tanto con [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html) como con [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).
+
+### Implicaciones
+
+Los modelos transformers basados en la arquitectura [BERT (Bidirectional Encoder Representations from Transformers)](https://huggingface.co/docs/transformers/main/model_doc/bert), o sus variantes como [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) y [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta), funcionan mejor en Inf1 para tareas no generativas como la respuesta a preguntas extractivas, la clasificación de secuencias y la clasificación de tokens. Sin embargo, las tareas de generación de texto aún pueden adaptarse para ejecutarse en Inf1 según este [tutorial de AWS Neuron MarianMT](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html). Se puede encontrar más información sobre los modelos que se pueden convertir fácilmente para usar en Inferentia en la sección de [Model Architecture Fit](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia) de la documentación de Neuron.
+
+### Dependencias
+
+El uso de AWS Neuron para convertir modelos requiere un [entorno de Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide) que viene preconfigurado en [la AMI de AWS Deep Learning](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).
+
+### Convertir un modelo para AWS Neuron
+
+Convierte un modelo para AWS Neuron utilizando el mismo código de [Uso de TorchScript en Python](torchscript#using-torchscript-in-python) para trazar un `BertModel`. Importa la extensión del framework `torch.neuron` para acceder a los componentes del Neuron SDK a través de una API de Python:
+
+```python
+from transformers import BertModel, BertTokenizer, BertConfig
+import torch
+import torch.neuron
+```
+Solo necesitas modificar la siguiente línea:
+
+```diff
+- torch.jit.trace(model, [tokens_tensor, segments_tensors])
++ torch.neuron.trace(model, [tokens_tensor, segments_tensors])
+```
+
+Esto permite que el Neuron SDK trace el modelo y lo optimice para las instancias Inf1.
+
+Para obtener más información sobre las características, herramientas, tutoriales de ejemplo y últimas actualizaciones del AWS Neuron SDK, consulta [la documentación del AWS Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).
\ No newline at end of file
diff --git a/docs/source/es/trainer.md b/docs/source/es/trainer.md
new file mode 100644
index 00000000000000..9a36e3867c17e3
--- /dev/null
+++ b/docs/source/es/trainer.md
@@ -0,0 +1,409 @@
+
+
+# El Trainer
+
+El [`Trainer`] es un bucle completo de entrenamiento y evaluación para modelos de PyTorch implementado en la biblioteca Transformers. Solo necesitas pasarle las piezas necesarias para el entrenamiento (modelo, tokenizador, conjunto de datos, función de evaluación, hiperparámetros de entrenamiento, etc.), y la clase [`Trainer`] se encarga del resto. Esto facilita comenzar a entrenar más rápido sin tener que escribir manualmente tu propio bucle de entrenamiento. Pero al mismo tiempo, [`Trainer`] es muy personalizable y ofrece una gran cantidad de opciones de entrenamiento para que puedas adaptarlo a tus necesidades exactas de entrenamiento.
+
+
+
+Además de la clase [`Trainer`], Transformers también proporciona una clase [`Seq2SeqTrainer`] para tareas de secuencia a secuencia como traducción o resumen. También está la clase [`~trl.SFTTrainer`] de la biblioteca [TRL](https://hf.co/docs/trl) que envuelve la clase [`Trainer`] y está optimizada para entrenar modelos de lenguaje como Llama-2 y Mistral con técnicas autorregresivas. [`~trl.SFTTrainer`] también admite funciones como el empaquetado de secuencias, LoRA, cuantización y DeepSpeed para escalar eficientemente a cualquier tamaño de modelo.
+
+
+
+Siéntete libre de consultar [la referencia de la API](./main_classes/trainer) de estas otras clases de tipo [`Trainer`] para aprender más sobre cuándo usar cada una. En general, [`Trainer`] es la opción más versátil y es apropiada para una amplia gama de tareas. [`Seq2SeqTrainer`] está diseñado para tareas de secuencia a secuencia y [`~trl.SFTTrainer`] está diseñado para entrenar modelos de lenguaje.
+
+
+
+Antes de comenzar, asegúrate de tener instalado [Accelerate](https://hf.co/docs/accelerate), una biblioteca para habilitar y ejecutar el entrenamiento de PyTorch en entornos distribuidos.
+
+```bash
+pip install accelerate
+
+# upgrade
+pip install accelerate --upgrade
+```
+Esta guía proporciona una visión general de la clase [`Trainer`].
+
+## Uso básico
+
+[`Trainer`] incluye todo el código que encontrarías en un bucle de entrenamiento básico:
+1. Realiza un paso de entrenamiento para calcular la pérdida
+2. Calcula los gradientes con el método [`~accelerate.Accelerator.backward`]
+3. Actualiza los pesos basados en los gradientes
+4. Repite este proceso hasta alcanzar un número predeterminado de épocas
+
+La clase [`Trainer`] abstrae todo este código para que no tengas que preocuparte por escribir manualmente un bucle de entrenamiento cada vez o si estás empezando con PyTorch y el entrenamiento. Solo necesitas proporcionar los componentes esenciales requeridos para el entrenamiento, como un modelo y un conjunto de datos, y la clase [`Trainer`] maneja todo lo demás.
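+
+A modo de ilustración, un boceto mínimo con un modelo y datos de juguete (solo una suposición ilustrativa) del bucle que la clase [`Trainer`] abstrae:
+
+```py
+import torch
+from torch import nn
+from torch.utils.data import DataLoader, TensorDataset
+
+# Toy model and data, only to illustrate the basic loop
+toy_model = nn.Linear(4, 2)
+optimizer = torch.optim.AdamW(toy_model.parameters(), lr=2e-5)
+loss_fct = nn.CrossEntropyLoss()
+dataloader = DataLoader(TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,))), batch_size=4)
+
+for epoch in range(2):
+    for features, labels in dataloader:
+        loss = loss_fct(toy_model(features), labels)  # 1. training step that computes the loss
+        loss.backward()                               # 2. compute the gradients
+        optimizer.step()                              # 3. update the weights
+        optimizer.zero_grad()
+```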
+
+Si deseas especificar opciones de entrenamiento o hiperparámetros, puedes encontrarlos en la clase [`TrainingArguments`]. Por ejemplo, vamos a definir dónde guardar el modelo en `output_dir` y subir el modelo al Hub después del entrenamiento con `push_to_hub=True`.
+
+```py
+from transformers import TrainingArguments
+
+training_args = TrainingArguments(
+ output_dir="your-model",
+ learning_rate=2e-5,
+ per_device_train_batch_size=16,
+ per_device_eval_batch_size=16,
+ num_train_epochs=2,
+ weight_decay=0.01,
+ evaluation_strategy="epoch",
+ save_strategy="epoch",
+ load_best_model_at_end=True,
+ push_to_hub=True,
+)
+```
+
+Pasa `training_args` al [`Trainer`] junto con un modelo, un conjunto de datos, algo para preprocesar el conjunto de datos (dependiendo del tipo de datos puede ser un tokenizador, un extractor de características o un procesador de imágenes), un recopilador de datos y una función para calcular las métricas que deseas seguir durante el entrenamiento.
+
+Finalmente, ¡llama a [`~Trainer.train`] para comenzar el entrenamiento!
+
+```py
+from transformers import Trainer
+
+trainer = Trainer(
+ model=model,
+ args=training_args,
+ train_dataset=dataset["train"],
+ eval_dataset=dataset["test"],
+ tokenizer=tokenizer,
+ data_collator=data_collator,
+ compute_metrics=compute_metrics,
+)
+
+trainer.train()
+```
+
+### Los puntos de control
+
+La clase [`Trainer`] guarda los puntos de control del modelo en el directorio especificado en el parámetro `output_dir` de [`TrainingArguments`]. Encontrarás los puntos de control guardados en una subcarpeta `checkpoint-000`, donde los números al final corresponden al paso de entrenamiento. Guardar puntos de control es útil para reanudar el entrenamiento más tarde.
+
+```py
+# resume from latest checkpoint
+trainer.train(resume_from_checkpoint=True)
+
+# resume from specific checkpoint saved in output directory
+trainer.train(resume_from_checkpoint="your-model/checkpoint-1000")
+```
+
+Puedes guardar tus puntos de control (por defecto, el estado del optimizador no se guarda) en el Hub configurando `push_to_hub=True` en [`TrainingArguments`] para confirmar y enviarlos. Otras opciones para decidir cómo se guardan tus puntos de control están configuradas en el parámetro [`hub_strategy`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.hub_strategy):
+
+* hub_strategy="checkpoint" envía el último punto de control a una subcarpeta llamada "last-checkpoint" desde la cual puedes reanudar el entrenamiento.
+
+* hub_strategy="all_checkpoints" envía todos los puntos de control al directorio definido en `output_dir` (verás un punto de control por carpeta en tu repositorio de modelos).
+
+Cuando reanudas el entrenamiento desde un punto de control, el [`Trainer`] intenta mantener los estados de los generadores de números aleatorios (RNG) de Python, NumPy y PyTorch iguales a como estaban cuando se guardó el punto de control. Pero debido a que PyTorch tiene varias configuraciones predeterminadas no deterministas, no se garantiza que los estados de RNG sean los mismos. Si deseas habilitar un determinismo completo, echa un vistazo a la guía ["Controlling sources of randomness"](https://pytorch.org/docs/stable/notes/randomness#controlling-sources-of-randomness) para aprender qué puedes habilitar para hacer que tu entrenamiento sea completamente determinista. Sin embargo, ten en cuenta que al hacer ciertas configuraciones deterministas, el entrenamiento puede ser más lento.
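+
+Por ejemplo, un boceto mínimo (una suposición ilustrativa, no una receta completa) de ajustes que puedes habilitar antes de crear el [`Trainer`]:
+
+```py
+import torch
+from transformers import set_seed
+
+set_seed(42)  # fixes the Python, NumPy and PyTorch seeds
+torch.use_deterministic_algorithms(True)  # error out on non-deterministic ops
+torch.backends.cudnn.benchmark = False  # disable non-deterministic cuDNN autotuning
+```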
+
+## Personaliza el Trainer
+
+Si bien la clase [`Trainer`] está diseñada para ser accesible y fácil de usar, también ofrece mucha capacidad de personalización para usuarios más aventureros. Muchos de los métodos del [`Trainer`] pueden ser subclasificados y sobrescritos para admitir la funcionalidad que deseas, sin tener que reescribir todo el bucle de entrenamiento desde cero para adaptarlo. Estos métodos incluyen:
+
+* [`~Trainer.get_train_dataloader`] crea el DataLoader de entrenamiento
+* [`~Trainer.get_eval_dataloader`] crea el DataLoader de evaluación
+* [`~Trainer.get_test_dataloader`] crea el DataLoader de prueba
+* [`~Trainer.log`] registra información sobre los distintos objetos que observan el entrenamiento
+* [`~Trainer.create_optimizer_and_scheduler`] crea el optimizador y el planificador de tasa de aprendizaje si no se pasaron en `__init__`; también pueden personalizarse por separado con [`~Trainer.create_optimizer`] y [`~Trainer.create_scheduler`], respectivamente
+* [`~Trainer.compute_loss`] calcula la pérdida sobre un lote de entradas de entrenamiento
+* [`~Trainer.training_step`] realiza el paso de entrenamiento
+* [`~Trainer.prediction_step`] realiza el paso de predicción y prueba
+* [`~Trainer.evaluate`] evalúa el modelo y devuelve las métricas de evaluación
+* [`~Trainer.predict`] hace predicciones (con métricas si hay etiquetas disponibles) sobre el conjunto de prueba
+
+Por ejemplo, si deseas personalizar el método [`~Trainer.compute_loss`] para usar una pérdida ponderada en su lugar, puedes hacerlo de la siguiente manera:
+
+```py
+import torch
+from torch import nn
+from transformers import Trainer
+
+class CustomTrainer(Trainer):
+ def compute_loss(self, model, inputs, return_outputs=False):
+ labels = inputs.pop("labels")
+ # forward pass
+ outputs = model(**inputs)
+ logits = outputs.get("logits")
+ # compute custom loss for 3 labels with different weights
+ loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
+ loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
+ return (loss, outputs) if return_outputs else loss
+```
+### Callbacks
+
+Otra opción para personalizar el [`Trainer`] es utilizar [callbacks](callbacks). Los callbacks *no cambian nada* en el bucle de entrenamiento. Inspeccionan el estado del bucle de entrenamiento y luego ejecutan alguna acción (detención anticipada, registro de resultados, etc.) según el estado. En otras palabras, un callback no puede usarse para implementar algo como una función de pérdida personalizada y necesitarás subclasificar y sobrescribir el método [`~Trainer.compute_loss`] para eso.
+
+Por ejemplo, así puedes agregar un callback que detenga anticipadamente el bucle de entrenamiento después de 10 pasos:
+
+```py
+from transformers import TrainerCallback
+
+class EarlyStoppingCallback(TrainerCallback):
+ def __init__(self, num_steps=10):
+ self.num_steps = num_steps
+
+    def on_step_end(self, args, state, control, **kwargs):
+        # stop training once the specified number of steps is reached
+        if state.global_step >= self.num_steps:
+            control.should_training_stop = True
+        return control
+```
+Luego, pásalo al parámetro `callbacks` del [`Trainer`]:
+
+```py
+from transformers import Trainer
+
+trainer = Trainer(
+ model=model,
+ args=training_args,
+ train_dataset=dataset["train"],
+ eval_dataset=dataset["test"],
+ tokenizer=tokenizer,
+ data_collator=data_collator,
+ compute_metrics=compute_metrics,
+    callbacks=[EarlyStoppingCallback()],
+)
+```
+
+## Logging
+
+
+
+Comprueba la referencia de la API de [logging](./main_classes/logging) para más información sobre los diferentes niveles de logging.
+
+
+
+El [`Trainer`] está configurado en `logging.INFO` de forma predeterminada, nivel que informa errores, advertencias y otra información básica. Una réplica del [`Trainer`], en entornos distribuidos, está configurada en `logging.WARNING`, que solamente informa errores y advertencias. Puedes cambiar el nivel de logging con los parámetros [`log_level`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level) y [`log_level_replica`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level_replica) de [`TrainingArguments`].
+
+Para configurar el nivel de registro para cada nodo, usa el parámetro [`log_on_each_node`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.log_on_each_node) para determinar si deseas utilizar el nivel de registro en cada nodo o solo en el nodo principal.
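+
+Por ejemplo, este es un esbozo de cómo fijar estos parámetros directamente en [`TrainingArguments`] (el valor de `output_dir` es ilustrativo):
+
+```py
+from transformers import TrainingArguments
+
+training_args = TrainingArguments(
+    output_dir="mi-modelo",
+    log_level="warning",  # the main node reports warnings and errors
+    log_level_replica="error",  # replicas only report errors
+    log_on_each_node=False,  # only configure logging on the main node
+)
+```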
+
+
+
+[`Trainer`] establece el nivel de registro por separado para cada nodo en el método [`Trainer.__init__`], por lo que conviene configurarlo cuanto antes si utilizas otras funcionalidades de Transformers antes de crear el objeto [`Trainer`].
+
+
+
+Por ejemplo, para establecer que tu código principal y los módulos utilicen el mismo nivel de registro según cada nodo:
+
+```py
+import logging
+import sys
+
+import datasets
+import transformers
+
+logger = logging.getLogger(__name__)
+
+logging.basicConfig(
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+ datefmt="%m/%d/%Y %H:%M:%S",
+ handlers=[logging.StreamHandler(sys.stdout)],
+)
+
+log_level = training_args.get_process_log_level()
+logger.setLevel(log_level)
+datasets.utils.logging.set_verbosity(log_level)
+transformers.utils.logging.set_verbosity(log_level)
+
+trainer = Trainer(...)
+```
+
+
+
+Usa diferentes combinaciones de `log_level` y `log_level_replica` para configurar qué se registra en cada uno de los nodos.
+
+```bash
+my_app.py ... --log_level warning --log_level_replica error
+```
+
+
+
+
+Agrega el parámetro `--log_on_each_node 0` para entornos multinodo.
+
+```bash
+my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0
+
+# set to only report errors
+my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0
+```
+
+
+
+
+## NEFTune
+
+[NEFTune](https://hf.co/papers/2310.05914) es una técnica que puede mejorar el rendimiento al agregar ruido a los vectores de incrustación durante el entrenamiento. Para habilitarlo en [`Trainer`], establece el parámetro `neftune_noise_alpha` en [`TrainingArguments`] para controlar cuánto ruido se agrega.
+
+```py
+from transformers import TrainingArguments, Trainer
+
+training_args = TrainingArguments(..., neftune_noise_alpha=0.1)
+trainer = Trainer(..., args=training_args)
+```
+
+NEFTune se desactiva después del entrenamiento para restaurar la capa de incrustación original y evitar cualquier comportamiento inesperado.
+
+## Accelerate y Trainer
+
+La clase [`Trainer`] está impulsada por [Accelerate](https://hf.co/docs/accelerate), una biblioteca para entrenar fácilmente modelos de PyTorch en entornos distribuidos con soporte para integraciones como [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) y [DeepSpeed](https://www.deepspeed.ai/).
+
+
+
+Aprende más sobre las estrategias de fragmentación de FSDP, la descarga a CPU y más con el [`Trainer`] en la guía de [Fully Sharded Data Parallel](fsdp).
+
+
+
+Para usar Accelerate con el [`Trainer`], ejecuta el comando [`accelerate config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) para configurar el entrenamiento según tu entorno. Este comando crea un `config_file.yaml` que se utilizará cuando lances tu script de entrenamiento. Por ejemplo, algunas de las configuraciones distribuidas que puedes definir son:
+
+
+
+
+```yml
+compute_environment: LOCAL_MACHINE
+distributed_type: MULTI_GPU
+downcast_bf16: 'no'
+gpu_ids: all
+machine_rank: 0 #change rank as per the node
+main_process_ip: 192.168.20.1
+main_process_port: 9898
+main_training_function: main
+mixed_precision: fp16
+num_machines: 2
+num_processes: 8
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+
+
+```yml
+compute_environment: LOCAL_MACHINE
+distributed_type: FSDP
+downcast_bf16: 'no'
+fsdp_config:
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+ fsdp_backward_prefetch_policy: BACKWARD_PRE
+ fsdp_forward_prefetch: true
+ fsdp_offload_params: false
+ fsdp_sharding_strategy: 1
+ fsdp_state_dict_type: FULL_STATE_DICT
+ fsdp_sync_module_states: true
+ fsdp_transformer_layer_cls_to_wrap: BertLayer
+ fsdp_use_orig_params: true
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+
+
+```yml
+compute_environment: LOCAL_MACHINE
+deepspeed_config:
+ deepspeed_config_file: /home/user/configs/ds_zero3_config.json
+ zero3_init_flag: true
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+machine_rank: 0
+main_training_function: main
+num_machines: 1
+num_processes: 4
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+
+
+```yml
+compute_environment: LOCAL_MACHINE
+deepspeed_config:
+ gradient_accumulation_steps: 1
+ gradient_clipping: 0.7
+ offload_optimizer_device: cpu
+ offload_param_device: cpu
+ zero3_init_flag: true
+ zero_stage: 2
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 4
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+
+```
+
+
+
+
+El comando [`accelerate launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) es la forma recomendada de lanzar tu script de entrenamiento en un sistema distribuido con Accelerate y el [`Trainer`], con los parámetros especificados en `config_file.yaml`. Este archivo se guarda en la carpeta de caché de Accelerate y se carga automáticamente cuando ejecutas `accelerate launch`.
+
+Por ejemplo, para ejecutar el script de entrenamiento [`run_glue.py`](https://github.com/huggingface/transformers/blob/f4db565b695582891e43a5e042e5d318e28f20b8/examples/pytorch/text-classification/run_glue.py#L4) con la configuración de FSDP:
+
+```bash
+accelerate launch \
+ ./examples/pytorch/text-classification/run_glue.py \
+ --model_name_or_path bert-base-cased \
+ --task_name $TASK_NAME \
+ --do_train \
+ --do_eval \
+ --max_seq_length 128 \
+ --per_device_train_batch_size 16 \
+ --learning_rate 5e-5 \
+ --num_train_epochs 3 \
+ --output_dir /tmp/$TASK_NAME/ \
+ --overwrite_output_dir
+```
+
+También puedes especificar los parámetros del archivo `config_file.yaml` directamente en la línea de comandos:
+
+```bash
+accelerate launch --num_processes=2 \
+ --use_fsdp \
+ --mixed_precision=bf16 \
+ --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
+ --fsdp_transformer_layer_cls_to_wrap="BertLayer" \
+ --fsdp_sharding_strategy=1 \
+ --fsdp_state_dict_type=FULL_STATE_DICT \
+    ./examples/pytorch/text-classification/run_glue.py \
+ --model_name_or_path bert-base-cased \
+ --task_name $TASK_NAME \
+ --do_train \
+ --do_eval \
+ --max_seq_length 128 \
+ --per_device_train_batch_size 16 \
+ --learning_rate 5e-5 \
+ --num_train_epochs 3 \
+ --output_dir /tmp/$TASK_NAME/ \
+ --overwrite_output_dir
+```
+
+Consulta el tutorial [Lanzamiento de tus scripts con Accelerate](https://huggingface.co/docs/accelerate/basic_tutorials/launch) para obtener más información sobre `accelerate launch` y las configuraciones personalizadas.
\ No newline at end of file
From 81c8191b4651de216c00e25e1af607683e980614 Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Tue, 5 Mar 2024 02:29:19 +0100
Subject: [PATCH 078/549] FIX [`Generation`] Fix some issues when running the
MaxLength criteria on CPU (#29317)
fix the bitwise or issue
---
src/transformers/generation/stopping_criteria.py | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/transformers/generation/stopping_criteria.py b/src/transformers/generation/stopping_criteria.py
index 8516c6157250d4..f4624296d237f7 100644
--- a/src/transformers/generation/stopping_criteria.py
+++ b/src/transformers/generation/stopping_criteria.py
@@ -73,7 +73,7 @@ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwa
f"maximum length ({self.max_position_embeddings}). Depending on the model, you may observe "
"exceptions, performance degradation, or nothing at all."
)
- return torch.full((input_ids.shape[0],), is_done, device=input_ids.device)
+ return torch.full((input_ids.shape[0],), is_done, device=input_ids.device, dtype=torch.bool)
class MaxNewTokensCriteria(StoppingCriteria):
@@ -103,7 +103,7 @@ def __init__(self, start_length: int, max_new_tokens: int):
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
is_done = input_ids.shape[-1] >= self.max_length
- return torch.full((input_ids.shape[0],), is_done, device=input_ids.device)
+ return torch.full((input_ids.shape[0],), is_done, device=input_ids.device, dtype=torch.bool)
class MaxTimeCriteria(StoppingCriteria):
@@ -126,7 +126,7 @@ def __init__(self, max_time: float, initial_timestamp: Optional[float] = None):
@add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING)
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
is_done = time.time() - self.initial_timestamp > self.max_time
- return torch.full((input_ids.shape[0],), is_done, device=input_ids.device)
+ return torch.full((input_ids.shape[0],), is_done, device=input_ids.device, dtype=torch.bool)
class StoppingCriteriaList(list):
From 4fc708f98c9c8d5cb48e8a2639e3f7a21c65802f Mon Sep 17 00:00:00 2001
From: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date: Tue, 5 Mar 2024 03:22:48 +0100
Subject: [PATCH 079/549] Exllama kernels support for AWQ models (#28634)
* added exllama kernels support for awq models
* doc
* style
* Update src/transformers/modeling_utils.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* refactor
* moved exllama post init to after device dispatching
* bump autoawq version
* added exllama test
* style
* configurable exllama kernels
* copy exllama_config from gptq
* moved exllama version check to post init
* moved to quantization dockerfile
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
---
.../Dockerfile | 2 +-
src/transformers/integrations/__init__.py | 12 +++-
src/transformers/integrations/awq.py | 59 +++++++++++++++++--
src/transformers/quantizers/quantizer_awq.py | 11 ++++
src/transformers/utils/quantization_config.py | 42 +++++++++++--
tests/quantization/autoawq/test_awq.py | 14 +++++
6 files changed, 127 insertions(+), 13 deletions(-)
diff --git a/docker/transformers-quantization-latest-gpu/Dockerfile b/docker/transformers-quantization-latest-gpu/Dockerfile
index 66bdcc42bae9fd..1eadb1505b1a07 100644
--- a/docker/transformers-quantization-latest-gpu/Dockerfile
+++ b/docker/transformers-quantization-latest-gpu/Dockerfile
@@ -43,7 +43,7 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/opt
RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
# Add autoawq for quantization testing
-RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl
+RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp38-cp38-linux_x86_64.whl
# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
diff --git a/src/transformers/integrations/__init__.py b/src/transformers/integrations/__init__.py
index bded6b3984a59c..200607b0d5a55b 100644
--- a/src/transformers/integrations/__init__.py
+++ b/src/transformers/integrations/__init__.py
@@ -18,7 +18,11 @@
_import_structure = {
"aqlm": ["replace_with_aqlm_linear"],
- "awq": ["fuse_awq_modules", "replace_with_awq_linear"],
+ "awq": [
+ "fuse_awq_modules",
+ "post_init_awq_exllama_modules",
+ "replace_with_awq_linear",
+ ],
"bitsandbytes": [
"get_keys_to_not_convert",
"replace_8bit_linear",
@@ -82,7 +86,11 @@
if TYPE_CHECKING:
from .aqlm import replace_with_aqlm_linear
- from .awq import fuse_awq_modules, replace_with_awq_linear
+ from .awq import (
+ fuse_awq_modules,
+ post_init_awq_exllama_modules,
+ replace_with_awq_linear,
+ )
from .bitsandbytes import (
get_keys_to_not_convert,
replace_8bit_linear,
diff --git a/src/transformers/integrations/awq.py b/src/transformers/integrations/awq.py
index dd8578ef606d38..3f9f0d1d216f1c 100644
--- a/src/transformers/integrations/awq.py
+++ b/src/transformers/integrations/awq.py
@@ -15,7 +15,12 @@
from ..activations import ACT2FN
from ..modeling_utils import PreTrainedModel
from ..utils import is_auto_awq_available, is_torch_available
-from ..utils.quantization_config import AwqBackendPackingMethod, AwqConfig, AWQLinearVersion
+from ..utils.quantization_config import (
+ AwqBackendPackingMethod,
+ AwqConfig,
+ AWQLinearVersion,
+ ExllamaVersion,
+)
if is_torch_available():
@@ -91,13 +96,30 @@ def replace_with_awq_linear(
)
if backend == AwqBackendPackingMethod.AUTOAWQ:
- from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV
- elif backend == AwqBackendPackingMethod.LLMAWQ:
- from awq.quantize.qmodule import WQLinear
+ if quantization_config.version == AWQLinearVersion.GEMM:
+ from awq.modules.linear.gemm import WQLinear_GEMM
- if backend == AwqBackendPackingMethod.AUTOAWQ:
- target_cls = WQLinear_GEMM if quantization_config.version == AWQLinearVersion.GEMM else WQLinear_GEMV
+ target_cls = WQLinear_GEMM
+ elif quantization_config.version == AWQLinearVersion.GEMV:
+ from awq.modules.linear.gemv import WQLinear_GEMV
+
+ target_cls = WQLinear_GEMV
+ elif quantization_config.version == AWQLinearVersion.EXLLAMA:
+ if quantization_config.exllama_config["version"] == ExllamaVersion.ONE:
+ from awq.modules.linear.exllama import WQLinear_Exllama
+
+ target_cls = WQLinear_Exllama
+ elif quantization_config.exllama_config["version"] == ExllamaVersion.TWO:
+ from awq.modules.linear.exllamav2 import WQLinear_ExllamaV2
+
+ target_cls = WQLinear_ExllamaV2
+ else:
+ raise ValueError(f"Unrecognized Exllama version: {quantization_config.exllama_config['version']}")
+ else:
+ raise ValueError(f"Unrecognized AWQ version: {quantization_config.version}")
else:
+ from awq.quantize.qmodule import WQLinear
+
target_cls = WQLinear
for name, module in model.named_children():
@@ -372,3 +394,28 @@ def _fuse_awq_attention_layers(model, module, modules_to_fuse, current_module_na
setattr(parent, child_name, fused_attention_layer.to(previous_device))
del q_proj, k_proj, v_proj, o_proj
+
+
+def post_init_awq_exllama_modules(model, exllama_config):
+ """
+ Runs post init for Exllama layers which performs:
+ - Weights unpacking, reordering and repacking
+ - Devices scratch space allocation
+ """
+
+ if exllama_config["version"] == ExllamaVersion.ONE:
+ from awq.modules.linear.exllama import exllama_post_init
+
+ model = exllama_post_init(model)
+ elif exllama_config["version"] == ExllamaVersion.TWO:
+ from awq.modules.linear.exllamav2 import exllamav2_post_init
+
+ model = exllamav2_post_init(
+ model,
+ max_input_len=exllama_config["max_input_len"],
+ max_batch_size=exllama_config["max_batch_size"],
+ )
+ else:
+ raise ValueError(f"Unrecognized Exllama version: {exllama_config['version']}")
+
+ return model
diff --git a/src/transformers/quantizers/quantizer_awq.py b/src/transformers/quantizers/quantizer_awq.py
index 08342df175f7ca..5e66f9baf1c0a7 100644
--- a/src/transformers/quantizers/quantizer_awq.py
+++ b/src/transformers/quantizers/quantizer_awq.py
@@ -23,6 +23,7 @@
from ..modeling_utils import PreTrainedModel
from ..utils import is_accelerate_available, is_auto_awq_available, is_torch_available, logging
+from ..utils.quantization_config import AWQLinearVersion
if is_torch_available():
@@ -98,12 +99,22 @@ def _process_model_after_weight_loading(self, model):
model = fuse_awq_modules(model, self.quantization_config)
model._awq_is_fused = True # TODO: consider storing this flag in model.config instead
+ if self.quantization_config.version == AWQLinearVersion.EXLLAMA:
+ from ..integrations import post_init_awq_exllama_modules
+
+ model = post_init_awq_exllama_modules(model, self.quantization_config.exllama_config)
+
@property
def is_serializable(self):
# AWQ through auto-awq has been always serializable, except if the model is fused.
if self.quantization_config.do_fuse:
logger.warning("You cannot save an AWQ model that uses fused modules!")
return False
+
+ if self.quantization_config.version == AWQLinearVersion.EXLLAMA:
+ logger.warning("You cannot save an AWQ model that uses Exllama backend!")
+ return False
+
return True
@property
diff --git a/src/transformers/utils/quantization_config.py b/src/transformers/utils/quantization_config.py
index bcf31ebfaba0e4..a29886d8c68961 100644
--- a/src/transformers/utils/quantization_config.py
+++ b/src/transformers/utils/quantization_config.py
@@ -44,6 +44,7 @@ class QuantizationMethod(str, Enum):
class AWQLinearVersion(str, Enum):
GEMM = "gemm"
GEMV = "gemv"
+ EXLLAMA = "exllama"
@staticmethod
def from_str(version: str):
@@ -52,6 +53,8 @@ def from_str(version: str):
return AWQLinearVersion.GEMM
elif version == "gemv":
return AWQLinearVersion.GEMV
+ elif version == "exllama":
+ return AWQLinearVersion.EXLLAMA
else:
raise ValueError(f"Unknown AWQLinearVersion {version}")
@@ -606,7 +609,7 @@ class AwqConfig(QuantizationConfigMixin):
Whether to use zero point quantization.
version (`AWQLinearVersion`, *optional*, defaults to `AWQLinearVersion.GEMM`):
The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
- GEMV is better (e.g. < 8 )
+ GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.AUTOAWQ`):
The quantization backend. Some models might be quantized using `llm-awq` backend. This is useful for users
that quantize their own models using `llm-awq` library.
@@ -620,6 +623,10 @@ class AwqConfig(QuantizationConfigMixin):
The list of modules to not quantize, useful for quantizing models that explicitly require to have
some modules left in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers).
Note you cannot quantize directly with transformers, please refer to `AutoAWQ` documentation for quantizing HF models.
+ exllama_config (`Dict[str, Any]`, *optional*):
+ You can specify the version of the exllama kernel through the `version` key, the maximum sequence
+ length through the `max_input_len` key, and the maximum batch size through the `max_batch_size` key.
+ Defaults to `{"version": 2, "max_input_len": 2048, "max_batch_size": 8}` if unset.
"""
def __init__(
@@ -633,6 +640,7 @@ def __init__(
fuse_max_seq_len: Optional[int] = None,
modules_to_fuse: Optional[dict] = None,
modules_to_not_convert: Optional[List] = None,
+ exllama_config: Optional[Dict[str, int]] = None,
**kwargs,
):
self.quant_method = QuantizationMethod.AWQ
@@ -644,6 +652,7 @@ def __init__(
self.backend = backend
self.fuse_max_seq_len = fuse_max_seq_len
self.modules_to_not_convert = modules_to_not_convert
+ self.exllama_config = exllama_config
self.modules_to_fuse = modules_to_fuse
if do_fuse is None:
@@ -667,9 +676,9 @@ def post_init(self):
)
self.version = AWQLinearVersion.from_str(self.version)
- if self.version not in [AWQLinearVersion.GEMM, AWQLinearVersion.GEMV]:
+ if self.version not in [AWQLinearVersion.GEMM, AWQLinearVersion.GEMV, AWQLinearVersion.EXLLAMA]:
raise ValueError(
- f"Only supported versions are in [AWQLinearVersion.GEMM, AWQLinearVersion.GEMV] - not recognized version {self.version}"
+ f"Only supported versions are in [AWQLinearVersion.GEMM, AWQLinearVersion.GEMV, AWQLinearVersion.EXLLAMA] - not recognized version {self.version}"
)
if self.backend == AwqBackendPackingMethod.LLMAWQ:
@@ -724,9 +733,34 @@ def post_init(self):
f"Required fields are missing in the fusing mapping, required fields are {required_keys}"
)
+ if self.version == AWQLinearVersion.EXLLAMA:
+ awq_version_supports_exllama = False
+ MIN_AWQ_VERSION = "0.2.0"
+ if is_auto_awq_available():
+ awq_version_supports_exllama = version.parse(importlib.metadata.version("autoawq")) >= version.parse(
+ MIN_AWQ_VERSION
+ )
+
+ if not awq_version_supports_exllama:
+ raise ValueError(
+                    f"Your current version of `autoawq` does not support exllama backend, "
+ f"please upgrade `autoawq` package to at least {MIN_AWQ_VERSION}."
+ )
+
+ if self.exllama_config is None:
+ self.exllama_config = {"version": ExllamaVersion.TWO, "max_input_len": 2048, "max_batch_size": 8}
+ else:
+ if "version" not in self.exllama_config:
+ raise ValueError("`exllama_config` needs to have a `version` key.")
+ elif self.exllama_config["version"] not in [ExllamaVersion.ONE, ExllamaVersion.TWO]:
+ exllama_version = self.exllama_config["version"]
+ raise ValueError(
+ f"Only supported versions are in [ExllamaVersion.ONE, ExllamaVersion.TWO] - not recognized version {exllama_version}"
+ )
+
def get_loading_attributes(self):
attibutes_dict = copy.deepcopy(self.__dict__)
- loading_attibutes = ["do_fuse", "modules_to_fuse", "fuse_max_seq_len"]
+ loading_attibutes = ["version", "do_fuse", "modules_to_fuse", "fuse_max_seq_len"]
loading_attibutes_dict = {i: j for i, j in attibutes_dict.items() if i in loading_attibutes}
return loading_attibutes_dict
diff --git a/tests/quantization/autoawq/test_awq.py b/tests/quantization/autoawq/test_awq.py
index a2dbd904a59bb6..8ed8c394f424c5 100644
--- a/tests/quantization/autoawq/test_awq.py
+++ b/tests/quantization/autoawq/test_awq.py
@@ -192,6 +192,20 @@ def test_quantized_model_bf16(self):
output = quantized_model.generate(**input_ids, max_new_tokens=40)
self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT_BF16)
+ def test_quantized_model_exllama(self):
+ """
+ Simple test that checks if the quantized model is working properly with exllama backend
+ """
+ input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
+
+ quantization_config = AwqConfig(version="exllama")
+ quantized_model = AutoModelForCausalLM.from_pretrained(
+ self.model_name, quantization_config=quantization_config
+ ).to(torch_device)
+
+ output = quantized_model.generate(**input_ids, max_new_tokens=40)
+ self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
+
def test_quantized_model_no_device_map(self):
"""
Simple test that checks if the quantized model is working properly
From bd891aed01cfcdd96565d74b0a9de6da36456289 Mon Sep 17 00:00:00 2001
From: Raushan Turganbay
Date: Tue, 5 Mar 2024 12:18:22 +0500
Subject: [PATCH 080/549] Fix max length for BLIP generation (#29296)
* fix mal_length for blip
* update also min length
* fixes
* add a comment
* Update src/transformers/models/instructblip/modeling_instructblip.py
Co-authored-by: Joao Gante
* Update src/transformers/models/blip_2/modeling_blip_2.py
Co-authored-by: Joao Gante
* make fixup
* fix length when user passed
* remove else
* remove brackets
---------
Co-authored-by: Joao Gante
---
src/transformers/generation/utils.py | 1 +
src/transformers/models/blip_2/modeling_blip_2.py | 5 +++++
.../models/instructblip/modeling_instructblip.py | 5 +++++
3 files changed, 11 insertions(+)
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 5b7d18e06c1d10..4bfae470cf720f 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -1453,6 +1453,7 @@ def generate(
and not self.config.is_encoder_decoder
):
generation_config.max_length -= inputs_tensor.shape[1]
+ generation_config.min_length = max(generation_config.min_length - inputs_tensor.shape[1], 0)
if generation_config.cache_implementation in NEED_SETUP_CACHE_CLASSES_MAPPING:
if generation_config.cache_implementation == "static":
diff --git a/src/transformers/models/blip_2/modeling_blip_2.py b/src/transformers/models/blip_2/modeling_blip_2.py
index 00433f3ea349ac..3e63fac66fd843 100644
--- a/src/transformers/models/blip_2/modeling_blip_2.py
+++ b/src/transformers/models/blip_2/modeling_blip_2.py
@@ -1827,6 +1827,11 @@ def generate(
inputs_embeds = self.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([language_model_inputs, inputs_embeds.to(language_model_inputs.device)], dim=1)
+        # add image_embeds length to max_length, so that the final max_length is counted only on token embeds
+ if not self.language_model.config.is_encoder_decoder:
+ generate_kwargs["max_length"] = generate_kwargs.get("max_length", 20) + language_model_inputs.shape[1]
+ generate_kwargs["min_length"] = generate_kwargs.get("min_length", 0) + language_model_inputs.shape[1]
+
outputs = self.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
diff --git a/src/transformers/models/instructblip/modeling_instructblip.py b/src/transformers/models/instructblip/modeling_instructblip.py
index e175cd57285aab..da0b02551ff804 100644
--- a/src/transformers/models/instructblip/modeling_instructblip.py
+++ b/src/transformers/models/instructblip/modeling_instructblip.py
@@ -1537,6 +1537,11 @@ def generate(
inputs_embeds = self.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([language_model_inputs, inputs_embeds.to(language_model_inputs.device)], dim=1)
+        # add image_embeds length to max_length, so that the final max_length is counted only on token embeds
+ if not self.language_model.config.is_encoder_decoder:
+ generate_kwargs["max_length"] = generate_kwargs.get("max_length", 20) + language_model_inputs.shape[1]
+ generate_kwargs["min_length"] = generate_kwargs.get("min_length", 0) + language_model_inputs.shape[1]
+
outputs = self.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
From ebccb09169f42ef0d1e508b47d1ca8227ab4b019 Mon Sep 17 00:00:00 2001
From: Joshua Lochner
Date: Tue, 5 Mar 2024 09:57:33 +0200
Subject: [PATCH 081/549] [docs] Update starcoder2 paper link (#29418)
Update starcoder2 paper link
---
README.md | 2 +-
README_es.md | 2 +-
README_fr.md | 2 +-
README_hd.md | 2 +-
README_ja.md | 2 +-
README_ko.md | 2 +-
README_zh-hans.md | 2 +-
README_zh-hant.md | 2 +-
8 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/README.md b/README.md
index 30f7cd08a77643..868547c42f584e 100644
--- a/README.md
+++ b/README.md
@@ -493,7 +493,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/README_es.md b/README_es.md
index 6e808e0e2b1cf1..0c33e07440b69e 100644
--- a/README_es.md
+++ b/README_es.md
@@ -466,7 +466,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/README_fr.md b/README_fr.md
index 3bd57830076a5f..26247139595e86 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -487,7 +487,7 @@ Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=h
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (de l'Université de Tel Aviv), publié dans l'article [Réponse à quelques questions avec peu d'exemples par la pré-sélection des spans](https://arxiv.org/abs/2101.00438) par Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (de Berkeley) a été publié dans l'article [SqueezeBERT : Que l'apprentissage automatique peut-il apprendre au traitement du langage naturel sur les réseaux neuronaux efficaces ?](https://arxiv.org/abs/2006.11316) par Forrest N. Iandola, Albert E. Shaw, Ravi Krishna et Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (de MBZUAI) a été publié dans l'article [SwiftFormer : Attention additive efficace pour les applications de vision mobile en temps réel basées sur des transformateurs](https://arxiv.org/abs/2303.15446) par Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (de Microsoft) a été publié dans l'article [Swin Transformer : Transformateur hiérarchique de la vision utilisant des fenêtres décalées](https://arxiv.org/abs/2103.14030) par Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (de Microsoft) a été publié dans l'article [Swin Transformer V2 : Augmentation de la capacité et de la résolution](https://arxiv.org/abs/2111.09883) par Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
diff --git a/README_hd.md b/README_hd.md
index 0353eb4d8fbda6..cdb126cd23e627 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -440,7 +440,7 @@ conda install conda-forge::transformers
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (तेल अवीव यूनिवर्सिटी से) साथ में पेपर [स्पैन सिलेक्शन को प्री-ट्रेनिंग करके कुछ-शॉट क्वेश्चन आंसरिंग](https://arxiv.org/abs/2101.00438) ओरि राम, युवल कर्स्टन, जोनाथन बेरेंट, अमीर ग्लोबर्सन, ओमर लेवी द्वारा।
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (बर्कले से) कागज के साथ [SqueezeBERT: कुशल तंत्रिका नेटवर्क के बारे में NLP को कंप्यूटर विज़न क्या सिखा सकता है?](https://arxiv.org/abs/2006.11316) फॉरेस्ट एन. इनडोला, अल्बर्ट ई. शॉ, रवि कृष्णा, और कर्ट डब्ल्यू. केटज़र द्वारा।
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI से) Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. द्वाराअनुसंधान पत्र [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) के साथ जारी किया गया
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (माइक्रोसॉफ्ट से) साथ में कागज [स्वाइन ट्रांसफॉर्मर: शिफ्टेड विंडोज का उपयोग कर पदानुक्रमित विजन ट्रांसफॉर्मर](https://arxiv.org/abs/2103.14030) ज़ी लियू, युटोंग लिन, यू काओ, हान हू, यिक्सुआन वेई, झेंग झांग, स्टीफन लिन, बैनिंग गुओ द्वारा।
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft से) साथ वाला पेपर [Swin Transformer V2: स्केलिंग अप कैपेसिटी एंड रेजोल्यूशन](https://arxiv.org/abs/2111.09883) ज़ी लियू, हान हू, युटोंग लिन, ज़ुलिआंग याओ, ज़ेंडा ज़ी, यिक्सुआन वेई, जिया निंग, यू काओ, झेंग झांग, ली डोंग, फुरु वेई, बैनिंग गुओ द्वारा।
diff --git a/README_ja.md b/README_ja.md
index 599865ab5a7d49..52a147b54da753 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -500,7 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (Tel Aviv University から), Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy から公開された研究論文: [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438)
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (Berkeley から) Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer から公開された研究論文: [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316)
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI から) Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. から公開された研究論文 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446)
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (Microsoft から) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo から公開された研究論文: [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft から) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo から公開された研究論文: [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883)
diff --git a/README_ko.md b/README_ko.md
index e48159c7999339..787e8e17008abe 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -415,7 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (Tel Aviv University 에서) Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy 의 [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 논문과 함께 발표했습니다.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (Berkeley 에서) Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 의 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 논문과 함께 발표했습니다.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI 에서 제공)은 Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.의 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446)논문과 함께 발표했습니다.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (Microsoft 에서) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 의 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 논문과 함께 발표했습니다.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft 에서) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 의 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index a9e1997da38c83..9b34fcf450cbe6 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -439,7 +439,7 @@ conda install conda-forge::transformers
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (来自 Tel Aviv University) 伴随论文 [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 由 Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy 发布。
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (来自 Berkeley) 伴随论文 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 由 Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 发布。
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (来自 MBZUAI) 伴随论文 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) 由 Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan 发布。
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (来自 Microsoft) 伴随论文 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 由 Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 发布。
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (来自 Microsoft) 伴随论文 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 由 Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index 2c724f309ef304..e4a888205a655f 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -451,7 +451,7 @@ conda install conda-forge::transformers
1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University) released with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
-1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with a coming soon paper.
+1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
From fa7f3cf336eb5d93cfaa7611723c299e7851fb02 Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Tue, 5 Mar 2024 16:16:05 +0800
Subject: [PATCH 082/549] [tests] enable test_pipeline_accelerate_top_p on XPU
(#29309)
* use torch_device
* Update tests/pipelines/test_pipelines_text_generation.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix style
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
tests/pipelines/test_pipelines_text_generation.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tests/pipelines/test_pipelines_text_generation.py b/tests/pipelines/test_pipelines_text_generation.py
index 766f2a462a1930..ada04c7dbeda64 100644
--- a/tests/pipelines/test_pipelines_text_generation.py
+++ b/tests/pipelines/test_pipelines_text_generation.py
@@ -450,7 +450,9 @@ def test_small_model_fp16(self):
def test_pipeline_accelerate_top_p(self):
import torch
- pipe = pipeline(model="hf-internal-testing/tiny-random-bloom", device_map="auto", torch_dtype=torch.float16)
+ pipe = pipeline(
+ model="hf-internal-testing/tiny-random-bloom", device_map=torch_device, torch_dtype=torch.float16
+ )
pipe("This is a test", do_sample=True, top_p=0.5)
def test_pipeline_length_setting_warning(self):
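A minimal standalone sketch of what the updated test exercises, assuming only that an accelerator may or may not be present: in the test, `torch_device` resolves to the active test device (CUDA, XPU, or CPU), so the sketch picks a device explicitly instead and falls back to float32 on CPU.

```python
# Sketch: running a text-generation pipeline on an explicit device with a reduced dtype.
# The tiny checkpoint is the one used by the test above.
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"          # stand-in for torch_device
dtype = torch.float16 if device != "cpu" else torch.float32      # fp16 only on the accelerator

pipe = pipeline(
    model="hf-internal-testing/tiny-random-bloom",
    device_map=device,
    torch_dtype=dtype,
)
print(pipe("This is a test", do_sample=True, top_p=0.5))
```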
From 132852203a02e320049457316a63cffb64968aa1 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Tue, 5 Mar 2024 09:42:52 +0100
Subject: [PATCH 083/549] [`UdopTokenizer`] Fix post merge imports (#29451)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* update
* ...
* nits
* arf
* 🧼
* beat the last guy
* style everyone
---
.../models/udop/tokenization_udop.py | 7 -------
.../models/udop/tokenization_udop_fast.py | 17 +++++++++++------
tests/models/udop/test_tokenization_udop.py | 6 +++++-
3 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/src/transformers/models/udop/tokenization_udop.py b/src/transformers/models/udop/tokenization_udop.py
index 10e92db48cebba..c3b270bc55a8bf 100644
--- a/src/transformers/models/udop/tokenization_udop.py
+++ b/src/transformers/models/udop/tokenization_udop.py
@@ -157,12 +157,6 @@
}
-# TODO(PVP) - this should be removed in Transformers v5
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
- "microsoft/udop-large": 512,
-}
-
-
class UdopTokenizer(PreTrainedTokenizer):
"""
Adapted from [`LayoutXLMTokenizer`] and [`T5Tokenizer`]. Based on
@@ -256,7 +250,6 @@ class UdopTokenizer(PreTrainedTokenizer):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
- max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
def __init__(
diff --git a/src/transformers/models/udop/tokenization_udop_fast.py b/src/transformers/models/udop/tokenization_udop_fast.py
index ee0697595508a7..cce527a80537d9 100644
--- a/src/transformers/models/udop/tokenization_udop_fast.py
+++ b/src/transformers/models/udop/tokenization_udop_fast.py
@@ -29,11 +29,6 @@
)
from ...tokenization_utils_fast import PreTrainedTokenizerFast
from ...utils import PaddingStrategy, TensorType, add_end_docstrings, is_sentencepiece_available, logging
-from ..udop.tokenization_udop import (
- PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES,
- PRETRAINED_VOCAB_FILES_MAP,
- VOCAB_FILES_NAMES,
-)
if is_sentencepiece_available():
@@ -42,6 +37,17 @@
UdopTokenizer = None
+VOCAB_FILES_NAMES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+ "vocab_file": {
+ "microsoft/udop-large": "https://huggingface.co/microsoft/udop-large/resolve/main/spiece.model",
+ },
+ "tokenizer_file": {
+ "microsoft/udop-large": "https://huggingface.co/microsoft/udop-large/resolve/main/tokenizer.json",
+ },
+}
+
logger = logging.get_logger(__name__)
UDOP_ENCODE_KWARGS_DOCSTRING = r"""
@@ -197,7 +203,6 @@ class UdopTokenizerFast(PreTrainedTokenizerFast):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
- max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
slow_tokenizer_class = UdopTokenizer
diff --git a/tests/models/udop/test_tokenization_udop.py b/tests/models/udop/test_tokenization_udop.py
index e9d41c5b77a872..cc9a2f285207af 100644
--- a/tests/models/udop/test_tokenization_udop.py
+++ b/tests/models/udop/test_tokenization_udop.py
@@ -22,12 +22,12 @@
from transformers import (
AddedToken,
SpecialTokensMixin,
+ UdopTokenizer,
UdopTokenizerFast,
is_tf_available,
is_torch_available,
logging,
)
-from transformers.models.udop.tokenization_udop import UdopTokenizer
from transformers.testing_utils import (
get_tests_dir,
is_pt_tf_cross_test,
@@ -1717,6 +1717,10 @@ def test_batch_encode_dynamic_overflowing(self):
def test_alignement_methods(self):
pass
+ @unittest.skip("#TODO will be removed in main")
+ def test_pretrained_model_lists(self):
+ pass
+
@unittest.skip("UDOP tokenizer requires boxes besides sequences.")
def test_maximum_encoding_length_pair_input(self):
pass
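A hedged sketch of loading the tokenizers whose vocab-file maps are re-declared in `tokenization_udop_fast.py` above. The call pattern with `boxes` follows the LayoutXLM-style API that `UdopTokenizer` is adapted from, and `microsoft/udop-large` is the checkpoint referenced in those maps.

```python
# Sketch: loading the slow and fast UDOP tokenizers and encoding words with their
# bounding boxes (x0, y0, x1, y1), which UDOP requires alongside the text.
from transformers import UdopTokenizer, UdopTokenizerFast

slow_tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")
fast_tokenizer = UdopTokenizerFast.from_pretrained("microsoft/udop-large")

words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]
encoding = fast_tokenizer(words, boxes=boxes, return_tensors="pt")
print(encoding.input_ids.shape)
```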
From 0d52f9f582efb82a12e8d9162b43a01b1aa0200f Mon Sep 17 00:00:00 2001
From: Arthur Zucker
Date: Tue, 5 Mar 2024 18:27:25 +0900
Subject: [PATCH 084/549] more fix
---
tests/models/udop/test_processor_udop.py | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tests/models/udop/test_processor_udop.py b/tests/models/udop/test_processor_udop.py
index 05855991b185ea..5e913f8ec6da97 100644
--- a/tests/models/udop/test_processor_udop.py
+++ b/tests/models/udop/test_processor_udop.py
@@ -22,7 +22,6 @@
import numpy as np
from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast
-from transformers.models.udop import UdopTokenizer, UdopTokenizerFast
from transformers.testing_utils import (
require_pytesseract,
require_sentencepiece,
@@ -35,11 +34,11 @@
if is_torch_available():
import torch
+ from transformers import UdopTokenizer, UdopTokenizerFast
if is_pytesseract_available():
from PIL import Image
-
from transformers import LayoutLMv3ImageProcessor, UdopProcessor
From 57d007b912604ac3fac4dd60cd1e6f43908fba08 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Tue, 5 Mar 2024 10:39:42 +0100
Subject: [PATCH 085/549] Revert-commit
0d52f9f582efb82a12e8d9162b43a01b1aa0200f (#29455)
* style
* revert with RP
* nit
* exact revert
---
tests/models/udop/test_processor_udop.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tests/models/udop/test_processor_udop.py b/tests/models/udop/test_processor_udop.py
index 5e913f8ec6da97..05855991b185ea 100644
--- a/tests/models/udop/test_processor_udop.py
+++ b/tests/models/udop/test_processor_udop.py
@@ -22,6 +22,7 @@
import numpy as np
from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast
+from transformers.models.udop import UdopTokenizer, UdopTokenizerFast
from transformers.testing_utils import (
require_pytesseract,
require_sentencepiece,
@@ -34,11 +35,11 @@
if is_torch_available():
import torch
- from transformers import UdopTokenizer, UdopTokenizerFast
if is_pytesseract_available():
from PIL import Image
+
from transformers import LayoutLMv3ImageProcessor, UdopProcessor
From 4d892b729744a8f10ddac1077b06e381255258c3 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Tue, 5 Mar 2024 11:01:08 +0100
Subject: [PATCH 086/549] [`Udop imports`] Processor tests were not run.
(#29456)
* fix udop imports
* sort imports
---
tests/models/udop/test_processor_udop.py | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/tests/models/udop/test_processor_udop.py b/tests/models/udop/test_processor_udop.py
index 05855991b185ea..53a50b9a115963 100644
--- a/tests/models/udop/test_processor_udop.py
+++ b/tests/models/udop/test_processor_udop.py
@@ -21,8 +21,13 @@
import numpy as np
-from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast
-from transformers.models.udop import UdopTokenizer, UdopTokenizerFast
+from transformers import (
+ PreTrainedTokenizer,
+ PreTrainedTokenizerBase,
+ PreTrainedTokenizerFast,
+ UdopTokenizer,
+ UdopTokenizerFast,
+)
from transformers.testing_utils import (
require_pytesseract,
require_sentencepiece,
From 87a0783dde7291876d096d9dba5685dc1b7bd4c6 Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Tue, 5 Mar 2024 10:27:36 +0000
Subject: [PATCH 087/549] Generate: inner decoding methods are no longer public
(#29437)
---
docs/source/en/generation_strategies.md | 3 +
docs/source/en/internal/generation_utils.md | 13 +-
.../source/en/main_classes/text_generation.md | 7 -
docs/source/ja/internal/generation_utils.md | 9 --
.../source/ja/main_classes/text_generation.md | 7 -
docs/source/zh/internal/generation_utils.md | 11 +-
.../source/zh/main_classes/text_generation.md | 7 -
.../generation/configuration_utils.py | 18 +--
src/transformers/generation/utils.py | 134 +++++++++++++-----
.../models/musicgen/modeling_musicgen.py | 8 +-
src/transformers/models/rag/modeling_rag.py | 4 +-
11 files changed, 117 insertions(+), 104 deletions(-)
diff --git a/docs/source/en/generation_strategies.md b/docs/source/en/generation_strategies.md
index c4378551e6146c..2e26f4b679e2cc 100644
--- a/docs/source/en/generation_strategies.md
+++ b/docs/source/en/generation_strategies.md
@@ -389,3 +389,6 @@ just like in multinomial sampling. However, in assisted decoding, reducing the t
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are going to the same party. It is a small party, in a small']
```
+
+Alternatively, you can also set `prompt_lookup_num_tokens` to trigger n-gram-based assisted decoding, as opposed
+to model-based assisted decoding. You can read more about it [here](https://twitter.com/joao_gante/status/1747322413006643259).
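A minimal sketch of the behaviour described above, assuming an illustrative `gpt2` checkpoint: passing `prompt_lookup_num_tokens` to `generate` switches assisted decoding to its n-gram (prompt lookup) variant, so no assistant model is needed.

```python
# Sketch: n-gram based (prompt lookup) assisted decoding via a single generate() kwarg.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The quick brown fox jumps over the", return_tensors="pt")
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```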
diff --git a/docs/source/en/internal/generation_utils.md b/docs/source/en/internal/generation_utils.md
index 0fa15ddbcf1943..540594ece015d5 100644
--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@@ -16,16 +16,7 @@ rendered properly in your Markdown viewer.
# Utilities for Generation
-This page lists all the utility functions used by [`~generation.GenerationMixin.generate`],
-[`~generation.GenerationMixin.greedy_search`],
-[`~generation.GenerationMixin.contrastive_search`],
-[`~generation.GenerationMixin.sample`],
-[`~generation.GenerationMixin.beam_search`],
-[`~generation.GenerationMixin.beam_sample`],
-[`~generation.GenerationMixin.group_beam_search`], and
-[`~generation.GenerationMixin.constrained_beam_search`].
-
-Most of those are only useful if you are studying the code of the generate methods in the library.
+This page lists all the utility functions used by [`~generation.GenerationMixin.generate`].
## Generate Outputs
@@ -376,4 +367,4 @@ A [`Constraint`] can be used to force the generation to include specific tokens
[[autodoc]] StaticCache
- update
- - get_seq_length
\ No newline at end of file
+ - get_seq_length
diff --git a/docs/source/en/main_classes/text_generation.md b/docs/source/en/main_classes/text_generation.md
index 309d7298eec70f..a43519d5a042d2 100644
--- a/docs/source/en/main_classes/text_generation.md
+++ b/docs/source/en/main_classes/text_generation.md
@@ -43,13 +43,6 @@ like token streaming.
[[autodoc]] generation.GenerationMixin
- generate
- compute_transition_scores
- - greedy_search
- - sample
- - beam_search
- - beam_sample
- - contrastive_search
- - group_beam_search
- - constrained_beam_search
## TFGenerationMixin
diff --git a/docs/source/ja/internal/generation_utils.md b/docs/source/ja/internal/generation_utils.md
index baeefd06abb01b..8aa069e4dcd133 100644
--- a/docs/source/ja/internal/generation_utils.md
+++ b/docs/source/ja/internal/generation_utils.md
@@ -17,15 +17,6 @@ rendered properly in your Markdown viewer.
# 発電用ユーティリティ
このページには、[`~generation.GenerationMixin.generate`] で使用されるすべてのユーティリティ関数がリストされています。
-[`~generation.GenerationMixin.greedy_search`],
-[`~generation.GenerationMixin.contrastive_search`],
-[`~generation.GenerationMixin.sample`],
-[`~generation.GenerationMixin.beam_search`],
-[`~generation.GenerationMixin.beam_sample`],
-[`~generation.GenerationMixin.group_beam_search`]、および
-[`~generation.GenerationMixin.constrained_beam_search`]。
-
-これらのほとんどは、ライブラリ内の生成メソッドのコードを学習する場合にのみ役に立ちます。
## 出力を生成する
diff --git a/docs/source/ja/main_classes/text_generation.md b/docs/source/ja/main_classes/text_generation.md
index 279d9b40735b73..18477d97e626d1 100644
--- a/docs/source/ja/main_classes/text_generation.md
+++ b/docs/source/ja/main_classes/text_generation.md
@@ -43,13 +43,6 @@ rendered properly in your Markdown viewer.
[[autodoc]] generation.GenerationMixin
- generate
- compute_transition_scores
- - greedy_search
- - sample
- - beam_search
- - beam_sample
- - contrastive_search
- - group_beam_search
- - constrained_beam_search
## TFGenerationMixin
diff --git a/docs/source/zh/internal/generation_utils.md b/docs/source/zh/internal/generation_utils.md
index 34e9bf2f787ef1..5d8056bb7d2dae 100644
--- a/docs/source/zh/internal/generation_utils.md
+++ b/docs/source/zh/internal/generation_utils.md
@@ -16,16 +16,7 @@ rendered properly in your Markdown viewer.
# 用于生成的工具
-此页面列出了所有由 [`~generation.GenerationMixin.generate`],
-[`~generation.GenerationMixin.greedy_search`],
-[`~generation.GenerationMixin.contrastive_search`],
-[`~generation.GenerationMixin.sample`],
-[`~generation.GenerationMixin.beam_search`],
-[`~generation.GenerationMixin.beam_sample`],
-[`~generation.GenerationMixin.group_beam_search`], 和
-[`~generation.GenerationMixin.constrained_beam_search`]使用的实用函数。
-
-其中大多数仅在您研究库中生成方法的代码时才有用。
+此页面列出了 [`~generation.GenerationMixin.generate`] 使用的所有实用函数。
## 生成输出
diff --git a/docs/source/zh/main_classes/text_generation.md b/docs/source/zh/main_classes/text_generation.md
index 773228832f2272..22e31b63c14e6b 100644
--- a/docs/source/zh/main_classes/text_generation.md
+++ b/docs/source/zh/main_classes/text_generation.md
@@ -38,13 +38,6 @@ rendered properly in your Markdown viewer.
[[autodoc]] generation.GenerationMixin
- generate
- compute_transition_scores
- - greedy_search
- - sample
- - beam_search
- - beam_sample
- - contrastive_search
- - group_beam_search
- - constrained_beam_search
## TFGenerationMixin
diff --git a/src/transformers/generation/configuration_utils.py b/src/transformers/generation/configuration_utils.py
index f6d9c8f52c0066..cacc2dc8e8a8c9 100644
--- a/src/transformers/generation/configuration_utils.py
+++ b/src/transformers/generation/configuration_utils.py
@@ -43,22 +43,22 @@ class GenerationConfig(PushToHubMixin):
Class that holds a configuration for a generation task. A `generate` call supports the following generation methods
for text-decoder, text-to-text, speech-to-text, and vision-to-text models:
- - *greedy decoding* by calling [`~generation.GenerationMixin.greedy_search`] if `num_beams=1` and
+ - *greedy decoding* by calling [`~generation.GenerationMixin._greedy_search`] if `num_beams=1` and
`do_sample=False`
- - *contrastive search* by calling [`~generation.GenerationMixin.contrastive_search`] if `penalty_alpha>0.`
+ - *contrastive search* by calling [`~generation.GenerationMixin._contrastive_search`] if `penalty_alpha>0.`
and `top_k>1`
- - *multinomial sampling* by calling [`~generation.GenerationMixin.sample`] if `num_beams=1` and
+ - *multinomial sampling* by calling [`~generation.GenerationMixin._sample`] if `num_beams=1` and
`do_sample=True`
- - *beam-search decoding* by calling [`~generation.GenerationMixin.beam_search`] if `num_beams>1` and
+ - *beam-search decoding* by calling [`~generation.GenerationMixin._beam_search`] if `num_beams>1` and
`do_sample=False`
- - *beam-search multinomial sampling* by calling [`~generation.GenerationMixin.beam_sample`] if
+ - *beam-search multinomial sampling* by calling [`~generation.GenerationMixin._beam_sample`] if
`num_beams>1` and `do_sample=True`
- - *diverse beam-search decoding* by calling [`~generation.GenerationMixin.group_beam_search`], if
+ - *diverse beam-search decoding* by calling [`~generation.GenerationMixin._group_beam_search`], if
`num_beams>1` and `num_beam_groups>1`
- - *constrained beam-search decoding* by calling [`~generation.GenerationMixin.constrained_beam_search`], if
+ - *constrained beam-search decoding* by calling [`~generation.GenerationMixin._constrained_beam_search`], if
`constraints!=None` or `force_words_ids!=None`
- - *assisted decoding* by calling [`~generation.GenerationMixin.assisted_decoding`], if
- `assistant_model` is passed to `.generate()`
+ - *assisted decoding* by calling [`~generation.GenerationMixin._assisted_decoding`], if
+ `assistant_model` or `prompt_lookup_num_tokens` is passed to `.generate()`
You do not need to call any of the above methods directly. Pass custom parameter values to '.generate()'. To learn
more about decoding strategies refer to the [text generation strategies guide](../generation_strategies).
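A short sketch of how the flags listed in this docstring select a decoding strategy; with this commit the inner methods are private and are never called directly, everything goes through `generate`. The `gpt2` checkpoint is illustrative.

```python
# Sketch: the private decoding methods listed above are picked implicitly from the
# arguments passed to generate(); nothing here calls them directly.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=10)                          # -> _greedy_search
sampled = model.generate(**inputs, do_sample=True, max_new_tokens=10)         # -> _sample
beams = model.generate(**inputs, num_beams=4, max_new_tokens=10)              # -> _beam_search
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                             max_new_tokens=10)                               # -> _contrastive_search
```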
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 4bfae470cf720f..09e25958ac97b8 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -347,20 +347,22 @@ class GenerationMixin:
A class containing all functions for auto-regressive text generation, to be used as a mixin in [`PreTrainedModel`].
The class exposes [`~generation.GenerationMixin.generate`], which can be used for:
- - *greedy decoding* by calling [`~generation.GenerationMixin.greedy_search`] if `num_beams=1` and
+ - *greedy decoding* by calling [`~generation.GenerationMixin._greedy_search`] if `num_beams=1` and
`do_sample=False`
- - *contrastive search* by calling [`~generation.GenerationMixin.contrastive_search`] if `penalty_alpha>0` and
+ - *contrastive search* by calling [`~generation.GenerationMixin._contrastive_search`] if `penalty_alpha>0` and
`top_k>1`
- - *multinomial sampling* by calling [`~generation.GenerationMixin.sample`] if `num_beams=1` and
+ - *multinomial sampling* by calling [`~generation.GenerationMixin._sample`] if `num_beams=1` and
`do_sample=True`
- - *beam-search decoding* by calling [`~generation.GenerationMixin.beam_search`] if `num_beams>1` and
+ - *beam-search decoding* by calling [`~generation.GenerationMixin._beam_search`] if `num_beams>1` and
`do_sample=False`
- - *beam-search multinomial sampling* by calling [`~generation.GenerationMixin.beam_sample`] if `num_beams>1`
+ - *beam-search multinomial sampling* by calling [`~generation.GenerationMixin._beam_sample`] if `num_beams>1`
and `do_sample=True`
- - *diverse beam-search decoding* by calling [`~generation.GenerationMixin.group_beam_search`], if `num_beams>1`
+ - *diverse beam-search decoding* by calling [`~generation.GenerationMixin._group_beam_search`], if `num_beams>1`
and `num_beam_groups>1`
- - *constrained beam-search decoding* by calling [`~generation.GenerationMixin.constrained_beam_search`], if
+ - *constrained beam-search decoding* by calling [`~generation.GenerationMixin._constrained_beam_search`], if
`constraints!=None` or `force_words_ids!=None`
+ - *assisted decoding* by calling [`~generation.GenerationMixin._assisted_decoding`], if
+ `assistant_model` or `prompt_lookup_num_tokens` is passed to `.generate()`
You do not need to call any of the above methods directly. Pass custom parameter values to 'generate' instead. To
learn more about decoding strategies refer to the [text generation strategies guide](../generation_strategies).
@@ -1547,7 +1549,7 @@ def generate(
)
if generation_mode == GenerationMode.GREEDY_SEARCH:
# 11. run greedy search
- result = self.greedy_search(
+ result = self._greedy_search(
input_ids,
logits_processor=prepared_logits_processor,
stopping_criteria=prepared_stopping_criteria,
@@ -1565,7 +1567,7 @@ def generate(
if not model_kwargs["use_cache"]:
raise ValueError("Contrastive search requires `use_cache=True`")
- result = self.contrastive_search(
+ result = self._contrastive_search(
input_ids,
top_k=generation_config.top_k,
penalty_alpha=generation_config.penalty_alpha,
@@ -1595,7 +1597,7 @@ def generate(
)
# 13. run sample
- result = self.sample(
+ result = self._sample(
input_ids,
logits_processor=prepared_logits_processor,
logits_warper=logits_warper,
@@ -1629,7 +1631,7 @@ def generate(
**model_kwargs,
)
# 13. run beam search
- result = self.beam_search(
+ result = self._beam_search(
input_ids,
beam_scorer,
logits_processor=prepared_logits_processor,
@@ -1668,7 +1670,7 @@ def generate(
)
# 14. run beam sample
- result = self.beam_sample(
+ result = self._beam_sample(
input_ids,
beam_scorer,
logits_processor=prepared_logits_processor,
@@ -1703,7 +1705,7 @@ def generate(
**model_kwargs,
)
# 13. run beam search
- result = self.group_beam_search(
+ result = self._group_beam_search(
input_ids,
beam_scorer,
logits_processor=prepared_logits_processor,
@@ -1777,7 +1779,7 @@ def typeerror():
**model_kwargs,
)
# 13. run beam search
- result = self.constrained_beam_search(
+ result = self._constrained_beam_search(
input_ids,
constrained_beam_scorer=constrained_beam_scorer,
logits_processor=prepared_logits_processor,
@@ -1801,8 +1803,15 @@ def typeerror():
return result
+ def contrastive_search(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `contrastive_search` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._contrastive_search(*args, **kwargs)
+
@torch.no_grad()
- def contrastive_search(
+ def _contrastive_search(
self,
input_ids: torch.LongTensor,
top_k: Optional[int] = 1,
@@ -1828,7 +1837,7 @@ def contrastive_search(
- In most cases, you do not need to call [`~generation.GenerationMixin.contrastive_search`] directly. Use
+ In most cases, you do not need to call [`~generation.GenerationMixin._contrastive_search`] directly. Use
generate() instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -1902,7 +1911,7 @@ def contrastive_search(
>>> input_prompt = "DeepMind Company is"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt")
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=64)])
- >>> outputs = model.contrastive_search(
+ >>> outputs = model._contrastive_search(
... **input_ids, penalty_alpha=0.6, top_k=4, stopping_criteria=stopping_criteria
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
@@ -2243,7 +2252,14 @@ def contrastive_search(
else:
return input_ids
- def greedy_search(
+ def greedy_search(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `greedy_search` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._greedy_search(*args, **kwargs)
+
+ def _greedy_search(
self,
input_ids: torch.LongTensor,
logits_processor: Optional[LogitsProcessorList] = None,
@@ -2266,7 +2282,7 @@ def greedy_search(
- In most cases, you do not need to call [`~generation.GenerationMixin.greedy_search`] directly. Use generate()
+ In most cases, you do not need to call [`~generation.GenerationMixin._greedy_search`] directly. Use generate()
instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -2348,7 +2364,7 @@ def greedy_search(
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
- >>> outputs = model.greedy_search(
+ >>> outputs = model._greedy_search(
... input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
... )
@@ -2514,7 +2530,14 @@ def greedy_search(
else:
return input_ids
- def sample(
+ def sample(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `sample` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._sample(*args, **kwargs)
+
+ def _sample(
self,
input_ids: torch.LongTensor,
logits_processor: Optional[LogitsProcessorList] = None,
@@ -2538,7 +2561,7 @@ def sample(
- In most cases, you do not need to call [`~generation.GenerationMixin.sample`] directly. Use generate() instead.
+ In most cases, you do not need to call [`~generation.GenerationMixin._sample`] directly. Use generate() instead.
For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -2635,7 +2658,7 @@ def sample(
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT
- >>> outputs = model.sample(
+ >>> outputs = model._sample(
... input_ids,
... logits_processor=logits_processor,
... logits_warper=logits_warper,
@@ -2832,7 +2855,14 @@ def _temporary_reorder_cache(self, past_key_values, beam_idx):
past_key_values.reorder_cache(beam_idx)
return past_key_values
- def beam_search(
+ def beam_search(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `beam_search` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._beam_search(*args, **kwargs)
+
+ def _beam_search(
self,
input_ids: torch.LongTensor,
beam_scorer: BeamScorer,
@@ -2856,7 +2886,7 @@ def beam_search(
- In most cases, you do not need to call [`~generation.GenerationMixin.beam_search`] directly. Use generate()
+ In most cases, you do not need to call [`~generation.GenerationMixin._beam_search`] directly. Use generate()
instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -2958,7 +2988,7 @@ def beam_search(
... ]
... )
- >>> outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)
+ >>> outputs = model._beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']
@@ -3214,7 +3244,14 @@ def beam_search(
else:
return sequence_outputs["sequences"]
- def beam_sample(
+ def beam_sample(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `beam_sample` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._beam_sample(*args, **kwargs)
+
+ def _beam_sample(
self,
input_ids: torch.LongTensor,
beam_scorer: BeamScorer,
@@ -3238,7 +3275,7 @@ def beam_sample(
- In most cases, you do not need to call [`~generation.GenerationMixin.beam_sample`] directly. Use generate()
+ In most cases, you do not need to call [`~generation.GenerationMixin._beam_sample`] directly. Use generate()
instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -3346,7 +3383,7 @@ def beam_sample(
... ]
... )
- >>> outputs = model.beam_sample(
+ >>> outputs = model._beam_sample(
... input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, **model_kwargs
... )
@@ -3561,7 +3598,14 @@ def beam_sample(
else:
return sequence_outputs["sequences"]
- def group_beam_search(
+ def group_beam_search(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `group_beam_search` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._group_beam_search(*args, **kwargs)
+
+ def _group_beam_search(
self,
input_ids: torch.LongTensor,
beam_scorer: BeamScorer,
@@ -3584,7 +3628,7 @@ def group_beam_search(
- In most cases, you do not need to call [`~generation.GenerationMixin.group_beam_search`] directly. Use
+ In most cases, you do not need to call [`~generation.GenerationMixin._group_beam_search`] directly. Use
generate() instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -3686,7 +3730,7 @@ def group_beam_search(
... ]
... )
- >>> outputs = model.group_beam_search(
+ >>> outputs = model._group_beam_search(
... input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs
... )
@@ -3958,7 +4002,14 @@ def group_beam_search(
else:
return sequence_outputs["sequences"]
- def constrained_beam_search(
+ def constrained_beam_search(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `constrained_beam_search` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._constrained_beam_search(*args, **kwargs)
+
+ def _constrained_beam_search(
self,
input_ids: torch.LongTensor,
constrained_beam_scorer: ConstrainedBeamSearchScorer,
@@ -3981,7 +4032,7 @@ def constrained_beam_search(
- In most cases, you do not need to call [`~generation.GenerationMixin.constrained_beam_search`] directly. Use
+ In most cases, you do not need to call [`~generation.GenerationMixin._constrained_beam_search`] directly. Use
generate() instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -4088,7 +4139,7 @@ def constrained_beam_search(
... ]
... )
- >>> outputs = model.constrained_beam_search(
+ >>> outputs = model._constrained_beam_search(
... input_ids, beam_scorer, constraints=constraints, logits_processor=logits_processor, **model_kwargs
... )
@@ -4311,7 +4362,14 @@ def constrained_beam_search(
else:
return sequence_outputs["sequences"]
- def assisted_decoding(
+ def assisted_decoding(self, *args, **kwargs):
+ logger.warning_once(
+ "Calling `_assisted_decoding` directly is deprecated and will be removed in v4.41. Use `generate` or a "
+ "custom generation loop instead.",
+ )
+ return self._assisted_decoding(*args, **kwargs)
+
+ def _assisted_decoding(
self,
input_ids: torch.LongTensor,
candidate_generator: Optional["CandidateGenerator"] = None,
@@ -4338,7 +4396,7 @@ def assisted_decoding(
- In most cases, you do not need to call [`~generation.GenerationMixin.candidate_decoding`] directly. Use
+ In most cases, you do not need to call [`~generation.GenerationMixin._assisted_decoding`] directly. Use
generate() instead. For an overview of generation strategies and code examples, check the [following
guide](../generation_strategies).
@@ -4429,7 +4487,7 @@ def assisted_decoding(
... logits_processor=logits_processor,
... model_kwargs={},
... )
- >>> outputs = model.assisted_decoding(
+ >>> outputs = model._assisted_decoding(
... input_ids,
... candidate_generator=candidate_generator,
... logits_processor=logits_processor,
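A library-agnostic sketch of the shim pattern this commit applies to every inner decoding method: the old public name keeps working, warns once, and forwards to the now-private implementation.

```python
# Sketch of the deprecation shim used above (not the transformers code itself):
# the public method becomes a thin wrapper around the private one.
import warnings


class Generator:
    def beam_search(self, *args, **kwargs):
        warnings.warn(
            "Calling `beam_search` directly is deprecated. Use `generate` instead.",
            FutureWarning,
        )
        return self._beam_search(*args, **kwargs)

    def _beam_search(self, *args, **kwargs):
        # the real decoding logic lives here
        return "sequences"


Generator().beam_search()  # still works, but emits a FutureWarning
```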
diff --git a/src/transformers/models/musicgen/modeling_musicgen.py b/src/transformers/models/musicgen/modeling_musicgen.py
index 2514a487632385..bb5a5277f362b1 100644
--- a/src/transformers/models/musicgen/modeling_musicgen.py
+++ b/src/transformers/models/musicgen/modeling_musicgen.py
@@ -1336,7 +1336,7 @@ def generate(
)
# 11. run greedy search
- outputs = self.greedy_search(
+ outputs = self._greedy_search(
input_ids,
logits_processor=logits_processor,
stopping_criteria=stopping_criteria,
@@ -1361,7 +1361,7 @@ def generate(
)
# 12. run sample
- outputs = self.sample(
+ outputs = self._sample(
input_ids,
logits_processor=logits_processor,
logits_warper=logits_warper,
@@ -2402,7 +2402,7 @@ def generate(
)
# 11. run greedy search
- outputs = self.greedy_search(
+ outputs = self._greedy_search(
input_ids,
logits_processor=logits_processor,
stopping_criteria=stopping_criteria,
@@ -2428,7 +2428,7 @@ def generate(
)
# 12. run sample
- outputs = self.sample(
+ outputs = self._sample(
input_ids,
logits_processor=logits_processor,
logits_warper=logits_warper,
diff --git a/src/transformers/models/rag/modeling_rag.py b/src/transformers/models/rag/modeling_rag.py
index a840b0681eddbe..80dec5bc3dba58 100644
--- a/src/transformers/models/rag/modeling_rag.py
+++ b/src/transformers/models/rag/modeling_rag.py
@@ -1539,7 +1539,7 @@ def extend_enc_output(tensor, num_beams=None):
f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing"
" greedy search."
)
- return self.greedy_search(
+ return self._greedy_search(
input_ids,
logits_processor=pre_processor,
max_length=generation_config.max_length,
@@ -1559,7 +1559,7 @@ def extend_enc_output(tensor, num_beams=None):
num_beam_hyps_to_keep=generation_config.num_return_sequences,
max_length=generation_config.max_length,
)
- return self.beam_search(
+ return self._beam_search(
input_ids,
beam_scorer,
logits_processor=pre_processor,
From fb1c62e9731b0acd2ff4edc09af519ebc62dfd39 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Tue, 5 Mar 2024 12:01:06 +0100
Subject: [PATCH 088/549] [`Add Mamba`] Adds support for the `Mamba` models
(#28094)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* initial-commit
* start cleaning
* small nits
* small nits
* current updates
* add kernels
* small refactoring little step
* add comments
* styling
* nit
* nits
* Style
* Small changes
* Push dummy mambda simple slow
* nit
* Use original names
* Use original names and remove norm
* Updates for inference params
* Style nd updates
* nits
* Match logits
* Add a test
* Add expected generated text
* nits doc, imports and styling
* style
* oups
* dont install kernels, invite users to install the required kernels
* let use use the original packages
* styling
* nits
* fix some copieds
* update doc
* fix-copies
* styling done
* nits
* fix import check
* run but wrong cuda ress
* mamba CUDA works :)
* fix the fast path
* config naming nits
* conversion script is not required at this stage
* finish fixing the fast path: generation make sense now!
* nit
* Let's start working on the CIs
* style
* better style
* more nits
* test nit
* quick fix for now
* nits
* nit
* nit
* nit
* nits
* update test rest
* fixup
* update test
* nit
* some fixes
* nits
* update test values
* fix styling
* nit
* support peft
* integrations tests require torchg
* also add slow markers
* styling
* chose forward wisely
* nits
* update tests
* fix gradient checkpointing
* fixup
* nit
* fix doc
* check copies
* fix the docstring
* fix some more tests
* style
* fix beam search
* add init schene
* update
* nit
* fix
* fixup the doc
* fix the doc
* fixup
* tentative update but slow is no longer good
* nit
* should we always use float32?
* nits
* revert wrong changes
* res in float32
* cleanup
* skip fmt for now
* update generation values
* update test values running original model
* fixup
* update tests + rename inference_params to cache_params + make sure training does not use cache_params
* small nits
* more nits
* fix final CIs
* style
* nit doc
* I hope final doc nits
* nit
* 🫠
* final touch!
* fix torch import
* Apply suggestions from code review
Co-authored-by: Lysandre Debut
* Apply suggestions from code review
* fix fix and fix
* fix base model prefix!
* nit
* Update src/transformers/models/mamba/__init__.py
* Update docs/source/en/model_doc/mamba.md
Co-authored-by: Lysandre Debut
* nit
---------
Co-authored-by: Lysandre Debut
---
README.md | 1 +
README_es.md | 1 +
README_fr.md | 1 +
README_hd.md | 1 +
README_ja.md | 1 +
README_ko.md | 1 +
README_zh-hans.md | 1 +
README_zh-hant.md | 1 +
docs/source/en/_toctree.yml | 2 +
docs/source/en/index.md | 1 +
docs/source/en/model_doc/mamba.md | 107 +++
docs/source/en/tasks/language_modeling.md | 2 +-
src/transformers/__init__.py | 16 +
src/transformers/generation/utils.py | 8 +-
src/transformers/models/__init__.py | 1 +
.../models/auto/configuration_auto.py | 3 +
src/transformers/models/auto/modeling_auto.py | 4 +
.../models/auto/tokenization_auto.py | 1 +
src/transformers/models/mamba/__init__.py | 60 ++
.../models/mamba/configuration_mamba.py | 156 ++++
.../models/mamba/modeling_mamba.py | 681 ++++++++++++++++++
src/transformers/utils/dummy_pt_objects.py | 24 +
src/transformers/utils/import_utils.py | 21 +
tests/models/mamba/__init__.py | 0
tests/models/mamba/test_modeling_mamba.py | 491 +++++++++++++
utils/check_config_attributes.py | 2 +
26 files changed, 1583 insertions(+), 5 deletions(-)
create mode 100644 docs/source/en/model_doc/mamba.md
create mode 100644 src/transformers/models/mamba/__init__.py
create mode 100644 src/transformers/models/mamba/configuration_mamba.py
create mode 100644 src/transformers/models/mamba/modeling_mamba.py
create mode 100644 tests/models/mamba/__init__.py
create mode 100644 tests/models/mamba/test_modeling_mamba.py
diff --git a/README.md b/README.md
index 868547c42f584e..f0e178145bb4c1 100644
--- a/README.md
+++ b/README.md
@@ -415,6 +415,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (from Albert Gu and Tri Dao) released with the paper [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) by Albert Gu and Tri Dao.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
diff --git a/README_es.md b/README_es.md
index 0c33e07440b69e..63f9ddce93c727 100644
--- a/README_es.md
+++ b/README_es.md
@@ -388,6 +388,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (from Albert Gu and Tri Dao) released with the paper [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) by Albert Gu and Tri Dao.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
diff --git a/README_fr.md b/README_fr.md
index 26247139595e86..3e2b080ba1bce9 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -409,6 +409,7 @@ Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=h
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (de Facebook) a été publié dans l'article [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) de Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve et Ronan Collobert.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (de Facebook) a été publié dans l'article [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) de Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (de Google) a été publié dans l'article [MADLAD-400 : Un ensemble de données multilingue et de niveau document](https://arxiv.org/abs/2309.04662) de Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (de Albert Gu and Tri Dao) publié dans l'article [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) par Albert Gu and Tri Dao.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Des modèles de traduction automatique formés avec les données [OPUS](http://opus.nlpl.eu/) par Jörg Tiedemann. Le [cadre Marian](https://marian-nmt.github.io/) est en cours de développement par l'équipe Microsoft Translator.
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (de Microsoft Research Asia) a été publié dans l'article [MarkupLM : Pré-entraînement de texte et de langage de balisage pour la compréhension visuellement riche de documents](https://arxiv.org/abs/2110.08518) de Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (de FAIR et UIUC) a été publié dans l'article [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) de Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
diff --git a/README_hd.md b/README_hd.md
index cdb126cd23e627..c94534d00c6f40 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -362,6 +362,7 @@ conda install conda-forge::transformers
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (फेसबुक से) साथ देने वाला पेपर [बियॉन्ड इंग्लिश-सेंट्रिक मल्टीलिंगुअल मशीन ट्रांसलेशन](https://arxiv.org/एब्स/2010.11125) एंजेला फैन, श्रुति भोसले, होल्गर श्वेन्क, झी मा, अहमद अल-किश्की, सिद्धार्थ गोयल, मनदीप बैनेस, ओनूर सेलेबी, गुइल्लाम वेन्जेक, विश्रव चौधरी, नमन गोयल, टॉम बर्च, विटाली लिपचिंस्की, सर्गेई एडुनोव, एडौर्ड द्वारा ग्रेव, माइकल औली, आर्मंड जौलिन द्वारा पोस्ट किया गया।
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (Albert Gu and Tri Dao से) Albert Gu and Tri Dao. द्वाराअनुसंधान पत्र [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) के साथ जारी किया गया
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Jörg द्वारा [OPUS](http://opus.nlpl.eu/) डेटा से प्रशिक्षित मशीनी अनुवाद मॉडल पोस्ट किया गया टाइडेमैन द्वारा। [मैरियन फ्रेमवर्क](https://marian-nmt.github.io/) माइक्रोसॉफ्ट ट्रांसलेटर टीम द्वारा विकसित।
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (माइक्रोसॉफ्ट रिसर्च एशिया से) साथ में पेपर [मार्कअपएलएम: विजुअली-रिच डॉक्यूमेंट अंडरस्टैंडिंग के लिए टेक्स्ट और मार्कअप लैंग्वेज का प्री-ट्रेनिंग](https://arxiv.org/abs/2110.08518) जुनलॉन्ग ली, यिहेंग जू, लेई कुई, फुरु द्वारा वी द्वारा पोस्ट किया गया।
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (FAIR and UIUC से) Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. द्वाराअनुसंधान पत्र [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) के साथ जारी किया गया
diff --git a/README_ja.md b/README_ja.md
index 52a147b54da753..4f06efafb4e778 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -422,6 +422,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (Facebook から) Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert から公開された研究論文: [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161)
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (Facebook から) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin から公開された研究論文: [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125)
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (Albert Gu and Tri Dao から) Albert Gu and Tri Dao. から公開された研究論文 [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Jörg Tiedemann から. [OPUS](http://opus.nlpl.eu/) を使いながら学習された "Machine translation" (マシントランスレーション) モデル. [Marian Framework](https://marian-nmt.github.io/) はMicrosoft Translator Team が現在開発中です.
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (Microsoft Research Asia から) Junlong Li, Yiheng Xu, Lei Cui, Furu Wei から公開された研究論文: [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518)
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (FAIR and UIUC から) Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. から公開された研究論文 [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527)
diff --git a/README_ko.md b/README_ko.md
index 787e8e17008abe..1c306e1eda454d 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -337,6 +337,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (Facebook 에서) Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert 의 [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) 논문과 함께 발표했습니다.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (Facebook 에서) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin 의 [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) 논문과 함께 발표했습니다.
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (Albert Gu and Tri Dao 에서 제공)은 Albert Gu and Tri Dao.의 [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)논문과 함께 발표했습니다.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (Microsoft Research Asia 에서) Junlong Li, Yiheng Xu, Lei Cui, Furu Wei 의 [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) 논문과 함께 발표했습니다.
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (FAIR and UIUC 에서 제공)은 Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.의 [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527)논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index 9b34fcf450cbe6..31faadd73cb752 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -361,6 +361,7 @@ conda install conda-forge::transformers
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (来自 Facebook) 伴随论文 [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) 由 Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert 发布。
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (来自 Facebook) 伴随论文 [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) 由 Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin 发布。
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (来自 Albert Gu and Tri Dao) 伴随论文 [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) 由 Albert Gu and Tri Dao 发布。
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** 用 [OPUS](http://opus.nlpl.eu/) 数据训练的机器翻译模型由 Jörg Tiedemann 发布。[Marian Framework](https://marian-nmt.github.io/) 由微软翻译团队开发。
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (来自 Microsoft Research Asia) 伴随论文 [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) 由 Junlong Li, Yiheng Xu, Lei Cui, Furu Wei 发布。
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (来自 FAIR and UIUC) 伴随论文 [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) 由 Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index e4a888205a655f..cbae8dd807db8c 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -373,6 +373,7 @@ conda install conda-forge::transformers
1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat.
+1. **[Mamba](https://huggingface.co/docs/transformers/main/model_doc/mamba)** (from Albert Gu and Tri Dao) released with the paper [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) by Albert Gu and Tri Dao.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.
1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 76d8a2ba7d7d75..0144eca5f5f123 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -398,6 +398,8 @@
title: M2M100
- local: model_doc/madlad-400
title: MADLAD-400
+ - local: model_doc/mamba
+ title: Mamba
- local: model_doc/marian
title: MarianMT
- local: model_doc/markuplm
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index 36216962d2da34..eb7c225ee53e51 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -180,6 +180,7 @@ Flax), PyTorch, and/or TensorFlow.
| [M-CTC-T](model_doc/mctct) | ✅ | ❌ | ❌ |
| [M2M100](model_doc/m2m_100) | ✅ | ❌ | ❌ |
| [MADLAD-400](model_doc/madlad-400) | ✅ | ✅ | ✅ |
+| [Mamba](model_doc/mamba) | ✅ | ❌ | ❌ |
| [Marian](model_doc/marian) | ✅ | ✅ | ✅ |
| [MarkupLM](model_doc/markuplm) | ✅ | ❌ | ❌ |
| [Mask2Former](model_doc/mask2former) | ✅ | ❌ | ❌ |
diff --git a/docs/source/en/model_doc/mamba.md b/docs/source/en/model_doc/mamba.md
new file mode 100644
index 00000000000000..7378f79f94df7f
--- /dev/null
+++ b/docs/source/en/model_doc/mamba.md
@@ -0,0 +1,107 @@
+
+
+# Mamba
+
+## Overview
+
+The Mamba model was proposed in [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) by Albert Gu and Tri Dao.
+
+Mamba is a new architecture paradigm based on structured state space models (SSMs). You can read more about the intuition behind these models [here](https://srush.github.io/annotated-s4/).
+
+The abstract from the paper is the following:
+
+*Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.*
+
+Tips:
+
+- Mamba is a new state space model architecture that rivals classic Transformers. It builds on the line of progress on structured state space models, with an efficient hardware-aware design and implementation in the spirit of [FlashAttention](https://github.com/Dao-AILab/flash-attention).
+- Mamba stacks `mixer` layers, which are the equivalent of `Attention` layers. The core logic of `mamba` is held in the `MambaMixer` class.
+- Two implementations coexist: one is optimized and uses fast CUDA kernels, while the other one is naive but can run on any device!
+- The current implementation leverages the original CUDA kernels: the equivalents of FlashAttention for Mamba are hosted in the [`mamba-ssm`](https://github.com/state-spaces/mamba) and [`causal_conv1d`](https://github.com/Dao-AILab/causal-conv1d) repositories. Make sure to install them if your hardware supports them (see the availability check below)!
+- Contributions to make the naive path faster are welcome 🤗
+
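+The following is a minimal sketch, not part of the library, of how you can check whether the optimized path can be used on your machine. It assumes, mirroring the tips above, that the fast path needs a CUDA device together with the `mamba_ssm` and `causal_conv1d` packages:
+
+```python
+import importlib.util
+
+import torch
+
+# Assumption (see the tips above): the fast path needs CUDA plus both optional packages.
+has_cuda = torch.cuda.is_available()
+has_mamba_ssm = importlib.util.find_spec("mamba_ssm") is not None
+has_causal_conv1d = importlib.util.find_spec("causal_conv1d") is not None
+
+if has_cuda and has_mamba_ssm and has_causal_conv1d:
+    print("The optimized CUDA kernels can be used.")
+else:
+    print("Falling back to the naive implementation, which runs on any device.")
+```
+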
+This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ).
+The original code can be found [here](https://github.com/state-spaces/mamba).
+
+## Usage
+
+### A simple generation example
+```python
+from transformers import MambaConfig, MambaForCausalLM, AutoTokenizer
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("ArthurZ/mamba-130m")
+tokenizer.pad_token = tokenizer.eos_token
+
+model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", vocab_size=50280, num_hidden_layers=24, torch_dtype=torch.float32)
+model.config.use_cache = True
+input_ids = tokenizer("Hey how are you doing?", return_tensors="pt")["input_ids"]
+
+out = model.generate(input_ids, max_new_tokens=10)
+print(tokenizer.batch_decode(out))
+```
+
+### PEFT fine-tuning
+Note that the slow (naive) implementation is not very stable for training, and the fast one requires `float32`!
+
+```python
+from datasets import load_dataset
+from trl import SFTTrainer
+from peft import LoraConfig
+from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
+
+model_id = "ArthurZ/mamba-2.8b"
+tokenizer = AutoTokenizer.from_pretrained(model_id, pad_token="")
+model = AutoModelForCausalLM.from_pretrained(model_id)
+dataset = load_dataset("Abirate/english_quotes", split="train")
+training_args = TrainingArguments(
+    output_dir="./results",
+    num_train_epochs=3,
+    per_device_train_batch_size=4,
+    logging_dir="./logs",
+    logging_steps=10,
+    learning_rate=2e-3,
+)
+lora_config = LoraConfig(
+    r=8,
+    target_modules="all-linear",
+    task_type="CAUSAL_LM",
+    bias="none",
+)
+trainer = SFTTrainer(
+    model=model,
+    tokenizer=tokenizer,
+    args=training_args,
+    peft_config=lora_config,
+    train_dataset=dataset,
+    dataset_text_field="quote",
+)
+trainer.train()
+```
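+
+To reuse the trained adapter for inference, the standard `peft` workflow should apply. The sketch below is an assumption-based example, not part of the library, and supposes the adapter was saved to a hypothetical `./mamba-lora` directory (e.g. via `trainer.save_model("./mamba-lora")`):
+
+```python
+from peft import PeftModel
+from transformers import AutoModelForCausalLM
+
+# "./mamba-lora" is a hypothetical directory produced by trainer.save_model("./mamba-lora").
+base_model = AutoModelForCausalLM.from_pretrained("ArthurZ/mamba-2.8b")
+model = PeftModel.from_pretrained(base_model, "./mamba-lora")
+```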
+
+## MambaConfig
+
+[[autodoc]] MambaConfig
+
+## MambaModel
+
+[[autodoc]] MambaModel
+ - forward
+
+## MambaForCausalLM
+
+[[autodoc]] MambaForCausalLM
+ - forward
diff --git a/docs/source/en/tasks/language_modeling.md b/docs/source/en/tasks/language_modeling.md
index bcd10341b7443e..97a40d5897bbcc 100644
--- a/docs/source/en/tasks/language_modeling.md
+++ b/docs/source/en/tasks/language_modeling.md
@@ -37,7 +37,7 @@ You can finetune other architectures for causal language modeling following the
Choose one of the following architectures:
-[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [Gemma](../model_doc/gemma), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Starcoder2](../model_doc/starcoder2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
+[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [Gemma](../model_doc/gemma), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Mamba](../model_doc/mamba), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Starcoder2](../model_doc/starcoder2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index 6cdd561b41e1ba..f7c92f033f69ee 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -571,6 +571,7 @@
"LxmertTokenizer",
],
"models.m2m_100": ["M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP", "M2M100Config"],
+ "models.mamba": ["MAMBA_PRETRAINED_CONFIG_ARCHIVE_MAP", "MambaConfig"],
"models.marian": ["MarianConfig"],
"models.markuplm": [
"MARKUPLM_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -2578,6 +2579,14 @@
"M2M100PreTrainedModel",
]
)
+ _import_structure["models.mamba"].extend(
+ [
+ "MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST",
+ "MambaForCausalLM",
+ "MambaModel",
+ "MambaPreTrainedModel",
+ ]
+ )
_import_structure["models.marian"].extend(["MarianForCausalLM", "MarianModel", "MarianMTModel"])
_import_structure["models.markuplm"].extend(
[
@@ -5370,6 +5379,7 @@
LxmertTokenizer,
)
from .models.m2m_100 import M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP, M2M100Config
+ from .models.mamba import MAMBA_PRETRAINED_CONFIG_ARCHIVE_MAP, MambaConfig
from .models.marian import MarianConfig
from .models.markuplm import (
MARKUPLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -7160,6 +7170,12 @@
M2M100Model,
M2M100PreTrainedModel,
)
+ from .models.mamba import (
+ MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST,
+ MambaForCausalLM,
+ MambaModel,
+ MambaPreTrainedModel,
+ )
from .models.marian import MarianForCausalLM, MarianModel, MarianMTModel
from .models.markuplm import (
MARKUPLM_PRETRAINED_MODEL_ARCHIVE_LIST,
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 09e25958ac97b8..2f68c8b2f4bc12 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -3183,7 +3183,7 @@ def _beam_search(
model_kwargs = self._update_model_kwargs_for_generation(
outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder, model_inputs=model_inputs
)
- if model_kwargs["past_key_values"] is not None:
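+ # some architectures (e.g. Mamba) use a custom cache instead of `past_key_values`, hence the .get()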
+ if model_kwargs.get("past_key_values", None) is not None:
model_kwargs["past_key_values"] = self._temporary_reorder_cache(
model_kwargs["past_key_values"], beam_idx
)
@@ -3537,7 +3537,7 @@ def _beam_sample(
model_kwargs = self._update_model_kwargs_for_generation(
outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder, model_inputs=model_inputs
)
- if model_kwargs["past_key_values"] is not None:
+ if model_kwargs.get("past_key_values", None) is not None:
model_kwargs["past_key_values"] = self._temporary_reorder_cache(
model_kwargs["past_key_values"], beam_idx
)
@@ -3943,7 +3943,7 @@ def _group_beam_search(
model_kwargs = self._update_model_kwargs_for_generation(
outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder, model_inputs=model_inputs
)
- if model_kwargs["past_key_values"] is not None:
+ if model_kwargs.get("past_key_values", None) is not None:
model_kwargs["past_key_values"] = self._temporary_reorder_cache(
model_kwargs["past_key_values"], reordering_indices
)
@@ -4302,7 +4302,7 @@ def _constrained_beam_search(
model_kwargs = self._update_model_kwargs_for_generation(
outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder, model_inputs=model_inputs
)
- if model_kwargs["past_key_values"] is not None:
+ if model_kwargs.get("past_key_values", None) is not None:
model_kwargs["past_key_values"] = self._temporary_reorder_cache(
model_kwargs["past_key_values"], beam_idx
)
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index 89ca6ab2b8660c..bce37fa50b69cb 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -128,6 +128,7 @@
luke,
lxmert,
m2m_100,
+ mamba,
marian,
markuplm,
mask2former,
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index 87ff925e55eaa1..45bbc46a28a86c 100755
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -137,6 +137,7 @@
("luke", "LukeConfig"),
("lxmert", "LxmertConfig"),
("m2m_100", "M2M100Config"),
+ ("mamba", "MambaConfig"),
("marian", "MarianConfig"),
("markuplm", "MarkupLMConfig"),
("mask2former", "Mask2FormerConfig"),
@@ -373,6 +374,7 @@
("luke", "LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("lxmert", "LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("m2m_100", "M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+ ("mamba", "MAMBA_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("markuplm", "MARKUPLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("mask2former", "MASK2FORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("maskformer", "MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -614,6 +616,7 @@
("lxmert", "LXMERT"),
("m2m_100", "M2M100"),
("madlad-400", "MADLAD-400"),
+ ("mamba", "Mamba"),
("marian", "Marian"),
("markuplm", "MarkupLM"),
("mask2former", "Mask2Former"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 0d28d224f19106..109dbb19485916 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -134,6 +134,7 @@
("luke", "LukeModel"),
("lxmert", "LxmertModel"),
("m2m_100", "M2M100Model"),
+ ("mamba", "MambaModel"),
("marian", "MarianModel"),
("markuplm", "MarkupLMModel"),
("mask2former", "Mask2FormerModel"),
@@ -286,6 +287,7 @@
("longformer", "LongformerForMaskedLM"),
("luke", "LukeForMaskedLM"),
("lxmert", "LxmertForPreTraining"),
+ ("mamba", "MambaForCausalLM"),
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForPreTraining"),
("mobilebert", "MobileBertForPreTraining"),
@@ -367,6 +369,7 @@
("longt5", "LongT5ForConditionalGeneration"),
("luke", "LukeForMaskedLM"),
("m2m_100", "M2M100ForConditionalGeneration"),
+ ("mamba", "MambaForCausalLM"),
("marian", "MarianMTModel"),
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForCausalLM"),
@@ -439,6 +442,7 @@
("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"),
("gptj", "GPTJForCausalLM"),
("llama", "LlamaForCausalLM"),
+ ("mamba", "MambaForCausalLM"),
("marian", "MarianForCausalLM"),
("mbart", "MBartForCausalLM"),
("mega", "MegaForCausalLM"),
diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py
index d586068fb9c095..38a9650c0025db 100644
--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -233,6 +233,7 @@
("luke", ("LukeTokenizer", None)),
("lxmert", ("LxmertTokenizer", "LxmertTokenizerFast" if is_tokenizers_available() else None)),
("m2m_100", ("M2M100Tokenizer" if is_sentencepiece_available() else None, None)),
+ ("mamba", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("marian", ("MarianTokenizer" if is_sentencepiece_available() else None, None)),
(
"mbart",
diff --git a/src/transformers/models/mamba/__init__.py b/src/transformers/models/mamba/__init__.py
new file mode 100644
index 00000000000000..7a1c142e05d51e
--- /dev/null
+++ b/src/transformers/models/mamba/__init__.py
@@ -0,0 +1,60 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING
+
+from ...utils import (
+ OptionalDependencyNotAvailable,
+ _LazyModule,
+ is_torch_available,
+)
+
+
+_import_structure = {
+ "configuration_mamba": ["MAMBA_PRETRAINED_CONFIG_ARCHIVE_MAP", "MambaConfig", "MambaOnnxConfig"],
+}
+
+try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["modeling_mamba"] = [
+ "MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST",
+ "MambaForCausalLM",
+ "MambaModel",
+ "MambaPreTrainedModel",
+ ]
+
+
+if TYPE_CHECKING:
+ from .configuration_mamba import MAMBA_PRETRAINED_CONFIG_ARCHIVE_MAP, MambaConfig, MambaOnnxConfig
+
+ try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .modeling_mamba import (
+ MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST,
+ MambaForCausalLM,
+ MambaModel,
+ MambaPreTrainedModel,
+ )
+else:
+ import sys
+
+ sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/mamba/configuration_mamba.py b/src/transformers/models/mamba/configuration_mamba.py
new file mode 100644
index 00000000000000..dd1dd129aec633
--- /dev/null
+++ b/src/transformers/models/mamba/configuration_mamba.py
@@ -0,0 +1,156 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" MAMBA configuration"""
+import math
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+MAMBA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+ "state-spaces/mamba-2.8b": "https://huggingface.co/state-spaces/mamba-2.8b/resolve/main/config.json",
+}
+
+
+class MambaConfig(PretrainedConfig):
+ """
+ This is the configuration class to store the configuration of a [`MambaModel`]. It is used to instantiate a MAMBA
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+ defaults will yield a similar configuration to that of the MAMBA
+ [state-spaces/mamba-2.8b](https://huggingface.co/state-spaces/mamba-2.8b) architecture.
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+
+ Args:
+ vocab_size (`int`, *optional*, defaults to 50280):
+ Vocabulary size of the MAMBA model. Defines the number of different tokens that can be represented by the
+ `input_ids` passed when calling [`MambaModel`].
+ hidden_size (`int`, *optional*, defaults to 768):
+ Dimensionality of the embeddings and hidden states.
+ state_size (`int`, *optional*, defaults to 16): Shape of the state space latents.
+ num_hidden_layers (`int`, *optional*, defaults to 32):
+ Number of hidden layers in the model.
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-05):
+ The epsilon to use in the layer normalization layers.
+ pad_token_id (`int`, *optional*, defaults to 0):
+ Padding token id.
+ bos_token_id (`int`, *optional*, defaults to 0):
+ The id of the beginning of sentence token in the vocabulary.
+ eos_token_id (`int`, *optional*, defaults to 0):
+ The id of the end of sentence token in the vocabulary.
+ expand (`int`, *optional*, defaults to 2): Expanding factor used to determine the intermediate size.
+ conv_kernel (`int`, *optional*, defaults to 4): Size of the convolution kernel.
+ use_bias (`bool`, *optional*, defaults to `False`):
+ Whether or not to use bias in ["in_proj", "out_proj"] of the mixer block
+ use_conv_bias (`bool`, *optional*, defaults to `True`):
+ Whether or not to use bias in the convolution layer of the mixer block.
+ hidden_act (`str`, *optional*, defaults to `"silu"`):
+ The non-linear activation function (function or string) in the decoder.
+ initializer_range (`float`, *optional*, defaults to 0.1):
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+ residual_in_fp32 (`bool`, *optional*, defaults to `True`):
+ Whether or not residuals should be in `float32`. If set to `False` residuals will keep the same `dtype` as the rest of the model
+ time_step_rank (`Union[int,str]`, *optional*, defaults to `"auto"`):
+ Rank of the discretization projection matrix. `"auto"` means that it will default to `math.ceil(self.hidden_size / 16)`
+ time_step_scale (`float`, *optional*, defaults to 1.0):
+ Scale used to scale `dt_proj.bias`.
+ time_step_min (`float`, *optional*, defaults to 0.001):
+ Minimum `time_step` used to bound `dt_proj.bias`.
+ time_step_max (`float`, *optional*, defaults to 0.1):
+ Maximum `time_step` used to bound `dt_proj.bias`.
+ time_step_init_scheme (`str`, *optional*, defaults to `"random"`):
+ Init scheme used for `dt_proj.weight`. Should be one of `["random", "constant"]`.
+ time_step_floor (`float`, *optional*, defaults to 0.0001):
+ Minimum clamping value of the `dt_proj.bias` layer initialization.
+ rescale_prenorm_residual (`bool`, *optional*, defaults to `False`):
+ Whether or not to rescale `out_proj` weights when initializing.
+ use_cache (`bool`, *optional*, defaults to `True`):
+ Whether or not the cache should be used.
+
+
+ Example:
+
+ ```python
+ >>> from transformers import MambaConfig, MambaModel
+
+ >>> # Initializing a Mamba configuration
+ >>> configuration = MambaConfig()
+
+ >>> # Initializing a model (with random weights) from the configuration
+ >>> model = MambaModel(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```"""
+
+ model_type = "mamba"
+
+ def __init__(
+ self,
+ vocab_size=50280,
+ hidden_size=768,
+ state_size=16,
+ num_hidden_layers=32,
+ layer_norm_epsilon=1e-5,
+ pad_token_id=0,
+ bos_token_id=0,
+ eos_token_id=0,
+ expand=2,
+ conv_kernel=4,
+ use_bias=False,
+ use_conv_bias=True,
+ hidden_act="silu",
+ initializer_range=0.1,
+ residual_in_fp32=True,
+ time_step_rank="auto",
+ time_step_scale=1.0,
+ time_step_min=0.001,
+ time_step_max=0.1,
+ time_step_init_scheme="random",
+ time_step_floor=1e-4,
+ rescale_prenorm_residual=False,
+ use_cache=True,
+ **kwargs,
+ ):
+ self.vocab_size = vocab_size
+ self.hidden_size = hidden_size
+ self.state_size = state_size
+ self.num_hidden_layers = num_hidden_layers
+ self.layer_norm_epsilon = layer_norm_epsilon
+ self.conv_kernel = conv_kernel
+ self.expand = expand
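+ # The mixer's inner dimension (called d_inner in the original implementation) is expand * hidden_size.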
+ self.intermediate_size = int(expand * self.hidden_size)
+ self.bos_token_id = bos_token_id
+ self.eos_token_id = eos_token_id
+ self.pad_token_id = pad_token_id
+ self.use_bias = use_bias
+ self.use_conv_bias = use_conv_bias
+ self.hidden_act = hidden_act
+ self.initializer_range = initializer_range
+ self.time_step_rank = math.ceil(self.hidden_size / 16) if time_step_rank == "auto" else time_step_rank
+ self.time_step_scale = time_step_scale
+ self.time_step_min = time_step_min
+ self.time_step_max = time_step_max
+ self.time_step_init_scheme = time_step_init_scheme
+ self.time_step_floor = time_step_floor
+ self.rescale_prenorm_residual = rescale_prenorm_residual
+ self.residual_in_fp32 = residual_in_fp32
+ self.use_cache = use_cache
+
+ super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, pad_token_id=pad_token_id, **kwargs)
diff --git a/src/transformers/models/mamba/modeling_mamba.py b/src/transformers/models/mamba/modeling_mamba.py
new file mode 100644
index 00000000000000..4870c0281fdc34
--- /dev/null
+++ b/src/transformers/models/mamba/modeling_mamba.py
@@ -0,0 +1,681 @@
+# coding=utf-8
+# Copyright 2024 state-spaces/mamba org and HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch MAMBA model."""
+
+import math
+from dataclasses import dataclass
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import CrossEntropyLoss
+
+from ...activations import ACT2FN
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+ ModelOutput,
+ add_code_sample_docstrings,
+ add_start_docstrings,
+ add_start_docstrings_to_model_forward,
+ logging,
+)
+from ...utils.import_utils import is_causal_conv1d_available, is_mamba_ssm_available
+from .configuration_mamba import MambaConfig
+
+
+logger = logging.get_logger(__name__)
+
+if is_mamba_ssm_available():
+ from mamba_ssm.ops.selective_scan_interface import mamba_inner_fn, selective_scan_fn
+ from mamba_ssm.ops.triton.selective_state_update import selective_state_update
+else:
+ selective_state_update, selective_scan_fn, mamba_inner_fn = None, None, None
+
+if is_causal_conv1d_available():
+ from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
+else:
+ causal_conv1d_update, causal_conv1d_fn = None, None
+
+is_fast_path_available = all(
+ (selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)
+)
+
+_CHECKPOINT_FOR_DOC = "ArthurZ/mamba-130m"
+_CONFIG_FOR_DOC = "MambaConfig"
+
+MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST = [] # See all Mamba models at https://huggingface.co/models?filter=mamba
+
+
+class MambaMixer(nn.Module):
+ """
+ Compute ∆, A, B, C, and D the state space parameters and compute the `contextualized_states`.
+ A, D are input independent (see Mamba paper [1] Section 3.5.2 "Interpretation of A" for why A isn't selective)
+ ∆, B, C are input-dependent (this is a key difference between Mamba and the linear time invariant S4,
+ and is why Mamba is called **selective** state spaces)
+ """
+
+ def __init__(self, config, layer_idx):
+ super().__init__()
+ self.hidden_size = config.hidden_size
+ self.ssm_state_size = config.state_size
+ self.conv_kernel_size = config.conv_kernel
+ self.intermediate_size = config.intermediate_size
+ self.time_step_rank = config.time_step_rank
+ self.layer_idx = layer_idx
+ self.use_conv_bias = config.use_conv_bias
+ self.conv1d = nn.Conv1d(
+ in_channels=self.intermediate_size,
+ out_channels=self.intermediate_size,
+ bias=config.use_conv_bias,
+ kernel_size=config.conv_kernel,
+ groups=self.intermediate_size,
+ padding=config.conv_kernel - 1,
+ )
+
+ self.activation = config.hidden_act
+ self.act = ACT2FN[config.hidden_act]
+
+ # projection of the input hidden states
+ self.in_proj = nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=config.use_bias)
+ # selective projection used to make dt, B and C input dependant
+ self.x_proj = nn.Linear(self.intermediate_size, self.time_step_rank + self.ssm_state_size * 2, bias=False)
+ # time step projection (discretization)
+ self.dt_proj = nn.Linear(self.time_step_rank, self.intermediate_size, bias=True)
+
+ # S4D real initialization. These are not discretized!
+ # The core is to load them, compute the discrete states, then write the updated state. Keeps the memory bounded
+ A = torch.arange(1, self.ssm_state_size + 1, dtype=torch.float32)[None, :]
+ A = A.expand(self.intermediate_size, -1).contiguous()
+
+ self.A_log = nn.Parameter(torch.log(A))
+ self.D = nn.Parameter(torch.ones(self.intermediate_size))
+ self.out_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.use_bias)
+ self.use_bias = config.use_bias
+
+ if not is_fast_path_available:
+ logger.warning_once(
+ "The fast path is not available because on of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)`"
+ " is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation and"
+ " https://github.com/Dao-AILab/causal-conv1d"
+ )
+
+ def cuda_kernels_forward(self, hidden_states: torch.Tensor, cache_params=None):
+ # 1. Gated MLP's linear projection
+ projected_states = self.in_proj(hidden_states).transpose(1, 2)
+
+ if self.training and cache_params is None: # Doesn't support outputting the states -> used for training
+ contextualized_states = mamba_inner_fn(
+ projected_states,
+ self.conv1d.weight,
+ self.conv1d.bias if self.use_conv_bias else None,
+ self.x_proj.weight,
+ self.dt_proj.weight,
+ self.out_proj.weight,
+ self.out_proj.bias.float() if self.use_bias else None,
+ -torch.exp(self.A_log.float()),
+ None, # input-dependent B
+ None, # input-dependent C
+ self.D.float(),
+ delta_bias=self.dt_proj.bias.float(),
+ delta_softplus=True,
+ )
+
+ else:
+ hidden_states, gate = projected_states.chunk(2, dim=1)
+
+ # 2. Convolution sequence transformation
+ conv_weights = self.conv1d.weight.view(self.conv1d.weight.size(0), self.conv1d.weight.size(2))
+ if cache_params is not None and cache_params.seqlen_offset > 0:
+ hidden_states = causal_conv1d_update(
+ hidden_states.squeeze(-1),
+ cache_params.conv_states[self.layer_idx],
+ conv_weights,
+ self.conv1d.bias,
+ self.activation,
+ )
+ hidden_states = hidden_states.unsqueeze(-1)
+ else:
+ if cache_params is not None:
+ conv_states = nn.functional.pad(
+ hidden_states, (self.conv_kernel_size - hidden_states.shape[-1], 0)
+ )
+ cache_params.conv_states[self.layer_idx].copy_(conv_states)
+ hidden_states = causal_conv1d_fn(
+ hidden_states, conv_weights, self.conv1d.bias, activation=self.activation
+ )
+
+ # 3. State Space Model sequence transformation
+ # 3.a. input varying initialization of time_step, B and C
+ ssm_parameters = self.x_proj(hidden_states.transpose(1, 2))
+ time_step, B, C = torch.split(
+ ssm_parameters, [self.time_step_rank, self.ssm_state_size, self.ssm_state_size], dim=-1
+ )
+ discrete_time_step = self.dt_proj.weight @ time_step.transpose(1, 2)
+
+ A = -torch.exp(self.A_log.float())
+ # 3.c perform the recurrence y ← SSM(A, B, C)(x)
+ time_proj_bias = self.dt_proj.bias.float() if hasattr(self.dt_proj, "bias") else None
+ if cache_params is not None and cache_params.seqlen_offset > 0:
+ scan_outputs = selective_state_update(
+ cache_params.ssm_states[self.layer_idx],
+ hidden_states[..., 0],
+ discrete_time_step[..., 0],
+ A,
+ B[:, 0],
+ C[:, 0],
+ self.D,
+ gate[..., 0],
+ time_proj_bias,
+ dt_softplus=True,
+ ).unsqueeze(-1)
+ else:
+ scan_outputs, ssm_state = selective_scan_fn(
+ hidden_states,
+ discrete_time_step,
+ A,
+ B.transpose(1, 2),
+ C.transpose(1, 2),
+ self.D.float(),
+ gate,
+ time_proj_bias,
+ delta_softplus=True,
+ return_last_state=True,
+ )
+ if ssm_state is not None and cache_params is not None:
+ cache_params.ssm_states[self.layer_idx].copy_(ssm_state)
+
+ # 4. Final linear projection
+ contextualized_states = self.out_proj(scan_outputs.transpose(1, 2))
+ return contextualized_states
+
+ # fmt: off
+ def slow_forward(self, input_states, cache_params=None):
+ batch_size, seq_len, _ = input_states.shape
+ dtype = input_states.dtype
+ # 1. Gated MLP's linear projection
+ projected_states = self.in_proj(input_states).transpose(1, 2) # [batch, 2 * intermediate_size, seq_len]
+ hidden_states, gate = projected_states.chunk(2, dim=1)
+
+ # 2. Convolution sequence transformation
+ if cache_params is not None:
+ ssm_state = cache_params.ssm_states[self.layer_idx]
+ if cache_params.seqlen_offset > 0:
+ conv_state = cache_params.conv_states[self.layer_idx] # [batch, intermediate_size, conv_kernel_size]
+ conv_state = torch.roll(conv_state, shifts=-1, dims=-1)
+ conv_state[:, :, -1] = hidden_states[:, :, 0]
+ cache_params.conv_states[self.layer_idx].copy_(conv_state)
+ hidden_states = torch.sum(conv_state * self.conv1d.weight[:, 0, :], dim=-1)
+ if self.use_conv_bias:
+ hidden_states += self.conv1d.bias
+ hidden_states = self.act(hidden_states).to(dtype).unsqueeze(-1) # [batch, intermediate_size, 1] : decoding
+ else:
+ conv_state = nn.functional.pad(
+ hidden_states,
+ (self.conv_kernel_size - hidden_states.shape[-1], 0)
+ )
+ cache_params.conv_states[self.layer_idx].copy_(conv_state)
+ hidden_states = self.act(self.conv1d(hidden_states)[..., :seq_len]) # [batch, intermediate_size, seq_len]
+ else:
+ ssm_state = torch.zeros(
+ (batch_size, self.intermediate_size, self.ssm_state_size),
+ device=hidden_states.device, dtype=dtype
+ )
+ hidden_states = self.act(self.conv1d(hidden_states)[..., :seq_len]) # [batch, intermediate_size, seq_len]
+
+ # 3. State Space Model sequence transformation
+ # 3.a. Selection: [batch, seq_len, self.time_step_rank + self.ssm_state_size * 2]
+ ssm_parameters = self.x_proj(hidden_states.transpose(1, 2))
+ time_step, B, C = torch.split(
+ ssm_parameters, [self.time_step_rank, self.ssm_state_size, self.ssm_state_size], dim=-1
+ )
+ discrete_time_step = self.dt_proj(time_step) # [batch, seq_len, intermediate_size]
+ discrete_time_step = nn.functional.softplus(discrete_time_step).transpose(1, 2) # [batch, intermediate_size, seq_len]
+
+ # 3.b. Discretization: B and C to [batch, seq_len, intermediate_size, ssm_state_size] (SRAM)
+ A = -torch.exp(self.A_log.float()) # [intermediate_size, ssm_state_size]
+ discrete_A = torch.exp(A[None, :, None, :] * discrete_time_step[:, :, :, None]) # [batch, intermediate_size, seq_len, ssm_state_size]
+ discrete_B = discrete_time_step[:, :, :, None] * B[:, None, :, :].float() # [batch, intermediate_size, seq_len, ssm_state_size]
+ deltaB_u = discrete_B * hidden_states[:, :, :, None].float()
+
+ # 3.c perform the recurrence y ← SSM(A, B, C)(x)
+ scan_outputs = []
+ for i in range(seq_len):
+ ssm_state = discrete_A[:, :, i, :] * ssm_state + deltaB_u[:, :, i, :] # [batch, intermediate_size, ssm_state_size]
+ scan_output = torch.matmul(ssm_state.to(dtype), C[:, i, :].unsqueeze(-1)) # [batch, intermediate_size, 1]
+ scan_outputs.append(scan_output[:, :, 0])
+ scan_output = torch.stack(scan_outputs, dim=-1) # [batch, intermediate_size, seq_len]
+ scan_output = scan_output + (hidden_states * self.D[None, :, None])
+ scan_output = (scan_output * self.act(gate))
+
+ if cache_params is not None:
+ cache_params.ssm_states[self.layer_idx].copy_(ssm_state)
+
+ # 4. Final linear projection
+ contextualized_states = self.out_proj(scan_output.transpose(1, 2)) # [batch, seq_len, hidden_size]
+ return contextualized_states
+ # fmt: on
+
+ def forward(self, hidden_states, cache_params=None):
+ if is_fast_path_available and "cuda" in self.x_proj.weight.device.type:
+ return self.cuda_kernels_forward(hidden_states, cache_params)
+ return self.slow_forward(hidden_states, cache_params)
+
+
+class MambaCache:
+ def __init__(self, config, batch_size, dtype=torch.float16, device=None):
+ self.seqlen_offset = 0
+ self.dtype = dtype
+ intermediate_size = config.intermediate_size
+ ssm_state_size = config.state_size
+ conv_kernel_size = config.conv_kernel
+
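+ # Per-layer rolling buffers:
+ # conv_states[i] holds the last `conv_kernel` inputs of the causal convolution, [batch, intermediate_size, conv_kernel]
+ # ssm_states[i] holds the recurrent SSM state, [batch, intermediate_size, ssm_state_size]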
+ self.conv_states = {
+ i: torch.zeros(batch_size, intermediate_size, conv_kernel_size, device=device, dtype=dtype)
+ for i in range(config.num_hidden_layers)
+ }
+ self.ssm_states = {
+ i: torch.zeros(batch_size, intermediate_size, ssm_state_size, device=device, dtype=dtype)
+ for i in range(config.num_hidden_layers)
+ }
+
+
+class MambaRMSNorm(nn.Module):
+ def __init__(self, hidden_size, eps=1e-6):
+ """
+ LlamaRMSNorm is equivalent to T5LayerNorm
+ """
+ super().__init__()
+ self.weight = nn.Parameter(torch.ones(hidden_size))
+ self.variance_epsilon = eps
+
+ def forward(self, hidden_states):
+ input_dtype = hidden_states.dtype
+ hidden_states = hidden_states.to(torch.float32)
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+ return self.weight * hidden_states.to(input_dtype)
+
+
+class MambaBlock(nn.Module):
+ def __init__(self, config, layer_idx):
+ super().__init__()
+ self.config = config
+ self.layer_idx = layer_idx
+ self.residual_in_fp32 = config.residual_in_fp32
+ self.norm = MambaRMSNorm(config.hidden_size, eps=config.layer_norm_epsilon)
+ self.mixer = MambaMixer(config, layer_idx=layer_idx)
+
+ def forward(self, hidden_states, cache_params=None):
+ residual = hidden_states
+ hidden_states = self.norm(hidden_states.to(dtype=self.norm.weight.dtype))
+ if self.residual_in_fp32:
+ residual = residual.to(torch.float32)
+
+ hidden_states = self.mixer(hidden_states, cache_params=cache_params)
+ hidden_states = residual + hidden_states
+ return hidden_states
+
+
+class MambaPreTrainedModel(PreTrainedModel):
+ """
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+ models.
+ """
+
+ config_class = MambaConfig
+ base_model_prefix = "backbone"
+ _no_split_modules = ["MambaBlock"]
+ supports_gradient_checkpointing = True
+
+ def _init_weights(self, module):
+ """Initialize the weights."""
+ if isinstance(module, MambaMixer):
+ module.A_log._no_weight_decay = True
+ module.D._no_weight_decay = True
+
+ dt_init_std = self.config.time_step_rank**-0.5 * self.config.time_step_scale
+ if self.config.time_step_init_scheme == "constant":
+ nn.init.constant_(module.dt_proj.weight, dt_init_std)
+ elif self.config.time_step_init_scheme == "random":
+ nn.init.uniform_(module.dt_proj.weight, -dt_init_std, dt_init_std)
+
+ dt = torch.exp(
+ torch.rand(self.config.intermediate_size)
+ * (math.log(self.config.time_step_max) - math.log(self.config.time_step_min))
+ + math.log(self.config.time_step_min)
+ ).clamp(min=self.config.time_step_floor)
+ # Inverse of softplus: https://github.com/pytorch/pytorch/issues/72759
+ inv_dt = dt + torch.log(-torch.expm1(-dt))
+ with torch.no_grad():
+ module.dt_proj.bias.copy_(inv_dt)
+ module.dt_proj.bias._no_reinit = True
+
+ if isinstance(module, nn.Linear):
+ if module.bias is not None:
+ if not getattr(module.bias, "_no_reinit", False):
+ nn.init.zeros_(module.bias)
+ elif isinstance(module, nn.Embedding):
+ nn.init.normal_(module.weight, std=self.config.initializer_range)
+
+ if self.config.rescale_prenorm_residual:
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
+ #
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
+ for name, p in module.named_parameters():
+ if name in ["out_proj.weight"]:
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
+ # We need to reinit p since this code could be called multiple times
+ # Having just p *= scale would repeatedly scale it down
+ nn.init.kaiming_uniform_(p, a=math.sqrt(5))
+ with torch.no_grad():
+ p /= math.sqrt(self.config.num_hidden_layers)
+
+
+@dataclass
+class MambaOutput(ModelOutput):
+ """
+ Class for the MAMBA model outputs.
+
+ Args:
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+ Sequence of hidden-states at the output of the last layer of the model.
+ cache_params (`MambaCache`):
+ The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
+ avoid providing the old `input_ids`.
+
+ Includes both the state space model state computed after the selective scan and the convolutional state.
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+ """
+
+ last_hidden_state: torch.FloatTensor = None
+ cache_params: Optional[List[torch.FloatTensor]] = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+
+
+@dataclass
+class MambaCausalLMOutput(ModelOutput):
+ """
+ Base class for causal language model (or autoregressive) outputs.
+
+ Args:
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
+ Language modeling loss (for next-token prediction).
+ logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+ cache_params (`MambaCache`):
+ The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
+ avoid providing the old `input_ids`.
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+ """
+
+ loss: Optional[torch.FloatTensor] = None
+ logits: torch.FloatTensor = None
+ cache_params: Optional[List[torch.FloatTensor]] = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+
+
+MAMBA_START_DOCSTRING = r"""
+
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
+ etc.)
+
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
+ and behavior.
+
+ Parameters:
+ config ([`MambaConfig`]): Model configuration class with all the parameters of the model.
+ Initializing with a config file does not load the weights associated with the model, only the
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+MAMBA_INPUTS_DOCSTRING = r"""
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`):
+ Indices of input sequence tokens in the vocabulary.
+
+ If `cache_params.seqlen_offset>0`, only `input_ids` that do not have their past calculated should be passed as
+ `input_ids`.
+
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+ [`PreTrainedTokenizer.__call__`] for details.
+
+ [What are input IDs?](../glossary#input-ids)
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+ model's internal embedding lookup matrix.
+ cache_params (`MambaCache`, *optional*):
+ If passed along, the model uses the previous state in all the blocks (which will give the output for the
+ `input_ids` provided, as if the model had seen `state_input_ids + input_ids` as context).
+ use_cache (`bool`, *optional*):
+ If set to `True`, the `cache_params` is returned and can be used to quickly generate the next logits.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+ "The bare MAMBA Model transformer outputting raw hidden-states without any specific head on top.",
+ MAMBA_START_DOCSTRING,
+)
+class MambaModel(MambaPreTrainedModel):
+ def __init__(self, config):
+ super().__init__(config)
+
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
+ self.layers = nn.ModuleList([MambaBlock(config, layer_idx=idx) for idx in range(config.num_hidden_layers)])
+
+ self.gradient_checkpointing = False
+ self.norm_f = MambaRMSNorm(config.hidden_size, eps=config.layer_norm_epsilon)
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.embeddings
+
+ def set_input_embeddings(self, new_embeddings):
+ self.embeddings = new_embeddings
+
+ @add_start_docstrings_to_model_forward(MAMBA_INPUTS_DOCSTRING)
+ @add_code_sample_docstrings(
+ checkpoint=_CHECKPOINT_FOR_DOC,
+ output_type=MambaOutput,
+ config_class=_CONFIG_FOR_DOC,
+ )
+ def forward(
+ self,
+ input_ids: Optional[torch.LongTensor] = None,
+ inputs_embeds: Optional[torch.LongTensor] = None,
+ cache_params: Optional[List[torch.FloatTensor]] = None,
+ use_cache: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ **kwargs, # `attention_mask` is passed by the tokenizer and we don't want it
+ ) -> Union[Tuple, MambaOutput]:
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ if (input_ids is None) ^ (inputs_embeds is not None): # ^ is python for xor
+ raise ValueError(
+ "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+ )
+
+ if inputs_embeds is None:
+ inputs_embeds = self.embeddings(input_ids)
+
+ if self.gradient_checkpointing and self.training and use_cache:
+ use_cache = False
+
+ if cache_params is None and use_cache:
+ cache_params = MambaCache(
+ self.config, inputs_embeds.size(0), device=inputs_embeds.device, dtype=inputs_embeds.dtype
+ )
+
+ hidden_states = inputs_embeds
+ all_hidden_states = () if output_hidden_states else None
+ for mixer_block in self.layers:
+ if self.gradient_checkpointing and self.training:
+ hidden_states = self._gradient_checkpointing_func(mixer_block.__call__, hidden_states, cache_params)
+ else:
+ hidden_states = mixer_block(hidden_states, cache_params=cache_params)
+
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (hidden_states,)
+
+ if use_cache:
+ cache_params.seqlen_offset += inputs_embeds.shape[1]
+
+ hidden_states = self.norm_f(hidden_states)
+
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (hidden_states,)
+
+ if not return_dict:
+ return tuple(v for v in [hidden_states, cache_params, all_hidden_states] if v is not None)
+
+ return MambaOutput(
+ last_hidden_state=hidden_states,
+ cache_params=cache_params if use_cache else None,
+ hidden_states=all_hidden_states,
+ )
+
+
+@add_start_docstrings(
+ """
+ The MAMBA Model transformer with a language modeling head on top (linear layer with weights tied to the input
+ embeddings).
+ """,
+ MAMBA_START_DOCSTRING,
+)
+class MambaForCausalLM(MambaPreTrainedModel):
+ _tied_weights_keys = ["lm_head.weight"]
+
+ def __init__(self, config):
+ super().__init__(config)
+ self.backbone = MambaModel(config)
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_output_embeddings(self):
+ return self.lm_head
+
+ def set_output_embeddings(self, new_embeddings):
+ self.lm_head = new_embeddings
+
+ def get_input_embeddings(self):
+ return self.backbone.get_input_embeddings()
+
+ def set_input_embeddings(self, new_embeddings):
+ return self.backbone.set_input_embeddings(new_embeddings)
+
+ def _update_model_kwargs_for_generation(
+ self, outputs: ModelOutput, model_kwargs: Dict[str, Any], **kwargs
+ ) -> Dict[str, Any]:
+ model_kwargs["cache_params"] = outputs["cache_params"]
+ return model_kwargs
+
+ def prepare_inputs_for_generation(
+ self, input_ids, cache_params=None, inputs_embeds=None, attention_mask=None, **kwargs
+ ):
+ # only last token for inputs_ids if the state is passed along.
+ if cache_params is not None:
+ input_ids = input_ids[:, -1].unsqueeze(-1)
+
+ if inputs_embeds is not None and cache_params is None:
+ model_inputs = {"inputs_embeds": inputs_embeds}
+ else:
+ model_inputs = {"input_ids": input_ids}
+
+ model_inputs["cache_params"] = cache_params
+ return model_inputs
+
+ @add_start_docstrings_to_model_forward(MAMBA_INPUTS_DOCSTRING)
+ @add_code_sample_docstrings(
+ checkpoint=_CHECKPOINT_FOR_DOC,
+ output_type=MambaCausalLMOutput,
+ config_class=_CONFIG_FOR_DOC,
+ )
+ def forward(
+ self,
+ input_ids: Optional[torch.LongTensor] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ cache_params: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ **kwargs, # for now we need this for generation
+ ) -> Union[Tuple, MambaCausalLMOutput]:
+ r"""
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
+ `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
+ are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
+ """
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ mamba_outputs = self.backbone(
+ input_ids,
+ cache_params=cache_params,
+ inputs_embeds=inputs_embeds,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+ hidden_states = mamba_outputs[0]
+
+ logits = self.lm_head(hidden_states.to(self.lm_head.weight.dtype)).float()
+
+ loss = None
+ if labels is not None:
+ # move labels to correct device to enable model parallelism
+ labels = labels.to(logits.device)
+ # Shift so that tokens < n predict n
+ shift_logits = logits[..., :-1, :].contiguous()
+ shift_labels = labels[..., 1:].contiguous()
+ # Flatten the tokens
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
+
+ if not return_dict:
+ output = (logits,) + mamba_outputs[1:]
+ return ((loss,) + output) if loss is not None else output
+
+ return MambaCausalLMOutput(
+ loss=loss,
+ logits=logits,
+ cache_params=mamba_outputs.cache_params,
+ hidden_states=mamba_outputs.hidden_states,
+ )
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index 8f7deb28327abc..f30bf7beddc163 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -5022,6 +5022,30 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
+MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class MambaForCausalLM(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class MambaModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class MambaPreTrainedModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
class MarianForCausalLM(metaclass=DummyObject):
_backends = ["torch"]
diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py
index 5ebb8396511558..db2278fc5f585c 100644
--- a/src/transformers/utils/import_utils.py
+++ b/src/transformers/utils/import_utils.py
@@ -307,6 +307,27 @@ def is_torch_cuda_available():
return False
+def is_mamba_ssm_available():
+ if is_torch_available():
+ import torch
+
+ if not torch.cuda.is_available():
+ return False
+ else:
+ return _is_package_available("mamba_ssm")
+ return False
+
+
+def is_causal_conv1d_available():
+ if is_torch_available():
+ import torch
+
+ if not torch.cuda.is_available():
+ return False
+ return _is_package_available("causal_conv1d")
+ return False
+
+
def is_torch_mps_available():
if is_torch_available():
import torch
diff --git a/tests/models/mamba/__init__.py b/tests/models/mamba/__init__.py
new file mode 100644
index 00000000000000..e69de29bb2d1d6
diff --git a/tests/models/mamba/test_modeling_mamba.py b/tests/models/mamba/test_modeling_mamba.py
new file mode 100644
index 00000000000000..503ffa0acd07a7
--- /dev/null
+++ b/tests/models/mamba/test_modeling_mamba.py
@@ -0,0 +1,491 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import math
+import unittest
+from typing import Dict, List, Tuple
+from unittest.util import safe_repr
+
+from parameterized import parameterized
+
+from transformers import AutoTokenizer, MambaConfig, is_torch_available
+from transformers.testing_utils import require_torch, require_torch_multi_gpu, slow, torch_device
+
+from ...generation.test_utils import GenerationTesterMixin
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, ids_tensor
+from ...test_pipeline_mixin import PipelineTesterMixin
+
+
+if is_torch_available():
+ import torch
+
+ from transformers import (
+ MambaForCausalLM,
+ MambaModel,
+ )
+ from transformers.models.mamba.modeling_mamba import MambaCache
+ from transformers.pytorch_utils import is_torch_greater_or_equal_than_2_0
+else:
+ is_torch_greater_or_equal_than_2_0 = False
+
+
+class MambaModelTester:
+ def __init__(
+ self,
+ parent,
+ batch_size=14,
+ seq_length=7,
+ is_training=True,
+ use_labels=True,
+ vocab_size=99,
+ hidden_size=32,
+ num_hidden_layers=2,
+ intermediate_size=32,
+ hidden_act="silu",
+ hidden_dropout_prob=0.1,
+ max_position_embeddings=512,
+ type_vocab_size=16,
+ type_sequence_label_size=2,
+ num_labels=3,
+ num_choices=4,
+ scope=None,
+ tie_word_embeddings=True,
+ ):
+ self.parent = parent
+ self.batch_size = batch_size
+ self.seq_length = seq_length
+ self.is_training = is_training
+ self.use_labels = use_labels
+ self.vocab_size = vocab_size
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.intermediate_size = intermediate_size
+ self.hidden_act = hidden_act
+ self.hidden_dropout_prob = hidden_dropout_prob
+ self.max_position_embeddings = max_position_embeddings
+ self.type_vocab_size = type_vocab_size
+ self.type_sequence_label_size = type_sequence_label_size
+ self.num_labels = num_labels
+ self.num_choices = num_choices
+ self.scope = scope
+ self.bos_token_id = vocab_size - 1
+ self.eos_token_id = vocab_size - 1
+ self.pad_token_id = vocab_size - 1
+ self.tie_word_embeddings = tie_word_embeddings
+
+ def get_large_model_config(self):
+ return MambaConfig.from_pretrained("hf-internal-testing/mamba-2.8b")
+
+ def prepare_config_and_inputs(
+ self, gradient_checkpointing=False, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False
+ ):
+ input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+ sequence_labels = None
+ token_labels = None
+ choice_labels = None
+ if self.use_labels:
+ sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+ token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+ choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+ config = self.get_config(
+ gradient_checkpointing=gradient_checkpointing,
+ scale_attn_by_inverse_layer_idx=scale_attn_by_inverse_layer_idx,
+ reorder_and_upcast_attn=reorder_and_upcast_attn,
+ )
+
+ return (
+ config,
+ input_ids,
+ None,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ )
+
+ def get_config(
+ self, gradient_checkpointing=False, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False
+ ):
+ return MambaConfig(
+ vocab_size=self.vocab_size,
+ hidden_size=self.hidden_size,
+ num_hidden_layers=self.num_hidden_layers,
+ intermediate_size=self.intermediate_size,
+ activation_function=self.hidden_act,
+ n_positions=self.max_position_embeddings,
+ type_vocab_size=self.type_vocab_size,
+ use_cache=True,
+ bos_token_id=self.bos_token_id,
+ eos_token_id=self.eos_token_id,
+ pad_token_id=self.pad_token_id,
+ gradient_checkpointing=gradient_checkpointing,
+ tie_word_embeddings=self.tie_word_embeddings,
+ )
+
+ def get_pipeline_config(self):
+ config = self.get_config()
+ config.vocab_size = 300
+ return config
+
+ def prepare_config_and_inputs_for_decoder(self):
+ (
+ config,
+            input_ids,
+            _,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ ) = self.prepare_config_and_inputs()
+
+ return (
+ config,
+ input_ids,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ )
+
+ def create_and_check_mamba_model(self, config, input_ids, *args):
+ config.output_hidden_states = True
+ model = MambaModel(config=config)
+ model.to(torch_device)
+ model.eval()
+
+ result = model(input_ids)
+
+ self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+ self.parent.assertEqual(len(result.hidden_states), config.num_hidden_layers + 1)
+
+    def create_and_check_causal_lm(self, config, input_ids, *args):
+ model = MambaForCausalLM(config)
+ model.to(torch_device)
+ model.eval()
+
+ result = model(input_ids, labels=input_ids)
+ self.parent.assertEqual(result.loss.shape, ())
+ self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
+
+ def create_and_check_state_equivalency(self, config, input_ids, *args):
+ model = MambaModel(config=config)
+ model.to(torch_device)
+ model.eval()
+
+ outputs = model(input_ids)
+ output_whole = outputs.last_hidden_state
+
+ outputs = model(input_ids[:, :-1], use_cache=True)
+ output_one = outputs.last_hidden_state
+
+ # Using the state computed on the first inputs, we will get the same output
+ outputs = model(input_ids[:, -1:], cache_params=outputs.cache_params)
+ output_two = outputs.last_hidden_state
+
+ self.parent.assertTrue(torch.allclose(torch.cat([output_one, output_two], dim=1), output_whole, atol=1e-5))
+        # TODO: the original Mamba implementation does not support decoding more than 1 token at a time, and neither do we
+
+ def create_and_check_forward_and_backwards(self, config, input_ids, *args, gradient_checkpointing=False):
+ model = MambaForCausalLM(config)
+ model.to(torch_device)
+ if gradient_checkpointing:
+ model.gradient_checkpointing_enable()
+
+ result = model(input_ids, labels=input_ids)
+ self.parent.assertEqual(result.loss.shape, ())
+ self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
+ result.loss.backward()
+
+ def prepare_config_and_inputs_for_common(self):
+ (
+ config,
+ input_ids,
+ _,
+ sequence_labels,
+ token_labels,
+ choice_labels,
+ ) = self.prepare_config_and_inputs()
+ inputs_dict = {"input_ids": input_ids}
+ return config, inputs_dict
+
+
+@unittest.skipIf(
+ not is_torch_greater_or_equal_than_2_0, reason="See https://github.com/huggingface/transformers/pull/24204"
+)
+@require_torch
+class MambaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ all_model_classes = (MambaModel, MambaForCausalLM) if is_torch_available() else ()
+ fx_compatible = False # FIXME let's try to support this @ArthurZucker
+ test_torchscript = False # FIXME let's try to support this @ArthurZucker
+ test_missing_keys = False
+ test_model_parallel = False
+ test_pruning = False
+ test_head_masking = False # Mamba does not have attention heads
+ pipeline_model_mapping = (
+ {"feature-extraction": MambaModel, "text-generation": MambaForCausalLM} if is_torch_available() else {}
+ )
+
+ def setUp(self):
+ self.model_tester = MambaModelTester(self)
+ self.config_tester = ConfigTester(
+ self, config_class=MambaConfig, n_embd=37, common_properties=["hidden_size", "num_hidden_layers"]
+ )
+
+ def assertInterval(self, member, container, msg=None):
+ r"""
+ Simple utility function to check if a member is inside an interval.
+ """
+ if isinstance(member, torch.Tensor):
+ max_value, min_value = member.max().item(), member.min().item()
+        elif isinstance(member, (list, tuple)):
+ max_value, min_value = max(member), min(member)
+
+        if not isinstance(container, (list, tuple)):
+ raise TypeError("container should be a list or tuple")
+ elif len(container) != 2:
+ raise ValueError("container should have 2 elements")
+
+ expected_min, expected_max = container
+
+ is_inside_interval = (min_value >= expected_min) and (max_value <= expected_max)
+
+ if not is_inside_interval:
+ standardMsg = "%s not found in %s" % (safe_repr(member), safe_repr(container))
+ self.fail(self._formatMessage(msg, standardMsg))
+
+ def test_config(self):
+ self.config_tester.run_common_tests()
+
+ @unittest.skip("No attention in mamba")
+ def test_retain_grad_hidden_states_attentions(self):
+ pass
+
+ @require_torch_multi_gpu
+ def test_multi_gpu_data_parallel_forward(self):
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ # some params shouldn't be scattered by nn.DataParallel
+ # so just remove them if they are present.
+ blacklist_non_batched_params = ["cache_params"]
+ for k in blacklist_non_batched_params:
+ inputs_dict.pop(k, None)
+
+        # move input tensors to cuda:0
+ for k, v in inputs_dict.items():
+ if torch.is_tensor(v):
+ inputs_dict[k] = v.to(0)
+
+ for model_class in self.all_model_classes:
+ model = model_class(config=config)
+ model.to(0)
+ model.eval()
+
+ # Wrap model in nn.DataParallel
+ model = torch.nn.DataParallel(model)
+ with torch.no_grad():
+ _ = model(**self._prepare_for_class(inputs_dict, model_class))
+
+ def test_mamba_model(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_mamba_model(*config_and_inputs)
+
+ def test_mamba_lm_head_model(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_causal_lm(*config_and_inputs)
+
+ def test_state_equivalency(self):
+ config_and_inputs = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_state_equivalency(*config_and_inputs)
+
+ def test_initialization(self):
+ config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+ for model_class in self.all_model_classes:
+ model = model_class(config=config)
+ for name, param in model.named_parameters():
+ if "dt_proj.bias" in name:
+ dt = torch.exp(
+ torch.tensor([0, 1]) * (math.log(config.time_step_max) - math.log(config.time_step_min))
+ + math.log(config.time_step_min)
+ ).clamp(min=config.time_step_floor)
+ inv_dt = dt + torch.log(-torch.expm1(-dt))
+ if param.requires_grad:
+ self.assertTrue(param.data.max().item() <= inv_dt[1])
+ self.assertTrue(param.data.min().item() >= inv_dt[0])
+ elif "A_log" in name:
+ A = torch.arange(1, config.state_size + 1, dtype=torch.float32)[None, :]
+ self.assertTrue(torch.allclose(param.data, torch.log(A), atol=1e-5, rtol=1e-5))
+ elif "D" in name:
+ if param.requires_grad:
+ # check if it's a ones like
+ self.assertTrue(torch.allclose(param.data, torch.ones_like(param.data), atol=1e-5, rtol=1e-5))
+
+ @unittest.skip("Mamba does not use attention")
+ def test_attention_outputs(self):
+        r"""
+        Overridden and skipped: Mamba has no attention, so there are no attention outputs to check;
+        its intermediate outputs are hidden states of shape `batch_size, seq_len, hidden_size`.
+        """
+ pass
+
+ @slow
+ def test_model_from_pretrained(self):
+ model = MambaModel.from_pretrained("hf-internal-testing/mamba-130m")
+ self.assertIsNotNone(model)
+
+ def test_model_outputs_equivalence(self):
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ def check_equivalence(model, tuple_inputs, dict_inputs, additional_kwargs={}):
+ with torch.no_grad():
+ tuple_output = model(**tuple_inputs, return_dict=False, **additional_kwargs)
+ dict_output = model(**dict_inputs, return_dict=True, **additional_kwargs).to_tuple()
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, MambaCache): # MODIFIED PART START
+ recursive_check(tuple_object.conv_states, dict_object.conv_states)
+ recursive_check(tuple_object.ssm_states, dict_object.ssm_states)
+ elif isinstance(tuple_object, (List, Tuple)): # MODIFIED PART END
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(
+ tuple_object.values(), dict_object.values()
+ ):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ torch.allclose(tuple_object, dict_object, atol=1e-5),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+                            f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object).any()}. Dict has"
+                            f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object).any()}."
+ ),
+ )
+
+ recursive_check(tuple_output, dict_output)
+
+ for model_class in self.all_model_classes:
+ model = model_class(config)
+ model.to(torch_device)
+ model.eval()
+
+ tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
+ dict_inputs = self._prepare_for_class(inputs_dict, model_class)
+ check_equivalence(model, tuple_inputs, dict_inputs)
+
+ tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
+ dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
+ check_equivalence(model, tuple_inputs, dict_inputs)
+
+ tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
+ dict_inputs = self._prepare_for_class(inputs_dict, model_class)
+ check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
+
+ tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
+ dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
+ check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
+
+
+@require_torch
+class MambaIntegrationTests(unittest.TestCase):
+ def setUp(self):
+ self.model_id = "ArthurZ/mamba-2.8b"
+ self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
+
+ @parameterized.expand([(torch_device,), ("cpu",)])
+ def test_simple_generate(self, device):
+ tokenizer = AutoTokenizer.from_pretrained("ArthurZ/mamba-130m")
+ tokenizer.pad_token = tokenizer.eos_token
+
+ model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", torch_dtype=torch.float16)
+ model.to(device)
+ model.config.use_cache = True
+ input_ids = tokenizer("Hey how are you doing?", return_tensors="pt")["input_ids"].to(device)
+
+ out = model.generate(input_ids, do_sample=False, max_new_tokens=10)
+ output_sentence = tokenizer.decode(out[0, :])
+ self.assertEqual(output_sentence, "Hey how are you doing?\n\nI'm so glad you're here.")
+
+ with torch.no_grad():
+ logits = model(input_ids=input_ids).logits
+
+ EXPECTED_LOGITS_NO_GRAD = torch.tensor(
+ [
+ -55.6875, -69.8750, -49.9062, -51.7500, -57.6875, -57.9375, -56.9688,
+ -57.9375, -54.6875, -55.9375, -55.3125, -58.0938, -60.5625, -47.0000,
+ -52.0312, -49.7812, -55.9375, -57.9062, -56.7812, -57.1250, -57.3438,
+ -58.3125, -57.8125, -58.7812, -59.6250, -59.0938, -58.7188, -52.9375,
+ -53.4688, -57.3750, -56.9375, -55.7500, -53.3125, -55.8438, -57.0000,
+ -56.9062, -56.2188, -54.7188, -56.4375, -57.5000
+ ]
+ ,dtype=torch.float32) # fmt: skip
+
+ torch.testing.assert_close(logits[0, 0, :40].cpu(), EXPECTED_LOGITS_NO_GRAD, rtol=1e-3, atol=1e-3)
+
+ @parameterized.expand([(torch_device,), ("cpu",)])
+ def test_simple_generate_cuda_kernels_tiny(self, device):
+ expected_output = "Hello my name is John and I am a newbie to the world"
+
+ input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
+ model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", torch_dtype=torch.float16).to(device)
+
+ output = model.generate(input_ids, max_new_tokens=10)
+ output_sentence = self.tokenizer.decode(output[0].tolist())
+
+ self.assertEqual(output_sentence, expected_output)
+
+ @parameterized.expand([(torch_device,), ("cpu",)])
+ @slow
+ def test_simple_generate_cuda_kernels_small(self, device):
+ expected_output = "Hello my name is\n\nI am a\n\nI am a"
+
+ input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
+ model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-790m", torch_dtype=torch.float16).to(device)
+
+ output = model.generate(input_ids, max_new_tokens=10)
+ output_sentence = self.tokenizer.decode(output[0].tolist())
+
+ self.assertEqual(output_sentence, expected_output)
+
+ @parameterized.expand([(torch_device,), ("cpu",)])
+ @slow
+ def test_simple_generate_cuda_kernels_mid(self, device):
+ expected_output = "Hello my name is John and I am a\n\nI am a single father of a beautiful daughter. I am a"
+
+ input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
+ model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-1.4b", torch_dtype=torch.float16).to(device)
+
+ output = model.generate(input_ids, max_new_tokens=20)
+ output_sentence = self.tokenizer.decode(output[0].tolist())
+
+ self.assertEqual(output_sentence, expected_output)
+
+ @parameterized.expand([(torch_device,), ("cpu",)])
+ @slow
+ def test_simple_generate_cuda_kernels_big(self, device):
+ expected_output = "Hello my name is John and I am a new member of this forum. I am a retired Marine and I am a member of the Marine Corps League. I am a"
+
+ input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
+ model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-2.8b", torch_dtype=torch.float16).to(device)
+
+ output = model.generate(input_ids, max_new_tokens=30)
+ output_sentence = self.tokenizer.decode(output[0].tolist())
+
+ self.assertEqual(output_sentence, expected_output)
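
The `create_and_check_state_equivalency` test above captures the contract the cache relies on: running the prompt once and then feeding the final token with the returned `cache_params` must match a single full-length pass. A small sketch of that usage outside the test harness — the checkpoint id is the one used in the integration tests above and may not be the final published repo:

```python
import torch
from transformers import AutoTokenizer, MambaModel

tokenizer = AutoTokenizer.from_pretrained("ArthurZ/mamba-130m")
model = MambaModel.from_pretrained("ArthurZ/mamba-130m").eval()

input_ids = tokenizer("Hey how are you doing?", return_tensors="pt").input_ids

with torch.no_grad():
    # First pass over everything but the last token, asking for the cache back.
    prefix = model(input_ids[:, :-1], use_cache=True)
    # Second pass feeds only the last token, reusing the conv/SSM states from the prefix.
    step = model(input_ids[:, -1:], cache_params=prefix.cache_params)

print(step.last_hidden_state.shape)  # (batch, 1, hidden_size)
```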
diff --git a/utils/check_config_attributes.py b/utils/check_config_attributes.py
index fae3ed8da0b4ef..140cd560e03758 100644
--- a/utils/check_config_attributes.py
+++ b/utils/check_config_attributes.py
@@ -34,6 +34,8 @@
SPECIAL_CASES_TO_ALLOW = {
# used to compute the property `self.chunk_length`
"EncodecConfig": ["overlap"],
+    # used in the config to define `intermediate_size`
+ "MambaConfig": ["expand"],
# used as `self.bert_model = BertModel(config, ...)`
"DPRConfig": True,
"FuyuConfig": True,
From 8f3f8e6766be33a15031a8844a9223d1cb04e820 Mon Sep 17 00:00:00 2001
From: AleksanderWWW
Date: Tue, 5 Mar 2024 12:54:00 +0100
Subject: [PATCH 089/549] Fix bug with passing capture_* args to neptune
callback (#29041)
* Fix bug with passing capture_* args to neptune callback
* ruff happy?
* instantiate (frozen)set only once
* code review
* code review 2
* ruff happy?
* code review
---
src/transformers/integrations/integration_utils.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/src/transformers/integrations/integration_utils.py b/src/transformers/integrations/integration_utils.py
index 05c864fb4be3d8..65642039da7395 100644
--- a/src/transformers/integrations/integration_utils.py
+++ b/src/transformers/integrations/integration_utils.py
@@ -1265,7 +1265,9 @@ def _initialize_run(self, **additional_neptune_kwargs):
self._stop_run_if_exists()
try:
- self._run = init_run(**self._init_run_kwargs, **additional_neptune_kwargs)
+ run_params = additional_neptune_kwargs.copy()
+ run_params.update(self._init_run_kwargs)
+ self._run = init_run(**run_params)
self._run_id = self._run["sys/id"].fetch()
except (NeptuneMissingProjectNameException, NeptuneMissingApiTokenException) as e:
raise NeptuneMissingConfiguration() from e
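
The one-line change above decides which side wins when the same keyword reaches `init_run` twice: previously `init_run(**self._init_run_kwargs, **additional_neptune_kwargs)` would raise a `TypeError` for duplicated keys, whereas the merged dict lets the callback's own `capture_*` settings take precedence. A minimal sketch of the pattern with hypothetical keys (not necessarily Neptune's real argument names):

```python
# Settings configured explicitly on the callback should win over incidental defaults.
init_run_kwargs = {"capture_stdout": False, "project": "my-workspace/my-project"}
additional_neptune_kwargs = {"capture_stdout": True, "tags": ["run-1"]}

run_params = additional_neptune_kwargs.copy()
run_params.update(init_run_kwargs)  # callback-level kwargs override duplicates

print(run_params)
# {'capture_stdout': False, 'tags': ['run-1'], 'project': 'my-workspace/my-project'}
```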
From 9c5e560924c4868033cc9ac650bee80dc4049158 Mon Sep 17 00:00:00 2001
From: Logan Adams <114770087+loadams@users.noreply.github.com>
Date: Tue, 5 Mar 2024 04:23:34 -0800
Subject: [PATCH 090/549] Update pytest `import_path` location (#29154)
* Update to pull function from proper lib
* Fix ruff formatting error
* Remove accidently added file
---
src/transformers/testing_utils.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py
index fd5974d8d5bb54..adcadfc379251e 100644
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -137,9 +137,9 @@
_is_mocked,
_patch_unwrap_mock_aware,
get_optionflags,
- import_path,
)
from _pytest.outcomes import skip
+ from _pytest.pathlib import import_path
from pytest import DoctestItem
else:
Module = object
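
Recent pytest releases no longer expose `import_path` through the doctest internals that the surrounding imports come from, which is why the import moves to `_pytest.pathlib`. The patch simply switches to the new location; a more defensive, version-tolerant variant (shown only as an illustration, not what the patch does) would be:

```python
try:
    # Current pytest keeps the helper in its pathlib module.
    from _pytest.pathlib import import_path
except ImportError:
    # Older releases re-exported it from the doctest internals.
    from _pytest.doctest import import_path
```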
From a69cbf4e64c7bc054d814d64f6877180f7cd3a25 Mon Sep 17 00:00:00 2001
From: Lysandre Debut
Date: Tue, 5 Mar 2024 13:37:55 +0100
Subject: [PATCH 091/549] Automatic safetensors conversion when lacking these
files (#29390)
* Automatic safetensors conversion when lacking these files
* Remove debug
* Thread name
* Typo
* Ensure that raises do not affect the main thread
---
src/transformers/modeling_utils.py | 37 +++++++++++++++++++++--
tests/test_modeling_utils.py | 48 +++++++++++++++++++++++++++++-
2 files changed, 81 insertions(+), 4 deletions(-)
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 7bda8a20165b5e..b542307794168d 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -29,6 +29,7 @@
from contextlib import contextmanager
from dataclasses import dataclass
from functools import partial, wraps
+from threading import Thread
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from zipfile import is_zipfile
@@ -3207,9 +3208,39 @@ def from_pretrained(
)
if resolved_archive_file is not None:
is_sharded = True
- if resolved_archive_file is None:
- # Otherwise, maybe there is a TF or Flax model file. We try those to give a helpful error
- # message.
+
+ if resolved_archive_file is not None:
+ if filename in [WEIGHTS_NAME, WEIGHTS_INDEX_NAME]:
+ # If the PyTorch file was found, check if there is a safetensors file on the repository
+                    # If there is no safetensors file in the repository, start an auto conversion
+ safe_weights_name = SAFE_WEIGHTS_INDEX_NAME if is_sharded else SAFE_WEIGHTS_NAME
+ has_file_kwargs = {
+ "revision": revision,
+ "proxies": proxies,
+ "token": token,
+ }
+ cached_file_kwargs = {
+ "cache_dir": cache_dir,
+ "force_download": force_download,
+ "resume_download": resume_download,
+ "local_files_only": local_files_only,
+ "user_agent": user_agent,
+ "subfolder": subfolder,
+ "_raise_exceptions_for_gated_repo": False,
+ "_raise_exceptions_for_missing_entries": False,
+ "_commit_hash": commit_hash,
+ **has_file_kwargs,
+ }
+ if not has_file(pretrained_model_name_or_path, safe_weights_name, **has_file_kwargs):
+ Thread(
+ target=auto_conversion,
+ args=(pretrained_model_name_or_path,),
+ kwargs=cached_file_kwargs,
+ name="Thread-autoconversion",
+ ).start()
+ else:
+ # Otherwise, no PyTorch file was found, maybe there is a TF or Flax model file.
+ # We try those to give a helpful error message.
has_file_kwargs = {
"revision": revision,
"proxies": proxies,
diff --git a/tests/test_modeling_utils.py b/tests/test_modeling_utils.py
index 0d52e5a87bed35..a334cb0f2853b5 100755
--- a/tests/test_modeling_utils.py
+++ b/tests/test_modeling_utils.py
@@ -20,6 +20,7 @@
import os.path
import sys
import tempfile
+import threading
import unittest
import unittest.mock as mock
import uuid
@@ -1428,7 +1429,7 @@ def test_safetensors_on_the_fly_wrong_user_opened_pr(self):
bot_opened_pr_title = None
for discussion in discussions:
- if discussion.author == "SFconvertBot":
+ if discussion.author == "SFconvertbot":
bot_opened_pr = True
bot_opened_pr_title = discussion.title
@@ -1451,6 +1452,51 @@ def test_safetensors_on_the_fly_specific_revision(self):
with self.assertRaises(EnvironmentError):
BertModel.from_pretrained(self.repo_name, use_safetensors=True, token=self.token, revision="new-branch")
+ def test_absence_of_safetensors_triggers_conversion(self):
+ config = BertConfig(
+ vocab_size=99, hidden_size=32, num_hidden_layers=5, num_attention_heads=4, intermediate_size=37
+ )
+ initial_model = BertModel(config)
+
+ # Push a model on `main`
+ initial_model.push_to_hub(self.repo_name, token=self.token, safe_serialization=False)
+
+ # Download the model that doesn't have safetensors
+ BertModel.from_pretrained(self.repo_name, token=self.token)
+
+ for thread in threading.enumerate():
+ if thread.name == "Thread-autoconversion":
+ thread.join(timeout=10)
+
+ with self.subTest("PR was open with the safetensors account"):
+ discussions = self.api.get_repo_discussions(self.repo_name)
+
+ bot_opened_pr = None
+ bot_opened_pr_title = None
+
+ for discussion in discussions:
+ if discussion.author == "SFconvertbot":
+ bot_opened_pr = True
+ bot_opened_pr_title = discussion.title
+
+ self.assertTrue(bot_opened_pr)
+ self.assertEqual(bot_opened_pr_title, "Adding `safetensors` variant of this model")
+
+ @mock.patch("transformers.safetensors_conversion.spawn_conversion")
+ def test_absence_of_safetensors_triggers_conversion_failed(self, spawn_conversion_mock):
+ spawn_conversion_mock.side_effect = HTTPError()
+
+ config = BertConfig(
+ vocab_size=99, hidden_size=32, num_hidden_layers=5, num_attention_heads=4, intermediate_size=37
+ )
+ initial_model = BertModel(config)
+
+ # Push a model on `main`
+ initial_model.push_to_hub(self.repo_name, token=self.token, safe_serialization=False)
+
+ # The auto conversion is mocked to always raise; ensure that it doesn't raise in the main thread
+ BertModel.from_pretrained(self.repo_name, token=self.token)
+
@require_torch
@is_staging_test
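
The conversion is launched on a named, fire-and-forget thread precisely so that a failure (mocked above via `spawn_conversion_mock.side_effect = HTTPError()`) never propagates into `from_pretrained`, while the happy-path test can still locate the worker by name and wait for it. A stripped-down sketch of that pattern with a stand-in task:

```python
import threading


def fallible_background_task(repo_id: str) -> None:
    # Stand-in for the safetensors auto-conversion; an exception here stays on this thread.
    raise RuntimeError(f"conversion failed for {repo_id}")


threading.Thread(
    target=fallible_background_task,
    args=("user/some-model",),
    name="Thread-autoconversion",
).start()

# The main thread is unaffected by the failure above; callers that care (e.g. tests)
# can still find the worker by its well-known name and wait for it to finish.
for thread in threading.enumerate():
    if thread.name == "Thread-autoconversion":
        thread.join(timeout=10)
print("main thread continues")
```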
From 638c423c89a7996dd5508f228ac2943a743673de Mon Sep 17 00:00:00 2001
From: Michael
Date: Wed, 6 Mar 2024 01:19:00 +0800
Subject: [PATCH 092/549] [i18n-zh] Translate add_new_pipeline.md into Chinese
(#29432)
* [i18n-zh] Translate add_new_pipeline.md into Chinese
* apply suggestions from Fan-Lin
---
docs/source/zh/_toctree.yml | 2 +
docs/source/zh/add_new_pipeline.md | 238 +++++++++++++++++++++++++++++
2 files changed, 240 insertions(+)
create mode 100644 docs/source/zh/add_new_pipeline.md
diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml
index f81f264655ea0d..517033cad562a2 100644
--- a/docs/source/zh/_toctree.yml
+++ b/docs/source/zh/_toctree.yml
@@ -74,6 +74,8 @@
- sections:
- local: contributing
title: 如何为 🤗 Transformers 做贡献?
+ - local: add_new_pipeline
+ title: 如何将流水线添加到 🤗 Transformers?
title: 贡献
- sections:
- local: task_summary
diff --git a/docs/source/zh/add_new_pipeline.md b/docs/source/zh/add_new_pipeline.md
new file mode 100644
index 00000000000000..57fd53636b0a13
--- /dev/null
+++ b/docs/source/zh/add_new_pipeline.md
@@ -0,0 +1,238 @@
+
+
+# 如何创建自定义流水线?
+
+在本指南中,我们将演示如何创建一个自定义流水线并分享到 [Hub](https://hf.co/models),或将其添加到 🤗 Transformers 库中。
+
+首先,你需要决定流水线将能够接受的原始条目。它可以是字符串、原始字节、字典或任何看起来最可能是期望的输入。
+尽量保持输入为纯 Python 语言,因为这样可以更容易地实现兼容性(甚至通过 JSON 在其他语言之间)。
+这些将是流水线 (`preprocess`) 的 `inputs`。
+
+然后定义 `outputs`。与 `inputs` 相同的策略。越简单越好。这些将是 `postprocess` 方法的输出。
+
+首先继承基类 `Pipeline`,其中包含实现 `preprocess`、`_forward`、`postprocess` 和 `_sanitize_parameters` 所需的 4 个方法。
+
+```python
+from transformers import Pipeline
+
+
+class MyPipeline(Pipeline):
+ def _sanitize_parameters(self, **kwargs):
+ preprocess_kwargs = {}
+ if "maybe_arg" in kwargs:
+ preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
+ return preprocess_kwargs, {}, {}
+
+ def preprocess(self, inputs, maybe_arg=2):
+ model_input = Tensor(inputs["input_ids"])
+ return {"model_input": model_input}
+
+ def _forward(self, model_inputs):
+ # model_inputs == {"model_input": model_input}
+ outputs = self.model(**model_inputs)
+ # Maybe {"logits": Tensor(...)}
+ return outputs
+
+ def postprocess(self, model_outputs):
+ best_class = model_outputs["logits"].softmax(-1)
+ return best_class
+```
+
+这种分解的结构旨在为 CPU/GPU 提供相对无缝的支持,同时支持在不同线程上对 CPU 进行预处理/后处理。
+
+`preprocess` 将接受最初定义的输入,并将其转换为可供模型输入的内容。它可能包含更多信息,通常是一个 `Dict`。
+
+`_forward` 是实现细节,不应直接调用。`forward` 是首选的调用方法,因为它包含保障措施,以确保一切都在预期的设备上运作。
+如果任何内容与实际模型相关,它应该属于 `_forward` 方法,其他内容应该在 preprocess/postprocess 中。
+
+`postprocess` 方法将接受 `_forward` 的输出,并将其转换为之前确定的最终输出。
+
+`_sanitize_parameters` 存在是为了允许用户在任何时候传递任何参数,无论是在初始化时 `pipeline(...., maybe_arg=4)`
+还是在调用时 `pipe = pipeline(...); output = pipe(...., maybe_arg=4)`。
+
+`_sanitize_parameters` 的返回值是将直接传递给 `preprocess`、`_forward` 和 `postprocess` 的 3 个关键字参数字典。
+如果调用方没有使用任何额外参数调用,则不要填写任何内容。这样可以保留函数定义中的默认参数,这总是更"自然"的。
+
+在分类任务中,一个经典的例子是在后处理中使用 `top_k` 参数。
+
+```python
+>>> pipe = pipeline("my-new-task")
+>>> pipe("This is a test")
+[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
+{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]
+
+>>> pipe("This is a test", top_k=2)
+[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]
+```
+
+为了实现这一点,我们将更新我们的 `postprocess` 方法,将默认参数设置为 `5`,
+并编辑 `_sanitize_parameters` 方法,以允许这个新参数。
+
+```python
+def postprocess(self, model_outputs, top_k=5):
+ best_class = model_outputs["logits"].softmax(-1)
+ # Add logic to handle top_k
+ return best_class
+
+
+def _sanitize_parameters(self, **kwargs):
+ preprocess_kwargs = {}
+ if "maybe_arg" in kwargs:
+ preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
+
+ postprocess_kwargs = {}
+ if "top_k" in kwargs:
+ postprocess_kwargs["top_k"] = kwargs["top_k"]
+ return preprocess_kwargs, {}, postprocess_kwargs
+```
+
+尽量保持简单输入/输出,最好是可 JSON 序列化的,因为这样可以使流水线的使用非常简单,而不需要用户了解新的对象类型。
+通常也相对常见地支持许多不同类型的参数以便使用(例如音频文件,可以是文件名、URL 或纯字节)。
+
+## 将其添加到支持的任务列表中
+
+要将你的 `new-task` 注册到支持的任务列表中,你需要将其添加到 `PIPELINE_REGISTRY` 中:
+
+```python
+from transformers.pipelines import PIPELINE_REGISTRY
+
+PIPELINE_REGISTRY.register_pipeline(
+ "new-task",
+ pipeline_class=MyPipeline,
+ pt_model=AutoModelForSequenceClassification,
+)
+```
+
+如果需要,你可以指定一个默认模型,此时它应该带有一个特定的修订版本(可以是分支名称或提交哈希,这里我们使用了 `"abcdef"`),以及类型:
+
+```python
+PIPELINE_REGISTRY.register_pipeline(
+ "new-task",
+ pipeline_class=MyPipeline,
+ pt_model=AutoModelForSequenceClassification,
+ default={"pt": ("user/awesome_model", "abcdef")},
+ type="text", # current support type: text, audio, image, multimodal
+)
+```
+
+## 在 Hub 上分享你的流水线
+
+要在 Hub 上分享你的自定义流水线,你只需要将 `Pipeline` 子类的自定义代码保存在一个 Python 文件中。
+例如,假设我们想使用一个自定义流水线进行句对分类,如下所示:
+
+```py
+import numpy as np
+
+from transformers import Pipeline
+
+
+def softmax(outputs):
+ maxes = np.max(outputs, axis=-1, keepdims=True)
+ shifted_exp = np.exp(outputs - maxes)
+ return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)
+
+
+class PairClassificationPipeline(Pipeline):
+ def _sanitize_parameters(self, **kwargs):
+ preprocess_kwargs = {}
+ if "second_text" in kwargs:
+ preprocess_kwargs["second_text"] = kwargs["second_text"]
+ return preprocess_kwargs, {}, {}
+
+ def preprocess(self, text, second_text=None):
+ return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework)
+
+ def _forward(self, model_inputs):
+ return self.model(**model_inputs)
+
+ def postprocess(self, model_outputs):
+ logits = model_outputs.logits[0].numpy()
+ probabilities = softmax(logits)
+
+ best_class = np.argmax(probabilities)
+ label = self.model.config.id2label[best_class]
+ score = probabilities[best_class].item()
+ logits = logits.tolist()
+ return {"label": label, "score": score, "logits": logits}
+```
+
+这个实现与框架无关,适用于 PyTorch 和 TensorFlow 模型。如果我们将其保存在一个名为
+`pair_classification.py` 的文件中,然后我们可以像这样导入并注册它:
+
+```py
+from pair_classification import PairClassificationPipeline
+from transformers.pipelines import PIPELINE_REGISTRY
+from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
+
+PIPELINE_REGISTRY.register_pipeline(
+ "pair-classification",
+ pipeline_class=PairClassificationPipeline,
+ pt_model=AutoModelForSequenceClassification,
+ tf_model=TFAutoModelForSequenceClassification,
+)
+```
+
+完成这些步骤后,我们可以将其与预训练模型一起使用。例如,`sgugger/finetuned-bert-mrpc`
+已经在 MRPC 数据集上进行了微调,用于将句子对分类为是释义或不是释义。
+
+```py
+from transformers import pipeline
+
+classifier = pipeline("pair-classification", model="sgugger/finetuned-bert-mrpc")
+```
+
+然后,我们可以通过在 `Repository` 中使用 `save_pretrained` 方法将其分享到 Hub 上:
+
+```py
+from huggingface_hub import Repository
+
+repo = Repository("test-dynamic-pipeline", clone_from="{your_username}/test-dynamic-pipeline")
+classifier.save_pretrained("test-dynamic-pipeline")
+repo.push_to_hub()
+```
+
+这将会复制包含你定义的 `PairClassificationPipeline` 的文件到文件夹 `"test-dynamic-pipeline"` 中,
+同时保存流水线的模型和分词器,然后将所有内容推送到仓库 `{your_username}/test-dynamic-pipeline` 中。
+之后,只要提供选项 `trust_remote_code=True`,任何人都可以使用它:
+
+```py
+from transformers import pipeline
+
+classifier = pipeline(model="{your_username}/test-dynamic-pipeline", trust_remote_code=True)
+```
+
+## 将流水线添加到 🤗 Transformers
+
+如果你想将你的流水线贡献给 🤗 Transformers,你需要在 `pipelines` 子模块中添加一个新模块,
+其中包含你的流水线的代码,然后将其添加到 `pipelines/__init__.py` 中定义的任务列表中。
+
+然后,你需要添加测试。创建一个新文件 `tests/test_pipelines_MY_PIPELINE.py`,其中包含其他测试的示例。
+
+`run_pipeline_test` 函数将非常通用,并在每种可能的架构上运行小型随机模型,如 `model_mapping` 和 `tf_model_mapping` 所定义。
+
+这对于测试未来的兼容性非常重要,这意味着如果有人为 `XXXForQuestionAnswering` 添加了一个新模型,
+流水线测试将尝试在其上运行。由于模型是随机的,所以不可能检查实际值,这就是为什么有一个帮助函数 `ANY`,它只是尝试匹配流水线的输出类型。
+
+你还 **需要** 实现 2(最好是 4)个测试。
+
+- `test_small_model_pt`:为这个流水线定义一个小型模型(结果是否合理并不重要),并测试流水线的输出。
+ 结果应该与 `test_small_model_tf` 的结果相同。
+- `test_small_model_tf`:为这个流水线定义一个小型模型(结果是否合理并不重要),并测试流水线的输出。
+ 结果应该与 `test_small_model_pt` 的结果相同。
+- `test_large_model_pt`(可选):在一个真实的流水线上测试流水线,结果应该是有意义的。
+ 这些测试速度较慢,应该被如此标记。这里的目标是展示流水线,并确保在未来的发布中没有漂移。
+- `test_large_model_tf`(可选):在一个真实的流水线上测试流水线,结果应该是有意义的。
+ 这些测试速度较慢,应该被如此标记。这里的目标是展示流水线,并确保在未来的发布中没有漂移。
From 7b01579f73a216ddfdbcbe9c5b5c2b1f4dc4d10f Mon Sep 17 00:00:00 2001
From: AI4Harmony <160417616+AI4Harmony@users.noreply.github.com>
Date: Wed, 6 Mar 2024 08:47:33 +0900
Subject: [PATCH 093/549] =?UTF-8?q?=F0=9F=8C=90=20[i18n-KO]=20Translated?=
=?UTF-8?q?=20generation=5Fstrategies.md=20to=20Korean=20(#29086)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* Update ko _toctree.yml
* Create ko: generation_strategies.md
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Jungnerd <46880056+jungnerd@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Jungnerd <46880056+jungnerd@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Jungnerd <46880056+jungnerd@users.noreply.github.com>
---
docs/source/ko/_toctree.yml | 4 +-
docs/source/ko/generation_strategies.md | 337 ++++++++++++++++++++++++
2 files changed, 339 insertions(+), 2 deletions(-)
create mode 100644 docs/source/ko/generation_strategies.md
diff --git a/docs/source/ko/_toctree.yml b/docs/source/ko/_toctree.yml
index f7a5f640107526..e955fae4ea9c3f 100644
--- a/docs/source/ko/_toctree.yml
+++ b/docs/source/ko/_toctree.yml
@@ -87,8 +87,8 @@
title: 🤗 Tokenizers 라이브러리에서 토크나이저 사용하기
- local: multilingual
title: 다국어 모델 추론하기
- - local: in_translation
- title: (번역중) Customize text generation strategy
+ - local: generation_strategies
+ title: 텍스트 생성 전략 사용자 정의
- local: create_a_model
title: 모델별 API 사용하기
- local: custom_models
diff --git a/docs/source/ko/generation_strategies.md b/docs/source/ko/generation_strategies.md
new file mode 100644
index 00000000000000..fd7b9bf905aa0a
--- /dev/null
+++ b/docs/source/ko/generation_strategies.md
@@ -0,0 +1,337 @@
+
+
+# Text generation strategies[[text-generation-strategies]]
+
+텍스트 생성은 개방형 텍스트 작성, 요약, 번역 등 다양한 자연어 처리(NLP) 작업에 필수적입니다. 이는 또한 음성-텍스트 변환, 시각-텍스트 변환과 같이 텍스트를 출력으로 하는 여러 혼합 모달리티 응용 프로그램에서도 중요한 역할을 합니다. 텍스트 생성을 가능하게 하는 몇몇 모델로는 GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, Whisper 등이 있습니다.
+
+
+[`~transformers.generation_utils.GenerationMixin.generate`] 메서드를 활용하여 다음과 같은 다양한 작업들에 대해 텍스트 결과물을 생성하는 몇 가지 예시를 살펴보세요:
+* [텍스트 요약](./tasks/summarization#inference)
+* [이미지 캡셔닝](./model_doc/git#transformers.GitForCausalLM.forward.example)
+* [오디오 전사](./model_doc/whisper#transformers.WhisperForConditionalGeneration.forward.example)
+
+generate 메소드에 입력되는 값들은 모델의 데이터 형태에 따라 달라집니다. 이 값들은 AutoTokenizer나 AutoProcessor와 같은 모델의 전처리 클래스에 의해 반환됩니다. 모델의 전처리 장치가 하나 이상의 입력 유형을 생성하는 경우, 모든 입력을 generate()에 전달해야 합니다. 각 모델의 전처리 장치에 대해서는 해당 모델의 문서에서 자세히 알아볼 수 있습니다.
+
+텍스트를 생성하기 위해 출력 토큰을 선택하는 과정을 디코딩이라고 하며, `generate()` 메소드가 사용할 디코딩 전략을 사용자가 커스터마이징할 수 있습니다. 디코딩 전략을 수정하는 것은 훈련 가능한 매개변수의 값들을 변경하지 않지만, 생성된 출력의 품질에 눈에 띄는 영향을 줄 수 있습니다. 이는 텍스트에서 반복을 줄이고, 더 일관성 있게 만드는 데 도움을 줄 수 있습니다.
+
+
+이 가이드에서는 다음과 같은 내용을 다룹니다:
+* 기본 생성 설정
+* 일반적인 디코딩 전략과 주요 파라미터
+* 🤗 Hub에서 미세 조정된 모델과 함께 사용자 정의 생성 설정을 저장하고 공유하는 방법
+
+## 기본 텍스트 생성 설정[[default-text-generation-configuration]]
+
+모델의 디코딩 전략은 생성 설정에서 정의됩니다. 사전 훈련된 모델을 [`pipeline`] 내에서 추론에 사용할 때, 모델은 내부적으로 기본 생성 설정을 적용하는 `PreTrainedModel.generate()` 메소드를 호출합니다. 사용자가 모델과 함께 사용자 정의 설정을 저장하지 않았을 경우에도 기본 설정이 사용됩니다.
+
+모델을 명시적으로 로드할 때, `model.generation_config`을 통해 제공되는 생성 설정을 검사할 수 있습니다.
+
+```python
+>>> from transformers import AutoModelForCausalLM
+
+>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
+>>> model.generation_config
+GenerationConfig {
+ "bos_token_id": 50256,
+ "eos_token_id": 50256,
+}
+```
+
+ `model.generation_config`를 출력하면 기본 설정과 다른 값들만 표시되고, 기본값들은 나열되지 않습니다.
+
+기본 생성 설정은 입력 프롬프트와 출력을 합친 최대 크기를 20 토큰으로 제한하여 리소스 부족을 방지합니다. 기본 디코딩 전략은 탐욕 탐색(greedy search)으로, 다음 토큰으로 가장 높은 확률을 가진 토큰을 선택하는 가장 단순한 디코딩 전략입니다. 많은 작업과 작은 출력 크기에 대해서는 이 방법이 잘 작동하지만, 더 긴 출력을 생성할 때 사용하면 매우 반복적인 결과를 생성하게 될 수 있습니다.
+
+## 텍스트 생성 사용자 정의[[customize-text-generation]]
+
+파라미터와 해당 값을 [`generate`] 메소드에 직접 전달하여 `generation_config`을 재정의할 수 있습니다:
+
+```python
+>>> my_model.generate(**inputs, num_beams=4, do_sample=True) # doctest: +SKIP
+```
+
+기본 디코딩 전략이 대부분의 작업에 잘 작동한다 하더라도, 조정할 수 있는 몇 가지 파라미터가 있습니다. 일반적으로 조정되는 파라미터에는 다음과 같은 것들이 포함됩니다:
+
+- `max_new_tokens`: 생성할 최대 토큰 수입니다. 즉, 프롬프트에 있는 토큰을 제외한 출력 시퀀스의 크기입니다. 출력의 길이를 중단 기준으로 사용하는 대신, 전체 생성물이 일정 시간을 초과할 때 생성을 중단하기로 선택할 수도 있습니다. 더 알아보려면 [`StoppingCriteria`]를 확인하세요.
+- `num_beams`: 1보다 큰 수의 빔을 지정함으로써, 탐욕 탐색(greedy search)에서 빔 탐색(beam search)으로 전환하게 됩니다. 이 전략은 각 시간 단계에서 여러 가설을 평가하고 결국 전체 시퀀스에 대해 가장 높은 확률을 가진 가설을 선택합니다. 이는 초기 토큰의 확률이 낮아 탐욕 탐색에 의해 무시되었을 높은 확률의 시퀀스를 식별할 수 있는 장점을 가집니다.
+- `do_sample`: 이 매개변수를 `True`로 설정하면, 다항 샘플링, 빔 탐색 다항 샘플링, Top-K 샘플링 및 Top-p 샘플링과 같은 디코딩 전략을 활성화합니다. 이러한 전략들은 전체 어휘에 대한 확률 분포에서 다음 토큰을 선택하며, 전략별로 특정 조정이 적용됩니다.
+- `num_return_sequences`: 각 입력에 대해 반환할 시퀀스 후보의 수입니다. 이 옵션은 빔 탐색(beam search)의 변형과 샘플링과 같이 여러 시퀀스 후보를 지원하는 디코딩 전략에만 사용할 수 있습니다. 탐욕 탐색(greedy search)과 대조 탐색(contrastive search) 같은 디코딩 전략은 단일 출력 시퀀스를 반환합니다.
+
+## 모델에 사용자 정의 디코딩 전략 저장[[save-a-custom-decoding-strategy-with-your-model]]
+
+특정 생성 설정을 가진 미세 조정된 모델을 공유하고자 할 때, 다음 단계를 따를 수 있습니다:
+* [`GenerationConfig`] 클래스 인스턴스를 생성합니다.
+* 디코딩 전략 파라미터를 설정합니다.
+* 생성 설정을 [`GenerationConfig.save_pretrained`]를 사용하여 저장하며, `config_file_name` 인자는 비워둡니다.
+* 모델의 저장소에 설정을 업로드하기 위해 `push_to_hub`를 `True`로 설정합니다.
+
+```python
+>>> from transformers import AutoModelForCausalLM, GenerationConfig
+
+>>> model = AutoModelForCausalLM.from_pretrained("my_account/my_model") # doctest: +SKIP
+>>> generation_config = GenerationConfig(
+... max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
+... )
+>>> generation_config.save_pretrained("my_account/my_model", push_to_hub=True) # doctest: +SKIP
+```
+
+단일 디렉토리에 여러 생성 설정을 저장할 수 있으며, 이때 [`GenerationConfig.save_pretrained`]의 `config_file_name` 인자를 사용합니다. 나중에 [`GenerationConfig.from_pretrained`]로 이들을 인스턴스화할 수 있습니다. 이는 단일 모델에 대해 여러 생성 설정을 저장하고 싶을 때 유용합니다(예: 샘플링을 이용한 창의적 텍스트 생성을 위한 하나, 빔 탐색을 이용한 요약을 위한 다른 하나 등). 모델에 설정 파일을 추가하기 위해 적절한 Hub 권한을 가지고 있어야 합니다.
+
+```python
+>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
+
+>>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
+>>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
+
+>>> translation_generation_config = GenerationConfig(
+... num_beams=4,
+... early_stopping=True,
+... decoder_start_token_id=0,
+... eos_token_id=model.config.eos_token_id,
+... pad_token=model.config.pad_token_id,
+... )
+
+>>> # 팁: Hub에 push하려면 `push_to_hub=True`를 추가
+>>> translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")
+
+>>> # 명명된 생성 설정 파일을 사용하여 생성을 매개변수화할 수 있습니다.
+>>> generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
+>>> inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
+>>> outputs = model.generate(**inputs, generation_config=generation_config)
+>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+['Les fichiers de configuration sont faciles à utiliser!']
+```
+
+## 스트리밍[[streaming]]
+
+`generate()` 메소드는 `streamer` 입력을 통해 스트리밍을 지원합니다. `streamer` 입력은 `put()`과 `end()` 메소드를 가진 클래스의 인스턴스와 호환됩니다. 내부적으로, `put()`은 새 토큰을 추가하는 데 사용되며, `end()`는 텍스트 생성의 끝을 표시하는 데 사용됩니다.
+
+
+
+스트리머 클래스의 API는 아직 개발 중이며, 향후 변경될 수 있습니다.
+
+
+
+실제로 다양한 목적을 위해 자체 스트리밍 클래스를 만들 수 있습니다! 또한, 기본적인 스트리밍 클래스들도 준비되어 있어 바로 사용할 수 있습니다. 예를 들어, [`TextStreamer`] 클래스를 사용하여 `generate()`의 출력을 화면에 한 단어씩 스트리밍할 수 있습니다:
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+
+>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
+>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
+>>> streamer = TextStreamer(tok)
+
+>>> # 스트리머는 평소와 같은 출력값을 반환할 뿐만 아니라 생성된 텍스트도 표준 출력(stdout)으로 출력합니다.
+>>> _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
+An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,
+```
+
+## 디코딩 전략[[decoding-strategies]]
+
+`generate()` 매개변수와 궁극적으로 `generation_config`의 특정 조합을 사용하여 특정 디코딩 전략을 활성화할 수 있습니다. 이 개념이 처음이라면, 흔히 사용되는 디코딩 전략이 어떻게 작동하는지 설명하는 [이 블로그 포스트](https://huggingface.co/blog/how-to-generate)를 읽어보는 것을 추천합니다.
+
+여기서는 디코딩 전략을 제어하는 몇 가지 매개변수를 보여주고, 이를 어떻게 사용할 수 있는지 설명하겠습니다.
+
+### 탐욕 탐색(Greedy Search)[[greedy-search]]
+
+[`generate`]는 기본적으로 탐욕 탐색 디코딩을 사용하므로 이를 활성화하기 위해 별도의 매개변수를 지정할 필요가 없습니다. 이는 `num_beams`가 1로 설정되고 `do_sample=False`로 되어 있다는 의미입니다.
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> prompt = "I look forward to"
+>>> checkpoint = "distilbert/distilgpt2"
+
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
+>>> outputs = model.generate(**inputs)
+>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
+```
+
+### 대조 탐색(Contrastive search)[[contrastive-search]]
+
+2022년 논문 [A Contrastive Framework for Neural Text Generation](https://arxiv.org/abs/2202.06417)에서 제안된 대조 탐색 디코딩 전략은 반복되지 않으면서도 일관된 긴 출력을 생성하는 데 있어 우수한 결과를 보였습니다. 대조 탐색이 작동하는 방식을 알아보려면 [이 블로그 포스트](https://huggingface.co/blog/introducing-csearch)를 확인하세요. 대조 탐색의 동작을 가능하게 하고 제어하는 두 가지 주요 매개변수는 `penalty_alpha`와 `top_k`입니다:
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM
+
+>>> checkpoint = "openai-community/gpt2-large"
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
+
+>>> prompt = "Hugging Face Company is"
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
+>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best
+in the business and our customer service is second to none.\n\nIf you have any questions about our
+products or services, feel free to contact us at any time. We look forward to hearing from you!']
+```
+
+### 다항 샘플링(Multinomial sampling)[[multinomial-sampling]]
+
+탐욕 탐색(greedy search)이 항상 가장 높은 확률을 가진 토큰을 다음 토큰으로 선택하는 것과 달리, 다항 샘플링(multinomial sampling, 조상 샘플링(ancestral sampling)이라고도 함)은 모델이 제공하는 전체 어휘에 대한 확률 분포를 기반으로 다음 토큰을 무작위로 선택합니다. 0이 아닌 확률을 가진 모든 토큰은 선택될 기회가 있으므로, 반복의 위험을 줄일 수 있습니다.
+
+다항 샘플링을 활성화하려면 `do_sample=True` 및 `num_beams=1`을 설정하세요.
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
+>>> set_seed(0) # 재현성을 위해
+
+>>> checkpoint = "openai-community/gpt2-large"
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
+
+>>> prompt = "Today was an amazing day because"
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
+>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+['Today was an amazing day because when you go to the World Cup and you don\'t, or when you don\'t get invited,
+that\'s a terrible feeling."']
+```
+
+### 빔 탐색(Beam-search) 디코딩[[beam-search-decoding]]
+
+탐욕 검색(greedy search)과 달리, 빔 탐색(beam search) 디코딩은 각 시간 단계에서 여러 가설을 유지하고 결국 전체 시퀀스에 대해 가장 높은 확률을 가진 가설을 선택합니다. 이는 낮은 확률의 초기 토큰으로 시작하고 그리디 검색에서 무시되었을 가능성이 높은 시퀀스를 식별하는 이점이 있습니다.
+
+이 디코딩 전략을 활성화하려면 `num_beams` (추적할 가설 수라고도 함)를 1보다 크게 지정하세요.
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> prompt = "It is astonishing how one can"
+>>> checkpoint = "openai-community/gpt2-medium"
+
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
+
+>>> outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
+>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of
+time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
+```
+
+### 빔 탐색 다항 샘플링(Beam-search multinomial sampling)[[beam-search-multinomial-sampling]]
+
+이 디코딩 전략은 이름에서 알 수 있듯이 빔 탐색과 다항 샘플링을 결합한 것입니다. 이 디코딩 전략을 사용하기 위해서는 `num_beams`를 1보다 큰 값으로 설정하고, `do_sample=True`로 설정해야 합니다.
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed
+>>> set_seed(0) # 재현성을 위해
+
+>>> prompt = "translate English to German: The house is wonderful."
+>>> checkpoint = "google-t5/t5-small"
+
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
+
+>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
+>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
+'Das Haus ist wunderbar.'
+```
+
+### 다양한 빔 탐색 디코딩(Diverse beam search decoding)[[diverse-beam-search-decoding]]
+
+다양한 빔 탐색(Decoding) 전략은 선택할 수 있는 더 다양한 빔 시퀀스 집합을 생성할 수 있게 해주는 빔 탐색 전략의 확장입니다. 이 방법은 어떻게 작동하는지 알아보려면, [다양한 빔 탐색: 신경 시퀀스 모델에서 다양한 솔루션 디코딩하기](https://arxiv.org/pdf/1610.02424.pdf)를 참조하세요. 이 접근 방식은 세 가지 주요 매개변수를 가지고 있습니다: `num_beams`, `num_beam_groups`, 그리고 `diversity_penalty`. 다양성 패널티는 그룹 간에 출력이 서로 다르게 하기 위한 것이며, 각 그룹 내에서 빔 탐색이 사용됩니다.
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+>>> checkpoint = "google/pegasus-xsum"
+>>> prompt = (
+... "The Permaculture Design Principles are a set of universal design principles "
+... "that can be applied to any location, climate and culture, and they allow us to design "
+... "the most efficient and sustainable human habitation and food production systems. "
+... "Permaculture is a design system that encompasses a wide variety of disciplines, such "
+... "as ecology, landscape design, environmental science and energy conservation, and the "
+... "Permaculture design principles are drawn from these various disciplines. Each individual "
+... "design principle itself embodies a complete conceptual framework based on sound "
+... "scientific principles. When we bring all these separate principles together, we can "
+... "create a design system that both looks at whole systems, the parts that these systems "
+... "consist of, and how those parts interact with each other to create a complex, dynamic, "
+... "living system. Each design principle serves as a tool that allows us to integrate all "
+... "the separate parts of a design, referred to as elements, into a functional, synergistic, "
+... "whole system, where the elements harmoniously interact and work together in the most "
+... "efficient way possible."
+... )
+
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
+
+>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
+>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
+'The Design Principles are a set of universal design principles that can be applied to any location, climate and
+culture, and they allow us to design the'
+```
+
+이 가이드에서는 다양한 디코딩 전략을 가능하게 하는 주요 매개변수를 보여줍니다. [`generate`] 메서드에 대한 고급 매개변수가 존재하므로 [`generate`] 메서드의 동작을 더욱 세부적으로 제어할 수 있습니다. 사용 가능한 매개변수의 전체 목록은 [API 문서](./main_classes/text_generation.md)를 참조하세요.
+
+### 추론 디코딩(Speculative Decoding)[[speculative-decoding]]
+
+추론 디코딩(보조 디코딩(assisted decoding)으로도 알려짐)은 동일한 토크나이저를 사용하는 훨씬 작은 보조 모델을 활용하여 몇 가지 후보 토큰을 생성하는 상위 모델의 디코딩 전략을 수정한 것입니다. 주 모델은 단일 전방 통과로 후보 토큰을 검증함으로써 디코딩 과정을 가속화합니다. `do_sample=True`일 경우, [추론 디코딩 논문](https://arxiv.org/pdf/2211.17192.pdf)에 소개된 토큰 검증과 재샘플링 방식이 사용됩니다.
+
+현재, 탐욕 검색(greedy search)과 샘플링만이 지원되는 보조 디코딩(assisted decoding) 기능을 통해, 보조 디코딩은 배치 입력을 지원하지 않습니다. 보조 디코딩에 대해 더 알고 싶다면, [이 블로그 포스트](https://huggingface.co/blog/assisted-generation)를 확인해 주세요.
+
+보조 디코딩을 활성화하려면 모델과 함께 `assistant_model` 인수를 설정하세요.
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> prompt = "Alice and Bob"
+>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
+>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
+
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
+>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
+>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
+>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
+```
+
+샘플링 방법과 함께 보조 디코딩을 사용하는 경우 다항 샘플링과 마찬가지로 `temperature` 인수를 사용하여 무작위성을 제어할 수 있습니다. 그러나 보조 디코딩에서는 `temperature`를 낮추면 대기 시간을 개선하는 데 도움이 될 수 있습니다.
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
+>>> set_seed(42) # 재현성을 위해
+
+>>> prompt = "Alice and Bob"
+>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
+>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
+
+>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+>>> inputs = tokenizer(prompt, return_tensors="pt")
+
+>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
+>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
+>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
+>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+['Alice and Bob are going to the same party. It is a small party, in a small']
+```
From 00bf44270f9def905af70ee994c290adc12ef2cb Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Wed, 6 Mar 2024 10:58:42 +0800
Subject: [PATCH 094/549] [FIX] `offload_weight()` takes from 3 to 4 positional
arguments but 5 were given (#29457)
* use require_torch_gpu
* enable on XPU
* fix
---
src/transformers/modeling_utils.py | 2 +-
tests/test_modeling_utils.py | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index b542307794168d..5aa9d0a770cfa1 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -796,7 +796,7 @@ def _load_state_dict_into_meta_model(
if not is_safetensors:
offload_index = offload_weight(param, param_name, offload_folder, offload_index)
elif param_device == "cpu" and state_dict_index is not None:
- state_dict_index = offload_weight(param, param_name, model, state_dict_folder, state_dict_index)
+ state_dict_index = offload_weight(param, param_name, state_dict_folder, state_dict_index)
elif (
hf_quantizer is None
or (not hf_quantizer.requires_parameters_quantization)
diff --git a/tests/test_modeling_utils.py b/tests/test_modeling_utils.py
index a334cb0f2853b5..57f0f11dbb8a06 100755
--- a/tests/test_modeling_utils.py
+++ b/tests/test_modeling_utils.py
@@ -765,7 +765,7 @@ def test_model_parallelism_gpt2(self):
@require_accelerate
@mark.accelerate_tests
- @require_torch_accelerator
+ @require_torch_gpu
def test_from_pretrained_disk_offload_task_model(self):
model = AutoModel.from_pretrained("hf-internal-testing/tiny-random-gpt2")
device_map = {
@@ -808,7 +808,7 @@ def test_from_pretrained_disk_offload_task_model(self):
@require_accelerate
@mark.accelerate_tests
- @require_torch_accelerator
+ @require_torch_gpu
def test_from_pretrained_disk_offload_derived_to_base_model(self):
derived_model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
From 2a002d073a337051bdc3fbdc95ff1bc0399ae2bb Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Wed, 6 Mar 2024 06:30:47 +0100
Subject: [PATCH 095/549] [`Docs` / `Awq`] Add docs on exllamav2 + AWQ (#29474)
* add docs on exllamav2 + AWQ
* Update docs/source/en/quantization.md
---
docs/source/en/quantization.md | 39 ++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/docs/source/en/quantization.md b/docs/source/en/quantization.md
index b8c09d449702d0..ef5a544bc4de47 100644
--- a/docs/source/en/quantization.md
+++ b/docs/source/en/quantization.md
@@ -196,6 +196,45 @@ The parameter `modules_to_fuse` should include:
+### Exllama-v2 support
+
+Recent versions of `autoawq` support exllama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
+
+```bash
+pip install git+https://github.com/casper-hansen/AutoAWQ.git
+```
+
+Get started by passing an `AwqConfig()` with `version="exllama"`.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
+
+quantization_config = AwqConfig(version="exllama")
+
+model = AutoModelForCausalLM.from_pretrained(
+ "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
+ quantization_config=quantization_config,
+ device_map="auto",
+)
+
+input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
+output = model(input_ids)
+print(output.logits)
+
+tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
+input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
+output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+
+
+Note that this feature is supported on AMD GPUs.
+
+
+
+
## AutoGPTQ
From b27aa206ddf3fe66b36db587603141b3d0379a82 Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Wed, 6 Mar 2024 06:58:37 +0100
Subject: [PATCH 096/549] [`docs`] Add starcoder2 docs (#29454)
* add accelerate docs
* Apply suggestions from code review
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
* Update starcoder2.md
* add correct generation
---------
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
---
docs/source/en/model_doc/starcoder2.md | 30 ++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/docs/source/en/model_doc/starcoder2.md b/docs/source/en/model_doc/starcoder2.md
index 42dac4e06a36e7..851ee5ea6ba0bb 100644
--- a/docs/source/en/model_doc/starcoder2.md
+++ b/docs/source/en/model_doc/starcoder2.md
@@ -18,10 +18,36 @@ rendered properly in your Markdown viewer.
## Overview
-Starcoder2 has been released with the paper [Stacoder-2](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view) by BigCode team.
+StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective. The models have been released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
-Documentation page about the model is coming soon
+The abstract of the paper is the following:
+> The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
+## License
+
+The models are licensed under the [BigCode OpenRAIL-M v1 license agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
+
+## Usage tips
+
+The StarCoder2 models can be found in the [HuggingFace hub](https://huggingface.co/collections/bigcode/starcoder2-65de6da6e87db3383572be1a). You can find some examples for inference and fine-tuning in StarCoder2's [GitHub repo](https://github.com/bigcode-project/starcoder2).
+
+These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hub:
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-7b", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b")
+
+>>> prompt = "def print_hello_world():"
+
+>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
+
+>>> generated_ids = model.generate(**model_inputs, max_new_tokens=10, do_sample=False)
+>>> tokenizer.batch_decode(generated_ids)[0]
+"def print_hello_world():\n\treturn 'Hello World!'"
+```
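+
+For GPUs with limited memory, the checkpoint can also be loaded in 4-bit; a minimal sketch (assuming `bitsandbytes` is installed and a CUDA device is available):
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+>>> model = AutoModelForCausalLM.from_pretrained(
+...     "bigcode/starcoder2-7b", quantization_config=quantization_config, device_map="auto"
+... )
+>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b")
+
+>>> inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
+>>> outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
+>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```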
## Starcoder2Config
From 2890116ab761256c8d7e806c6cbf8f7e841b2abc Mon Sep 17 00:00:00 2001
From: Matthew Hoffman
Date: Wed, 6 Mar 2024 01:44:08 -0800
Subject: [PATCH 097/549] Fix TrainingArguments regression with torch <2.0.0
for dataloader_prefetch_factor (#29447)
* Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor
dataloader_prefetch_factor was added to TrainingArguments in #28498 with the default value None, but versions of torch<2.0.0 do not accept None and will raise an error if num_workers == 0 and prefetch_factor != 2
* Add is_torch_available() check
* Use is_torch_greater_or_equal_than_2_0
add back check for dataloader_prefetch_factor
---
src/transformers/training_args.py | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/src/transformers/training_args.py b/src/transformers/training_args.py
index ba89d914d76135..5baa3e1b51f366 100644
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -66,6 +66,8 @@
import torch
import torch.distributed as dist
+ from .pytorch_utils import is_torch_greater_or_equal_than_2_0
+
if is_accelerate_available():
from accelerate.state import AcceleratorState, PartialState
from accelerate.utils import DistributedType
@@ -1023,13 +1025,13 @@ class TrainingArguments:
)
},
)
- dataloader_prefetch_factor: int = field(
- default=None,
+ dataloader_prefetch_factor: Optional[int] = field(
+ default=None if not is_torch_available() or is_torch_greater_or_equal_than_2_0 else 2,
metadata={
"help": (
"Number of batches loaded in advance by each worker. "
"2 means there will be a total of 2 * num_workers batches prefetched across all workers. "
- "Default is unset"
+ "Default is 2 for PyTorch < 2.0.0 and otherwise None."
)
},
)
@@ -1807,7 +1809,11 @@ def __post_init__(self):
if self.use_cpu:
self.dataloader_pin_memory = False
- if self.dataloader_num_workers == 0 and self.dataloader_prefetch_factor is not None:
+ if (
+ (not is_torch_available() or is_torch_greater_or_equal_than_2_0)
+ and self.dataloader_num_workers == 0
+ and self.dataloader_prefetch_factor is not None
+ ):
raise ValueError(
"--dataloader_prefetch_factor can only be set when data is loaded in a different process, i.e."
" when --dataloader_num_workers > 1."
From 41f7b7ae4ba0b601a4874b19265915f09696c2a8 Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Wed, 6 Mar 2024 10:57:04 +0000
Subject: [PATCH 098/549] Generate: add tests for caches with
`pad_to_multiple_of` (#29462)
---
tests/test_cache_utils.py | 74 +++++++++++++++++++++++++++++++++++++--
1 file changed, 72 insertions(+), 2 deletions(-)
diff --git a/tests/test_cache_utils.py b/tests/test_cache_utils.py
index 6d31d63e82ef51..0b194417bb5ef1 100644
--- a/tests/test_cache_utils.py
+++ b/tests/test_cache_utils.py
@@ -291,7 +291,7 @@ def test_sink_cache_iterative_prompts(self):
@require_torch_gpu
@parameterized.expand(["eager", "sdpa", "flash_attention_2"])
- def test_static_cache_greedy_sampling_pad_left(self, attn_implementation):
+ def test_static_cache_greedy_decoding_pad_left(self, attn_implementation):
EXPECTED_GENERATION = [
"The best color is the one that complements the skin tone of the",
"We should not undermind the issues at hand.\nWe should not undermind the issues",
@@ -331,7 +331,7 @@ def test_static_cache_greedy_sampling_pad_left(self, attn_implementation):
@require_torch_gpu
@parameterized.expand(["eager", "sdpa", "flash_attention_2"])
- def test_static_cache_greedy_sampling_pad_right(self, attn_implementation):
+ def test_static_cache_greedy_decoding_pad_right(self, attn_implementation):
EXPECTED_GENERATION = [
"The best color isЋ the one that complements the skin tone of",
"We should not undermind the issues at hand.\nWe should not undermind the issues",
@@ -382,6 +382,76 @@ def call(input_ids, **kwargs):
with self.subTest(f"{attn_implementation}, static, compiled"):
self.assertListEqual(decoded, EXPECTED_GENERATION)
+ def test_dynamic_cache_extra_left_padding(self):
+ """Tests that adding extra left-padding does not affect the generation with the dynamic cache"""
+ EXPECTED_GENERATION = [
+ "The best color is the one that complements the skin tone of the",
+ "We should not undermind the issues at hand.\nWe should not undermind the issues",
+ ]
+
+ tokenizer = AutoTokenizer.from_pretrained(
+ "NousResearch/Llama-2-7b-chat-hf", padding_side="left", pad_token=""
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+ "NousResearch/Llama-2-7b-chat-hf",
+ torch_dtype=torch.bfloat16,
+ ).to(torch_device)
+ inputs = tokenizer(
+ ["The best color is", "We should not undermind the issues at hand"], padding=True, return_tensors="pt"
+ ).to(model.device)
+
+ gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
+ decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
+ self.assertListEqual(decoded, EXPECTED_GENERATION)
+
+ # Now with extra left-padding
+ inputs_expanded = tokenizer(
+ ["The best color is", "We should not undermind the issues at hand"],
+ padding=True,
+ return_tensors="pt",
+ pad_to_multiple_of=32,
+ ).to(model.device)
+ self.assertTrue(inputs.input_ids.shape[1] < inputs_expanded.input_ids.shape[1])
+ gen_out = model.generate(**inputs_expanded, do_sample=False, max_new_tokens=10)
+ decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
+ self.assertListEqual(decoded, EXPECTED_GENERATION)
+
+ def test_static_cache_extra_left_padding(self):
+ """Tests that adding extra left-padding does not affect the generation with the static cache"""
+ EXPECTED_GENERATION = [
+ "The best color is the one that complements the skin tone of the",
+ "We should not undermind the issues at hand.\nWe should not undermind the issues",
+ ]
+
+ tokenizer = AutoTokenizer.from_pretrained(
+ "NousResearch/Llama-2-7b-chat-hf", padding_side="left", pad_token=""
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+ "NousResearch/Llama-2-7b-chat-hf",
+ torch_dtype=torch.bfloat16,
+ ).to(torch_device)
+ inputs = tokenizer(
+ ["The best color is", "We should not undermind the issues at hand"], padding=True, return_tensors="pt"
+ ).to(model.device)
+
+ model.generation_config.cache_implementation = "static"
+
+ gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
+ decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
+ self.assertListEqual(decoded, EXPECTED_GENERATION)
+
+ # Now with extra left-padding
+ inputs_expanded = tokenizer(
+ ["The best color is", "We should not undermind the issues at hand"],
+ padding=True,
+ return_tensors="pt",
+ pad_to_multiple_of=32,
+ ).to(model.device)
+ self.assertTrue(inputs.input_ids.shape[1] < inputs_expanded.input_ids.shape[1])
+ gen_out = model.generate(**inputs_expanded, do_sample=False, max_new_tokens=10)
+ decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
+ self.assertListEqual(decoded, EXPECTED_GENERATION)
+
@unittest.skip("TODO @gante static cache's does not support beam search yet")
def test_static_cache_beam_search(self):
pass
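The property exercised by these tests can also be checked by hand; a minimal sketch using `gpt2` as a stand-in checkpoint (the tests above use a Llama chat model on GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The best color is", "We should not undermine the issues at hand"]

inputs = tokenizer(prompts, padding=True, return_tensors="pt")
padded = tokenizer(prompts, padding=True, pad_to_multiple_of=32, return_tensors="pt")

# extra left-padding should not change greedy generation
out = model.generate(**inputs, do_sample=False, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)
out_padded = model.generate(**padded, do_sample=False, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.batch_decode(out, skip_special_tokens=True))
print(tokenizer.batch_decode(out_padded, skip_special_tokens=True))
```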
From 700d48fb2deb0c24863b592513898f0f477822eb Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Wed, 6 Mar 2024 11:18:35 +0000
Subject: [PATCH 099/549] Generate: get generation mode from the generation config instance 🧼 (#29441)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
.../source/en/main_classes/text_generation.md | 3 +
src/transformers/generation/__init__.py | 4 +-
.../generation/configuration_utils.py | 80 ++++++++++++++++++-
src/transformers/generation/utils.py | 63 +--------------
tests/generation/test_configuration_utils.py | 18 +++++
5 files changed, 103 insertions(+), 65 deletions(-)
diff --git a/docs/source/en/main_classes/text_generation.md b/docs/source/en/main_classes/text_generation.md
index a43519d5a042d2..dec524d257137f 100644
--- a/docs/source/en/main_classes/text_generation.md
+++ b/docs/source/en/main_classes/text_generation.md
@@ -37,6 +37,9 @@ like token streaming.
- from_pretrained
- from_model_config
- save_pretrained
+ - update
+ - validate
+ - get_generation_mode
## GenerationMixin
diff --git a/src/transformers/generation/__init__.py b/src/transformers/generation/__init__.py
index e45f546cdc2780..178be03861a10c 100644
--- a/src/transformers/generation/__init__.py
+++ b/src/transformers/generation/__init__.py
@@ -18,7 +18,7 @@
_import_structure = {
- "configuration_utils": ["GenerationConfig"],
+ "configuration_utils": ["GenerationConfig", "GenerationMode"],
"streamers": ["TextIteratorStreamer", "TextStreamer"],
}
@@ -172,7 +172,7 @@
]
if TYPE_CHECKING:
- from .configuration_utils import GenerationConfig
+ from .configuration_utils import GenerationConfig, GenerationMode
from .streamers import TextIteratorStreamer, TextStreamer
try:
diff --git a/src/transformers/generation/configuration_utils.py b/src/transformers/generation/configuration_utils.py
index cacc2dc8e8a8c9..b937b59733b000 100644
--- a/src/transformers/generation/configuration_utils.py
+++ b/src/transformers/generation/configuration_utils.py
@@ -18,12 +18,13 @@
import json
import os
import warnings
-from typing import Any, Dict, Optional, Union
+from typing import TYPE_CHECKING, Any, Dict, Optional, Union
from .. import __version__
from ..configuration_utils import PretrainedConfig
from ..utils import (
GENERATION_CONFIG_NAME,
+ ExplicitEnum,
PushToHubMixin,
cached_file,
download_url,
@@ -33,10 +34,31 @@
)
+if TYPE_CHECKING:
+ from ..modeling_utils import PreTrainedModel
+
+
logger = logging.get_logger(__name__)
METADATA_FIELDS = ("_from_model_config", "_commit_hash", "_original_object_hash", "transformers_version")
+class GenerationMode(ExplicitEnum):
+ """
+ Possible generation modes, downstream of the [`~generation.GenerationMixin.generate`] method.
+ """
+
+ # Non-beam methods
+ CONTRASTIVE_SEARCH = "contrastive_search"
+ GREEDY_SEARCH = "greedy_search"
+ SAMPLE = "sample"
+ ASSISTED_GENERATION = "assisted_generation"
+ # Beam methods
+ BEAM_SEARCH = "beam_search"
+ BEAM_SAMPLE = "beam_sample"
+ CONSTRAINED_BEAM_SEARCH = "constrained_beam_search"
+ GROUP_BEAM_SEARCH = "group_beam_search"
+
+
class GenerationConfig(PushToHubMixin):
# no-format
r"""
@@ -376,13 +398,65 @@ def __eq__(self, other):
def __repr__(self):
return f"{self.__class__.__name__} {self.to_json_string(ignore_metadata=True)}"
+ def get_generation_mode(self, assistant_model: Optional["PreTrainedModel"] = None) -> GenerationMode:
+ """
+ Returns the generation mode triggered by the [`GenerationConfig`] instance.
+
+ Args:
+ assistant_model (`PreTrainedModel`, *optional*):
+ The assistant model to be used for assisted generation. If set, the generation mode will be
+ assisted generation.
+
+ Returns:
+ `GenerationMode`: The generation mode triggered by the instance.
+ """
+ # TODO joao: find out a way of not depending on external fields (e.g. `assistant_model`), then make this a
+ # property and part of the `__repr__`
+ if self.constraints is not None or self.force_words_ids is not None:
+ generation_mode = GenerationMode.CONSTRAINED_BEAM_SEARCH
+ elif self.num_beams == 1:
+ if self.do_sample is False:
+ if (
+ self.top_k is not None
+ and self.top_k > 1
+ and self.penalty_alpha is not None
+ and self.penalty_alpha > 0
+ ):
+ generation_mode = GenerationMode.CONTRASTIVE_SEARCH
+ else:
+ generation_mode = GenerationMode.GREEDY_SEARCH
+ else:
+ generation_mode = GenerationMode.SAMPLE
+ else:
+ if self.num_beam_groups > 1:
+ generation_mode = GenerationMode.GROUP_BEAM_SEARCH
+ elif self.do_sample is True:
+ generation_mode = GenerationMode.BEAM_SAMPLE
+ else:
+ generation_mode = GenerationMode.BEAM_SEARCH
+
+ # Assisted generation may extend some generation modes
+ if assistant_model is not None or self.prompt_lookup_num_tokens is not None:
+ if generation_mode in ("greedy_search", "sample"):
+ generation_mode = GenerationMode.ASSISTED_GENERATION
+ else:
+ raise ValueError(
+ "You've set `assistant_model`, which triggers assisted generate. Currently, assisted generate "
+ "is only supported with Greedy Search and Sample."
+ )
+ return generation_mode
+
def validate(self, is_init=False):
"""
Validates the values of the attributes of the [`GenerationConfig`] instance. Raises exceptions in the presence
of parameterization that can be detected as incorrect from the configuration instance alone.
- Note that some parameters are best validated at generate runtime, as they may depend on other inputs and/or the
- model, such as parameters related to the generation length.
+ Note that some parameters not validated here are best validated at generate runtime, as they may depend on
+ other inputs and/or the model, such as parameters related to the generation length.
+
+ Args:
+ is_init (`bool`, *optional*, defaults to `False`):
+ Whether the validation is performed during the initialization of the instance.
"""
# Validation of individual attributes
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 2f68c8b2f4bc12..d6207fc354acff 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -34,7 +34,7 @@
MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING,
MODEL_FOR_VISION_2_SEQ_MAPPING,
)
-from ..utils import ExplicitEnum, ModelOutput, is_accelerate_available, logging
+from ..utils import ModelOutput, is_accelerate_available, logging
from .beam_constraints import DisjunctiveConstraint, PhrasalConstraint
from .beam_search import BeamScorer, BeamSearchScorer, ConstrainedBeamSearchScorer
from .candidate_generator import (
@@ -45,7 +45,7 @@
_prepare_attention_mask,
_prepare_token_type_ids,
)
-from .configuration_utils import GenerationConfig
+from .configuration_utils import GenerationConfig, GenerationMode
from .logits_process import (
EncoderNoRepeatNGramLogitsProcessor,
EncoderRepetitionPenaltyLogitsProcessor,
@@ -325,23 +325,6 @@ class GenerateBeamEncoderDecoderOutput(ModelOutput):
GenerateOutput = Union[GenerateNonBeamOutput, GenerateBeamOutput]
-class GenerationMode(ExplicitEnum):
- """
- Possible generation modes, downstream of the [`~generation.GenerationMixin.generate`] method.
- """
-
- # Non-beam methods
- CONTRASTIVE_SEARCH = "contrastive_search"
- GREEDY_SEARCH = "greedy_search"
- SAMPLE = "sample"
- ASSISTED_GENERATION = "assisted_generation"
- # Beam methods
- BEAM_SEARCH = "beam_search"
- BEAM_SAMPLE = "beam_sample"
- CONSTRAINED_BEAM_SEARCH = "constrained_beam_search"
- GROUP_BEAM_SEARCH = "group_beam_search"
-
-
class GenerationMixin:
"""
A class containing all functions for auto-regressive text generation, to be used as a mixin in [`PreTrainedModel`].
@@ -764,46 +747,6 @@ def _get_logits_warper(
warpers.append(LogitNormalization())
return warpers
- def _get_generation_mode(
- self, generation_config: GenerationConfig, assistant_model: Optional["PreTrainedModel"]
- ) -> GenerationMode:
- """
- Returns the generation mode triggered by a [`GenerationConfig`] instance.
- """
- if generation_config.constraints is not None or generation_config.force_words_ids is not None:
- generation_mode = GenerationMode.CONSTRAINED_BEAM_SEARCH
- elif generation_config.num_beams == 1:
- if generation_config.do_sample is False:
- if (
- generation_config.top_k is not None
- and generation_config.top_k > 1
- and generation_config.penalty_alpha is not None
- and generation_config.penalty_alpha > 0
- ):
- generation_mode = GenerationMode.CONTRASTIVE_SEARCH
- else:
- generation_mode = GenerationMode.GREEDY_SEARCH
- else:
- generation_mode = GenerationMode.SAMPLE
- else:
- if generation_config.num_beam_groups > 1:
- generation_mode = GenerationMode.GROUP_BEAM_SEARCH
- elif generation_config.do_sample is True:
- generation_mode = GenerationMode.BEAM_SAMPLE
- else:
- generation_mode = GenerationMode.BEAM_SEARCH
-
- # Assisted generation may extend some generation modes
- if assistant_model is not None or generation_config.prompt_lookup_num_tokens is not None:
- if generation_mode in ("greedy_search", "sample"):
- generation_mode = GenerationMode.ASSISTED_GENERATION
- else:
- raise ValueError(
- "You've set `assistant_model`, which triggers assisted generate. Currently, assisted generate "
- "is only supported with Greedy Search and Sample."
- )
- return generation_mode
-
def _get_logits_processor(
self,
generation_config: GenerationConfig,
@@ -1474,7 +1417,7 @@ def generate(
self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)
# 7. determine generation mode
- generation_mode = self._get_generation_mode(generation_config, assistant_model)
+ generation_mode = generation_config.get_generation_mode(assistant_model)
if streamer is not None and (generation_config.num_beams > 1):
raise ValueError(
diff --git a/tests/generation/test_configuration_utils.py b/tests/generation/test_configuration_utils.py
index a86dd31440487d..ece3f33a06070c 100644
--- a/tests/generation/test_configuration_utils.py
+++ b/tests/generation/test_configuration_utils.py
@@ -24,6 +24,7 @@
from requests.exceptions import HTTPError
from transformers import AutoConfig, GenerationConfig
+from transformers.generation import GenerationMode
from transformers.testing_utils import TOKEN, USER, is_staging_test
@@ -202,6 +203,23 @@ def test_refuse_to_save(self):
self.assertEqual(len(captured_warnings), 0)
self.assertTrue(len(os.listdir(tmp_dir)) == 1)
+ def test_generation_mode(self):
+ """Tests that the `get_generation_mode` method is working as expected."""
+ config = GenerationConfig()
+ self.assertEqual(config.get_generation_mode(), GenerationMode.GREEDY_SEARCH)
+
+ config = GenerationConfig(do_sample=True)
+ self.assertEqual(config.get_generation_mode(), GenerationMode.SAMPLE)
+
+ config = GenerationConfig(num_beams=2)
+ self.assertEqual(config.get_generation_mode(), GenerationMode.BEAM_SEARCH)
+
+ config = GenerationConfig(top_k=10, do_sample=False, penalty_alpha=0.6)
+ self.assertEqual(config.get_generation_mode(), GenerationMode.CONTRASTIVE_SEARCH)
+
+ config = GenerationConfig()
+ self.assertEqual(config.get_generation_mode(assistant_model="foo"), GenerationMode.ASSISTED_GENERATION)
+
@is_staging_test
class ConfigPushToHubTester(unittest.TestCase):
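For reference, the newly public method can be called directly on a config instance, mirroring the test above:

```python
from transformers import GenerationConfig
from transformers.generation import GenerationMode

config = GenerationConfig(num_beams=4, do_sample=True)
print(config.get_generation_mode())  # GenerationMode.BEAM_SAMPLE

config = GenerationConfig(top_k=10, do_sample=False, penalty_alpha=0.6)
print(config.get_generation_mode())  # GenerationMode.CONTRASTIVE_SEARCH
```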
From 0a5b0516f879d19551f935f19f006cbb0f9e68ca Mon Sep 17 00:00:00 2001
From: Ofir Zafrir
Date: Wed, 6 Mar 2024 13:19:47 +0200
Subject: [PATCH 100/549] Avoid dummy token in PLD to optimize performance
(#29445)
---
src/transformers/generation/candidate_generator.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/transformers/generation/candidate_generator.py b/src/transformers/generation/candidate_generator.py
index 4b8fa144f04b6b..b7facddf815569 100644
--- a/src/transformers/generation/candidate_generator.py
+++ b/src/transformers/generation/candidate_generator.py
@@ -302,8 +302,8 @@ def get_candidates(self, input_ids: torch.LongTensor) -> Tuple[torch.LongTensor,
break
if chosen_ids is None or len(chosen_ids) == 0:
- # Need to make a dummy tensor to avoid errors
- chosen_ids = torch.zeros((1), dtype=torch.long, device=input_ids.device)
+ # If no match was found, return the input sequence unchanged; this reverts back to autoregressive decoding
+ return input_ids, None
# Now need extend input_ids with chosen_ids
chosen_ids = chosen_ids.unsqueeze(0)
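A standalone toy version of the behavior this change introduces (illustration only, not the library code): when the trailing n-gram of the prompt has no earlier match, the candidate generator now hands the input back untouched, which simply falls back to ordinary autoregressive decoding instead of appending a dummy token.

```python
import torch

def toy_prompt_lookup(input_ids: torch.LongTensor, max_ngram: int = 2, num_new: int = 10):
    """Toy re-implementation of prompt-lookup candidates (illustration only)."""
    seq = input_ids[0]
    for ngram_size in range(max_ngram, 0, -1):
        tail = seq[-ngram_size:]
        # scan earlier positions, most recent first, for the same n-gram
        for start in range(seq.size(0) - ngram_size - 1, -1, -1):
            if torch.equal(seq[start : start + ngram_size], tail):
                continuation = seq[start + ngram_size : start + ngram_size + num_new]
                if continuation.numel() > 0:
                    return torch.cat([input_ids, continuation[None, :]], dim=-1), None
    # no match: return the input unchanged instead of a dummy token,
    # reverting to plain autoregressive decoding for this step
    return input_ids, None

ids = torch.tensor([[5, 6, 7, 8, 5, 6]])
print(toy_prompt_lookup(ids))  # proposes [7, 8, ...] after the repeated (5, 6)
```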
From 9322576e2f49d1014fb0c00a7a7c8c34b6a5fd35 Mon Sep 17 00:00:00 2001
From: Zach Mueller
Date: Wed, 6 Mar 2024 07:11:53 -0500
Subject: [PATCH 101/549] Fix test failure on DeepSpeed (#29444)
* Fix test failure
* use item
---
src/transformers/trainer.py | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 99792019846210..056f7a2ca96e34 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -2011,7 +2011,10 @@ def _inner_training_loop(
is_accelerate_available()
and self.accelerator.distributed_type == DistributedType.DEEPSPEED
):
- grad_norm = model.get_global_grad_norm().item()
+ grad_norm = model.get_global_grad_norm()
+ # In some cases the grad norm may not return a float
+ if hasattr(grad_norm, "item"):
+ grad_norm = grad_norm.item()
else:
grad_norm = _grad_norm.item() if _grad_norm is not None else None
From ddb4fda3cb5b4906f9b4ed7e9d126ff84a1a12b5 Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Wed, 6 Mar 2024 14:28:45 +0000
Subject: [PATCH 102/549] Generate: torch.compile-ready generation config
preparation (#29443)
---
src/transformers/generation/utils.py | 92 ++++++++++++++++++----------
src/transformers/utils/__init__.py | 1 +
2 files changed, 60 insertions(+), 33 deletions(-)
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index d6207fc354acff..6dd4df4303581d 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -34,7 +34,7 @@
MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING,
MODEL_FOR_VISION_2_SEQ_MAPPING,
)
-from ..utils import ModelOutput, is_accelerate_available, logging
+from ..utils import ModelOutput, is_accelerate_available, is_torchdynamo_compiling, logging
from .beam_constraints import DisjunctiveConstraint, PhrasalConstraint
from .beam_search import BeamScorer, BeamSearchScorer, ConstrainedBeamSearchScorer
from .candidate_generator import (
@@ -1162,6 +1162,59 @@ def _validate_generated_length(self, generation_config, input_ids_length, has_de
UserWarning,
)
+ def _prepare_generation_config(
+ self, generation_config: GenerationConfig, **kwargs: Dict
+ ) -> Tuple[GenerationConfig, Dict]:
+ """
+ Prepares the base generation config, then applies any generation configuration options from kwargs.
+ """
+ # TODO joao: when we can detect `fullgraph=True` in `torch.compile` (https://github.com/pytorch/pytorch/pull/120400)
+ # replace `is_torchdynamo_compiling` by the corresponding check. As it is, we are being too restrictive with
+ # the parameterization in `fullgraph=False` so as to enable `fullgraph=True`.
+
+ # priority: `generation_config` argument > `model.generation_config` (the default generation config)
+ if generation_config is None:
+ # legacy: users may modify the model configuration to control generation. To trigger this legacy behavior,
+ # three conditions must be met
+ # 1) the generation config must have been created from the model config (`_from_model_config` field);
+ # 2) the generation config must have seen no modification since its creation (the hash is the same);
+ # 3) the user must have set generation parameters in the model config.
+ # NOTE: `torch.compile` can't compile `hash`, this legacy support is disabled with compilation.
+ if (
+ not is_torchdynamo_compiling()
+ and self.generation_config._from_model_config
+ and self.generation_config._original_object_hash == hash(self.generation_config)
+ and self.config._has_non_default_generation_parameters()
+ ):
+ new_generation_config = GenerationConfig.from_model_config(self.config)
+ if new_generation_config != self.generation_config:
+ warnings.warn(
+ "You have modified the pretrained model configuration to control generation. This is a"
+ " deprecated strategy to control generation and will be removed soon, in a future version."
+ " Please use and modify the model generation configuration (see"
+ " https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )"
+ )
+ self.generation_config = new_generation_config
+ generation_config = self.generation_config
+
+ # `torch.compile` can't compile `copy.deepcopy`, arguments in `kwargs` that are part of `generation_config`
+ # will mutate the object with `.update`. As such, passing these arguments through `kwargs` is disabled.
+ if is_torchdynamo_compiling():
+ model_kwargs = kwargs
+ generate_attributes_in_kwargs = [
+ key for key, value in kwargs.items() if getattr(generation_config, key, None) != value
+ ]
+ if len(generate_attributes_in_kwargs) > 0:
+ raise ValueError(
+ "`torch.compile` exception: all generation configuration attributes must be passed within a "
+ f"`generation_config` instance passed to `generate` (found: {generate_attributes_in_kwargs})."
+ )
+ else:
+ generation_config = copy.deepcopy(generation_config)
+ model_kwargs = generation_config.update(**kwargs)
+
+ return generation_config, model_kwargs
+
@torch.no_grad()
def generate(
self,
@@ -1260,44 +1313,17 @@ def generate(
- [`~generation.GenerateEncoderDecoderOutput`],
- [`~generation.GenerateBeamEncoderDecoderOutput`]
"""
+ # 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call
+ self._validate_model_class()
+ generation_config, model_kwargs = self._prepare_generation_config(generation_config, **kwargs)
+ self._validate_model_kwargs(model_kwargs.copy())
+ # 2. Set generation parameters if not already defined
if synced_gpus is None:
if is_deepspeed_zero3_enabled() and dist.get_world_size() > 1:
synced_gpus = True
else:
synced_gpus = False
-
- # 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call
- self._validate_model_class()
-
- # priority: `generation_config` argument > `model.generation_config` (the default generation config)
- if generation_config is None:
- # legacy: users may modify the model configuration to control generation. To trigger this legacy behavior,
- # three conditions must be met
- # 1) the generation config must have been created from the model config (`_from_model_config` field);
- # 2) the generation config must have seen no modification since its creation (the hash is the same);
- # 3) the user must have set generation parameters in the model config.
- if (
- self.generation_config._from_model_config
- and self.generation_config._original_object_hash == hash(self.generation_config)
- and self.config._has_non_default_generation_parameters()
- ):
- new_generation_config = GenerationConfig.from_model_config(self.config)
- if new_generation_config != self.generation_config:
- warnings.warn(
- "You have modified the pretrained model configuration to control generation. This is a"
- " deprecated strategy to control generation and will be removed soon, in a future version."
- " Please use and modify the model generation configuration (see"
- " https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )"
- )
- self.generation_config = new_generation_config
- generation_config = self.generation_config
-
- generation_config = copy.deepcopy(generation_config)
- model_kwargs = generation_config.update(**kwargs) # All unused kwargs must be model kwargs
- self._validate_model_kwargs(model_kwargs.copy())
-
- # 2. Set generation parameters if not already defined
logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py
index 03e2663350794b..2fe931b3f38faf 100644
--- a/src/transformers/utils/__init__.py
+++ b/src/transformers/utils/__init__.py
@@ -193,6 +193,7 @@
is_torchaudio_available,
is_torchdistx_available,
is_torchdynamo_available,
+ is_torchdynamo_compiling,
is_torchvision_available,
is_training_run_on_sagemaker,
is_vision_available,
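The calling convention this prepares for, sketched with `gpt2` as a stand-in checkpoint: under `torch.compile`, generation options should be carried by a `GenerationConfig` instance rather than loose kwargs (which would otherwise be folded in via `copy.deepcopy` and `update`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# all generation attributes live on the config object, not in loose kwargs
generation_config = GenerationConfig(
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```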
From 19fb1e22d2bdadf6611e029a6ae82606d1520c5f Mon Sep 17 00:00:00 2001
From: Moshe Berchansky
Date: Wed, 6 Mar 2024 17:06:45 +0200
Subject: [PATCH 103/549] added the max_matching_ngram_size to GenerationConfig
(#29131)
* added the max_matching_ngram_size parameter into the GenerationConfig, for the PromptLookupCandidateGenerator
* switched back to keyword arguments
* added PromptLookupCandidateGenerator docstring for its parameters
* ruff reformat
* Update src/transformers/generation/configuration_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Joao Gante
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
src/transformers/generation/candidate_generator.py | 4 ++--
src/transformers/generation/configuration_utils.py | 8 ++++++++
src/transformers/generation/utils.py | 1 +
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/src/transformers/generation/candidate_generator.py b/src/transformers/generation/candidate_generator.py
index b7facddf815569..ff4eea9765f093 100644
--- a/src/transformers/generation/candidate_generator.py
+++ b/src/transformers/generation/candidate_generator.py
@@ -252,10 +252,10 @@ class PromptLookupCandidateGenerator(CandidateGenerator):
def __init__(
self,
num_output_tokens: int = 10,
- max_matching_ngram_size: int = 2,
+ max_matching_ngram_size: int = None,
):
self.num_output_tokens = num_output_tokens
- self.max_matching_ngram_size = max_matching_ngram_size
+ self.max_matching_ngram_size = max_matching_ngram_size if max_matching_ngram_size else 2
if self.max_matching_ngram_size <= 0 or self.num_output_tokens <= 0:
raise ValueError("Invalid max_matching_ngram_size or num_output_tokens")
diff --git a/src/transformers/generation/configuration_utils.py b/src/transformers/generation/configuration_utils.py
index b937b59733b000..974e1452d0172e 100644
--- a/src/transformers/generation/configuration_utils.py
+++ b/src/transformers/generation/configuration_utils.py
@@ -279,11 +279,18 @@ class GenerationConfig(PushToHubMixin):
- `"heuristic_transient"`: Same as `"heuristic"` but `num_assistant_tokens` is reset to its initial value after each generation call.
- `"constant"`: `num_assistant_tokens` stays unchanged during generation
+ prompt_lookup_num_tokens (`int`, *optional*, defaults to `None`):
+ The number of tokens to be output as candidate tokens.
+
+ max_matching_ngram_size (`int`, *optional*, defaults to `None`):
+ The maximum ngram size to be considered for matching in the prompt. Defaults to 2 if not provided.
+
> Parameters specific to the caching mechanism:
cache_implementation (`str`, *optional*, default to `None`):
Cache class that should be used when generating.
+
> Wild card
generation_kwargs:
@@ -360,6 +367,7 @@ def __init__(self, **kwargs):
# Prompt lookup decoding
self.prompt_lookup_num_tokens = kwargs.pop("prompt_lookup_num_tokens", None)
+ self.max_matching_ngram_size = kwargs.pop("max_matching_ngram_size", None)
# Wild card
self.generation_kwargs = kwargs.pop("generation_kwargs", {})
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 6dd4df4303581d..e36bed65719c2e 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -688,6 +688,7 @@ def _get_candidate_generator(
if generation_config.prompt_lookup_num_tokens is not None:
candidate_generator = PromptLookupCandidateGenerator(
num_output_tokens=generation_config.prompt_lookup_num_tokens,
+ max_matching_ngram_size=generation_config.max_matching_ngram_size,
)
else:
candidate_generator = AssistedCandidateGenerator(
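A minimal usage sketch of the new knob together with prompt lookup decoding, with `gpt2` as a stand-in checkpoint; both options are forwarded to the `GenerationConfig` as ordinary `generate` kwargs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    prompt_lookup_num_tokens=10,   # number of candidate tokens copied per match
    max_matching_ngram_size=2,     # new in this patch; falls back to 2 when unset
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```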
From 965cf677695dd363285831afca8cf479cf0c600c Mon Sep 17 00:00:00 2001
From: Alvaro Bartolome
Date: Wed, 6 Mar 2024 18:03:55 +0100
Subject: [PATCH 104/549] Fix `TextGenerationPipeline.__call__` docstring
(#29491)
---
src/transformers/pipelines/text_generation.py | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/src/transformers/pipelines/text_generation.py b/src/transformers/pipelines/text_generation.py
index ef64fb84dddda1..0b358291717ee0 100644
--- a/src/transformers/pipelines/text_generation.py
+++ b/src/transformers/pipelines/text_generation.py
@@ -194,7 +194,7 @@ def __call__(self, text_inputs, **kwargs):
Complete the prompt(s) given as inputs.
Args:
- args (`str` or `List[str]`):
+ text_inputs (`str` or `List[str]`):
One or several prompts (or one list of prompts) to complete.
return_tensors (`bool`, *optional*, defaults to `False`):
Whether or not to return the tensors of predictions (as token indices) in the outputs. If set to
@@ -217,8 +217,7 @@ def __call__(self, text_inputs, **kwargs):
- `None` : default strategy where nothing in particular happens
- `"hole"`: Truncates left of input, and leaves a gap wide enough to let generation happen (might
truncate a lot of the prompt and not suitable when generation exceed the model capacity)
-
- generate_kwargs:
+ generate_kwargs (`dict`, *optional*):
Additional keyword arguments to pass along to the generate method of the model (see the generate method
corresponding to your framework [here](./model#generative-models)).
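For reference, a small usage sketch of the documented signature: the first positional argument is `text_inputs`, and extra keyword arguments are forwarded to `generate` (the model choice here is just an example):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
outputs = generator("Once upon a time", max_new_tokens=20, do_sample=False)
print(outputs[0]["generated_text"])
```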
From 2a939f20ff34c5fcf10f944f1e994c87a84fe83d Mon Sep 17 00:00:00 2001
From: Glen Taggart <35577239+nqgl@users.noreply.github.com>
Date: Wed, 6 Mar 2024 15:56:25 -0800
Subject: [PATCH 105/549] Substantially reduce memory usage in
_update_causal_mask for large batches by using .expand instead of .repeat
[needs tests+sanity check] (#29413)
* try to fix gemma mem use
* fix: handle attention mask dim==2 case
* remove logits=logits.float()
* clean up + add llama
* apply formatting
* readability edit: swap order of items being multiplied
* revert change unrelated to PR
* revert black autoformat
* switch to one .to
* Accept style edits
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
src/transformers/models/gemma/modeling_gemma.py | 6 ++++--
src/transformers/models/llama/modeling_llama.py | 9 +++++----
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index ea239193afc352..8869a617e92d10 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -14,6 +14,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch Gemma model."""
+
import math
import warnings
from typing import List, Optional, Tuple, Union
@@ -971,10 +972,11 @@ def _update_causal_mask(self, attention_mask, input_tensor):
# We use the current dtype to avoid any overflows
min_dtype = torch.finfo(dtype).min
- causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
- causal_mask = causal_mask.to(dtype=dtype, device=device)
+ causal_mask = self.causal_mask[None, None, :, :].to(dtype=dtype, device=device) * min_dtype
+ causal_mask = causal_mask.expand(batch_size, 1, -1, -1)
if attention_mask is not None and attention_mask.dim() == 2:
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 8ca9397cab740b..f794a7178c5685 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -17,7 +17,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-""" PyTorch LLaMA model."""
+"""PyTorch LLaMA model."""
+
import math
import warnings
from typing import List, Optional, Tuple, Union
@@ -1083,10 +1084,10 @@ def _update_causal_mask(self, attention_mask, input_tensor):
# We use the current dtype to avoid any overflows
min_dtype = torch.finfo(dtype).min
- causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
-
- causal_mask = causal_mask.to(dtype=dtype, device=device)
+ causal_mask = self.causal_mask[None, None, :, :].to(dtype=dtype, device=device) * min_dtype
+ causal_mask = causal_mask.expand(batch_size, 1, -1, -1)
if attention_mask is not None and attention_mask.dim() == 2:
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
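A standalone illustration of why the change helps (assuming a recent PyTorch for `untyped_storage`): `.repeat` materializes one copy of the causal mask per batch element, while `.expand` returns a broadcasted view over the same storage, which is why the in-place padding edit above first takes a `.clone()`.

```python
import torch

mask = torch.full((1, 1, 2048, 2048), torch.finfo(torch.float16).min, dtype=torch.float16)

repeated = mask.repeat(32, 1, 1, 1)    # allocates 32 real copies (~256 MiB here)
expanded = mask.expand(32, 1, -1, -1)  # a view: no new allocation

print(repeated.untyped_storage().nbytes())  # 32 * 2048 * 2048 * 2 bytes
print(expanded.untyped_storage().nbytes())  # 2048 * 2048 * 2 bytes, shared with `mask`
```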
From d45f47ab7f7c31991bb98a0302ded59ab6adac31 Mon Sep 17 00:00:00 2001
From: Park Jun
Date: Thu, 7 Mar 2024 08:57:22 +0900
Subject: [PATCH 106/549] Fix: Disable torch.autocast in RotaryEmbedding of
Gemma and LLaMa for MPS device (#29439)
* Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS devices
* Update src/transformers/models/gemma/modeling_gemma.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update llama and gemma rope to use cpu on mps devices
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
src/transformers/models/gemma/modeling_gemma.py | 2 +-
src/transformers/models/llama/modeling_llama.py | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 8869a617e92d10..479f46825c149a 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -115,7 +115,7 @@ def forward(self, x, position_ids, seq_len=None):
# Force float32 since bfloat16 loses precision on long contexts
# See https://github.com/huggingface/transformers/pull/29285
device_type = x.device.type
- device_type = device_type if isinstance(device_type, str) else "cpu"
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
with torch.autocast(device_type=device_type, enabled=False):
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index f794a7178c5685..c2a81b63bd00b2 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -139,7 +139,7 @@ def forward(self, x, position_ids, seq_len=None):
# Force float32 since bfloat16 loses precision on long contexts
# See https://github.com/huggingface/transformers/pull/29285
device_type = x.device.type
- device_type = device_type if isinstance(device_type, str) else "cpu"
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
with torch.autocast(device_type=device_type, enabled=False):
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
From 979fccc90f0945756db5547ddc5f6788a4fe63d0 Mon Sep 17 00:00:00 2001
From: regisss <15324346+regisss@users.noreply.github.com>
Date: Thu, 7 Mar 2024 10:28:01 +0100
Subject: [PATCH 107/549] Enable BLIP for auto VQA (#29499)
* Enable BLIP for auto VQA
* Make style
* Add VQA to BLIP pipeline tests
---
src/transformers/models/auto/modeling_auto.py | 1 +
tests/models/blip/test_modeling_blip.py | 1 +
2 files changed, 2 insertions(+)
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 109dbb19485916..3f80a36cc367ab 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -964,6 +964,7 @@
MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict(
[
+ ("blip", "BlipForQuestionAnswering"),
("blip-2", "Blip2ForConditionalGeneration"),
("vilt", "ViltForQuestionAnswering"),
]
diff --git a/tests/models/blip/test_modeling_blip.py b/tests/models/blip/test_modeling_blip.py
index 54512596b01c96..4e87dca58fedd0 100644
--- a/tests/models/blip/test_modeling_blip.py
+++ b/tests/models/blip/test_modeling_blip.py
@@ -432,6 +432,7 @@ class BlipModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
{
"feature-extraction": BlipModel,
"image-to-text": BlipForConditionalGeneration,
+ "visual-question-answering": BlipForQuestionAnswering,
}
if is_torch_available()
else {}
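With the mapping entry above, BLIP checkpoints can be served through the VQA pipeline; a minimal sketch (the `Salesforce/blip-vqa-base` checkpoint and the COCO image URL are just illustrative choices):

```python
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
result = vqa(
    image="http://images.cocodataset.org/val2017/000000039769.jpg",
    question="How many cats are there?",
)
print(result)  # e.g. [{"answer": "2"}]
```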
From ffe60fdcd60c17c3f216694160c2521da90f984c Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Thu, 7 Mar 2024 10:44:43 +0000
Subject: [PATCH 108/549] v4.39 deprecations 🧼 (#29492)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
docs/source/en/internal/generation_utils.md | 6 -
docs/source/ja/internal/generation_utils.md | 6 -
docs/source/zh/internal/generation_utils.md | 6 -
src/transformers/__init__.py | 4 -
src/transformers/activations.py | 9 --
src/transformers/generation/__init__.py | 4 -
src/transformers/generation/tf_utils.py | 62 ---------
src/transformers/generation/utils.py | 41 ------
.../models/llama/modeling_llama.py | 13 +-
src/transformers/models/opt/modeling_opt.py | 25 +---
src/transformers/utils/dummy_pt_objects.py | 4 -
src/transformers/utils/dummy_tf_objects.py | 4 -
tests/generation/test_tf_utils.py | 97 -------------
tests/generation/test_utils.py | 128 ------------------
14 files changed, 9 insertions(+), 400 deletions(-)
diff --git a/docs/source/en/internal/generation_utils.md b/docs/source/en/internal/generation_utils.md
index 540594ece015d5..7270af049c3248 100644
--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@@ -336,12 +336,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- process
- finalize
-## Utilities
-
-[[autodoc]] top_k_top_p_filtering
-
-[[autodoc]] tf_top_k_top_p_filtering
-
## Streamers
[[autodoc]] TextStreamer
diff --git a/docs/source/ja/internal/generation_utils.md b/docs/source/ja/internal/generation_utils.md
index 8aa069e4dcd133..d65067fc0bbd4c 100644
--- a/docs/source/ja/internal/generation_utils.md
+++ b/docs/source/ja/internal/generation_utils.md
@@ -335,12 +335,6 @@ generation_output[:2]
- process
- finalize
-## Utilities
-
-[[autodoc]] top_k_top_p_filtering
-
-[[autodoc]] tf_top_k_top_p_filtering
-
## Streamers
[[autodoc]] TextStreamer
diff --git a/docs/source/zh/internal/generation_utils.md b/docs/source/zh/internal/generation_utils.md
index 5d8056bb7d2dae..c82deecd3ddfcc 100644
--- a/docs/source/zh/internal/generation_utils.md
+++ b/docs/source/zh/internal/generation_utils.md
@@ -330,12 +330,6 @@ generation_output[:2]
- process
- finalize
-## Utilities
-
-[[autodoc]] top_k_top_p_filtering
-
-[[autodoc]] tf_top_k_top_p_filtering
-
## Streamers
[[autodoc]] TextStreamer
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index f7c92f033f69ee..da650cc58ff99b 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -1409,7 +1409,6 @@
"TypicalLogitsWarper",
"UnbatchedClassifierFreeGuidanceLogitsProcessor",
"WhisperTimeStampLogitsProcessor",
- "top_k_top_p_filtering",
]
)
_import_structure["generation_utils"] = []
@@ -3814,7 +3813,6 @@
"TFTemperatureLogitsWarper",
"TFTopKLogitsWarper",
"TFTopPLogitsWarper",
- "tf_top_k_top_p_filtering",
]
)
_import_structure["generation_tf_utils"] = []
@@ -6206,7 +6204,6 @@
TypicalLogitsWarper,
UnbatchedClassifierFreeGuidanceLogitsProcessor,
WhisperTimeStampLogitsProcessor,
- top_k_top_p_filtering,
)
from .modeling_utils import PreTrainedModel
from .models.albert import (
@@ -8178,7 +8175,6 @@
TFTemperatureLogitsWarper,
TFTopKLogitsWarper,
TFTopPLogitsWarper,
- tf_top_k_top_p_filtering,
)
from .keras_callbacks import KerasMetricCallback, PushToHubCallback
from .modeling_tf_utils import (
diff --git a/src/transformers/activations.py b/src/transformers/activations.py
index 22f5fe9b1bc2f4..2355fb5fed678d 100644
--- a/src/transformers/activations.py
+++ b/src/transformers/activations.py
@@ -13,7 +13,6 @@
# limitations under the License.
import math
-import warnings
from collections import OrderedDict
import torch
@@ -138,14 +137,6 @@ def forward(self, input: Tensor) -> Tensor:
return 0.5 * input * (1 + torch.tanh(self.precomputed_constant * (input + 0.044715 * torch.pow(input, 3))))
-class SiLUActivation(nn.SiLU):
- def __init__(self, *args, **kwargs):
- warnings.warn(
- "The SiLUActivation class has been deprecated and will be removed in v4.39. Please use nn.SiLU instead.",
- )
- super().__init__(*args, **kwargs)
-
-
class MishActivation(nn.Module):
"""
See Mish: A Self-Regularized Non-Monotonic Activation Function (Misra., https://arxiv.org/abs/1908.08681). Also
diff --git a/src/transformers/generation/__init__.py b/src/transformers/generation/__init__.py
index 178be03861a10c..8f2a6ad9600d97 100644
--- a/src/transformers/generation/__init__.py
+++ b/src/transformers/generation/__init__.py
@@ -88,7 +88,6 @@
]
_import_structure["utils"] = [
"GenerationMixin",
- "top_k_top_p_filtering",
"GreedySearchEncoderDecoderOutput",
"GreedySearchDecoderOnlyOutput",
"SampleEncoderDecoderOutput",
@@ -130,7 +129,6 @@
]
_import_structure["tf_utils"] = [
"TFGenerationMixin",
- "tf_top_k_top_p_filtering",
"TFGreedySearchDecoderOnlyOutput",
"TFGreedySearchEncoderDecoderOutput",
"TFSampleEncoderDecoderOutput",
@@ -241,7 +239,6 @@
GreedySearchEncoderDecoderOutput,
SampleDecoderOnlyOutput,
SampleEncoderDecoderOutput,
- top_k_top_p_filtering,
)
try:
@@ -279,7 +276,6 @@
TFGreedySearchEncoderDecoderOutput,
TFSampleDecoderOnlyOutput,
TFSampleEncoderDecoderOutput,
- tf_top_k_top_p_filtering,
)
try:
diff --git a/src/transformers/generation/tf_utils.py b/src/transformers/generation/tf_utils.py
index 8c2d9fde6ae721..90219c316b6c8c 100644
--- a/src/transformers/generation/tf_utils.py
+++ b/src/transformers/generation/tf_utils.py
@@ -3088,68 +3088,6 @@ def contrastive_search_body_fn(
return generated
-def tf_top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
- """
- Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
-
- Args:
- logits: logits distribution shape (batch size, vocabulary size)
- top_k (`int`, *optional*, defaults to 0):
- If > 0, only keep the top k tokens with highest probability (top-k filtering)
- top_p (`float`, *optional*, defaults to 1.0):
- If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus
- filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
- min_tokens_to_keep (`int`, *optional*, defaults to 1):
- Minimumber of tokens we keep per batch example in the output.
-
- From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
- """
-
- warnings.warn(
- "`tf_top_k_top_p_filtering` is scheduled for deletion in v4.39. Use `TFTopKLogitsWarper` and "
- "`TFTopPLogitsWarper` instead.",
- DeprecationWarning,
- )
-
- logits_shape = shape_list(logits)
-
- if top_k > 0:
- top_k = min(max(top_k, min_tokens_to_keep), logits_shape[-1]) # Safety check
- # Remove all tokens with a probability less than the last token of the top-k
- indices_to_remove = logits < tf.math.top_k(logits, k=top_k)[0][..., -1, None]
- logits = tf.where(indices_to_remove, filter_value, logits)
- if top_p < 1.0:
- sorted_indices = tf.argsort(logits, direction="DESCENDING")
- sorted_logits = tf.gather(
- logits, sorted_indices, axis=-1, batch_dims=1
- ) # expects logits to be of dim (batch_size, vocab_size)
-
- cumulative_probs = tf.math.cumsum(stable_softmax(sorted_logits, axis=-1), axis=-1)
-
- # Remove tokens with cumulative probability above the threshold (token with 0 are kept)
- sorted_indices_to_remove = cumulative_probs > top_p
-
- if min_tokens_to_keep > 1:
- # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
- sorted_indices_to_remove = tf.concat(
- [
- tf.zeros_like(sorted_indices_to_remove[:, :min_tokens_to_keep]),
- sorted_indices_to_remove[:, min_tokens_to_keep:],
- ],
- -1,
- )
-
- # Shift the indices to the right to keep also the first token above the threshold
- sorted_indices_to_remove = tf.concat(
- [tf.zeros_like(sorted_indices_to_remove[:, :1]), sorted_indices_to_remove[:, :-1]],
- -1,
- )
- # scatter sorted tensors to original indexing
- indices_to_remove = scatter_values_on_batch_indices(sorted_indices_to_remove, sorted_indices)
- logits = tf.where(indices_to_remove, filter_value, logits)
- return logits
-
-
def scatter_values_on_batch_indices(values, batch_indices):
shape = shape_list(batch_indices)
# broadcast batch dim to shape
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index e36bed65719c2e..1d7eef755bf984 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -4810,47 +4810,6 @@ def _split_model_outputs(outputs, new_outputs, cur_len, added_len, is_decoder_at
return outputs
-def top_k_top_p_filtering(
- logits: torch.FloatTensor,
- top_k: int = 0,
- top_p: float = 1.0,
- filter_value: float = -float("Inf"),
- min_tokens_to_keep: int = 1,
-) -> torch.FloatTensor:
- """
- Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
-
- Args:
- logits: logits distribution shape (batch size, vocabulary size)
- top_k (`int`, *optional*, defaults to 0):
- If > 0, only keep the top k tokens with highest probability (top-k filtering)
- top_p (`float`, *optional*, defaults to 1.0):
- If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus
- filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
- min_tokens_to_keep (`int`, *optional*, defaults to 1):
- Minimumber of tokens we keep per batch example in the output.
-
- From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
- """
- warnings.warn(
- "`top_k_top_p_filtering` is scheduled for deletion in v4.39. Use `TopKLogitsWarper` and `TopPLogitsWarper` "
- "instead.",
- DeprecationWarning,
- )
-
- if top_k > 0:
- logits = TopKLogitsWarper(top_k=top_k, filter_value=filter_value, min_tokens_to_keep=min_tokens_to_keep)(
- None, logits
- )
-
- if 0 <= top_p <= 1.0:
- logits = TopPLogitsWarper(top_p=top_p, filter_value=filter_value, min_tokens_to_keep=min_tokens_to_keep)(
- None, logits
- )
-
- return logits
-
-
def _ranking_fast(
context_hidden: torch.FloatTensor,
next_hidden: torch.FloatTensor,
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index c2a81b63bd00b2..262db548a1a34e 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -129,10 +129,7 @@ def cos_cached(self):
return self._cos_cached
@torch.no_grad()
- def forward(self, x, position_ids, seq_len=None):
- if seq_len is not None:
- logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
-
+ def forward(self, x, position_ids):
# x: [bs, num_attention_heads, seq_len, head_size]
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
@@ -151,17 +148,17 @@ def forward(self, x, position_ids, seq_len=None):
class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
- def forward(self, x, position_ids, seq_len=None):
+ def forward(self, x, position_ids):
# difference to the original RoPE: a scaling factor is aplied to the position ids
position_ids = position_ids.float() / self.scaling_factor
- cos, sin = super().forward(x, position_ids, seq_len)
+ cos, sin = super().forward(x, position_ids)
return cos, sin
class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
- def forward(self, x, position_ids, seq_len=None):
+ def forward(self, x, position_ids):
# difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
seq_len = torch.max(position_ids) + 1
if seq_len > self.max_position_embeddings:
@@ -173,7 +170,7 @@ def forward(self, x, position_ids, seq_len=None):
)
self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: this may break with compilation
- cos, sin = super().forward(x, position_ids, seq_len)
+ cos, sin = super().forward(x, position_ids)
return cos, sin
diff --git a/src/transformers/models/opt/modeling_opt.py b/src/transformers/models/opt/modeling_opt.py
index 7c66f5c255e584..3af18947fac93f 100644
--- a/src/transformers/models/opt/modeling_opt.py
+++ b/src/transformers/models/opt/modeling_opt.py
@@ -120,27 +120,10 @@ def __init__(
):
super().__init__()
self.config = config
-
- def _handle_deprecated_argument(config_arg_name, config, fn_arg_name, kwargs):
- """
- If a the deprecated argument `fn_arg_name` is passed, raise a deprecation
- warning and return that value, otherwise take the equivalent config.config_arg_name
- """
- val = None
- if fn_arg_name in kwargs:
- logging.warning(
- "Passing in {fn_arg_name} to {self.__class__.__name__} is deprecated and won't be supported from "
- "v4.39. Please set it in the config instead"
- )
- val = kwargs.pop(fn_arg_name)
- else:
- val = getattr(config, config_arg_name)
- return val
-
- self.embed_dim = _handle_deprecated_argument("hidden_size", config, "embed_dim", kwargs)
- self.num_heads = _handle_deprecated_argument("num_attention_heads", config, "num_heads", kwargs)
- self.dropout = _handle_deprecated_argument("attention_dropout", config, "dropout", kwargs)
- self.enable_bias = _handle_deprecated_argument("enable_bias", config, "bias", kwargs)
+ self.embed_dim = config.hidden_size
+ self.num_heads = config.num_attention_heads
+ self.dropout = config.attention_dropout
+ self.enable_bias = config.enable_bias
self.head_dim = self.embed_dim // self.num_heads
self.is_causal = True
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index f30bf7beddc163..c2baa8c58fd23f 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -408,10 +408,6 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
-def top_k_top_p_filtering(*args, **kwargs):
- requires_backends(top_k_top_p_filtering, ["torch"])
-
-
class PreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
diff --git a/src/transformers/utils/dummy_tf_objects.py b/src/transformers/utils/dummy_tf_objects.py
index ba9fa7180e8168..5441883b85a463 100644
--- a/src/transformers/utils/dummy_tf_objects.py
+++ b/src/transformers/utils/dummy_tf_objects.py
@@ -128,10 +128,6 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["tf"])
-def tf_top_k_top_p_filtering(*args, **kwargs):
- requires_backends(tf_top_k_top_p_filtering, ["tf"])
-
-
class KerasMetricCallback(metaclass=DummyObject):
_backends = ["tf"]
diff --git a/tests/generation/test_tf_utils.py b/tests/generation/test_tf_utils.py
index bcb7c6392461ca..f40ceebef76fbc 100644
--- a/tests/generation/test_tf_utils.py
+++ b/tests/generation/test_tf_utils.py
@@ -41,7 +41,6 @@
TFBartForConditionalGeneration,
TFLogitsProcessorList,
TFMinLengthLogitsProcessor,
- tf_top_k_top_p_filtering,
)
from transformers.modeling_tf_utils import keras
@@ -49,102 +48,6 @@
import tensorflow_text as text
-@require_tf
-class UtilsFunctionsTest(unittest.TestCase):
- # tests whether the top_k_top_p_filtering function behaves as expected
- def test_top_k_top_p_filtering(self):
- logits = tf.convert_to_tensor(
- [
- [
- 8.2220991, # 3rd highest value; idx. 0
- -0.5620044,
- 5.23229752,
- 4.0386393,
- -6.8798378,
- -0.54785802,
- -3.2012153,
- 2.92777176,
- 1.88171953,
- 7.35341276, # 5th highest value; idx. 9
- 8.43207833, # 2nd highest value; idx. 10
- -9.85711836,
- -5.96209236,
- -1.13039161,
- -7.1115294,
- -0.8369633,
- -5.3186408,
- 7.06427407,
- 0.81369344,
- -0.82023817,
- -5.9179796,
- 0.58813443,
- -6.99778438,
- 4.71551189,
- -0.18771637,
- 7.44020759, # 4th highest value; idx. 25
- 9.38450987, # 1st highest value; idx. 26
- 2.12662941,
- -9.32562038,
- 2.35652522,
- ], # cummulative prob of 5 highest values <= 0.6
- [
- 0.58425518,
- 4.53139238,
- -5.57510464,
- -6.28030699,
- -7.19529503,
- -4.02122551,
- 1.39337037,
- -6.06707057,
- 1.59480517,
- -9.643119,
- 0.03907799,
- 0.67231762,
- -8.88206726,
- 6.27115922, # 4th highest value; idx. 13
- 2.28520723,
- 4.82767506,
- 4.30421368,
- 8.8275313, # 2nd highest value; idx. 17
- 5.44029958, # 5th highest value; idx. 18
- -4.4735794,
- 7.38579536, # 3rd highest value; idx. 20
- -2.91051663,
- 2.61946077,
- -2.5674762,
- -9.48959302,
- -4.02922645,
- -1.35416918,
- 9.67702323, # 1st highest value; idx. 27
- -5.89478553,
- 1.85370467,
- ], # cummulative prob of 5 highest values <= 0.6
- ],
- dtype=tf.float32,
- )
-
- non_inf_expected_idx = tf.convert_to_tensor(
- [[0, 0], [0, 9], [0, 10], [0, 25], [0, 26], [1, 13], [1, 17], [1, 18], [1, 20], [1, 27]],
- dtype=tf.int32,
- ) # expected non filtered idx as noted above
-
- non_inf_expected_output = tf.convert_to_tensor(
- [8.222099, 7.3534126, 8.432078, 7.4402075, 9.38451, 6.271159, 8.827531, 5.4402995, 7.3857956, 9.677023],
- dtype=tf.float32,
- ) # expected non filtered values as noted above
-
- output = tf_top_k_top_p_filtering(logits, top_k=10, top_p=0.6, min_tokens_to_keep=4)
-
- non_inf_output = output[output != -float("inf")]
- non_inf_idx = tf.cast(
- tf.where(tf.not_equal(output, tf.constant(-float("inf"), dtype=tf.float32))),
- dtype=tf.int32,
- )
-
- tf.debugging.assert_near(non_inf_output, non_inf_expected_output, rtol=1e-12)
- tf.debugging.assert_equal(non_inf_idx, non_inf_expected_idx)
-
-
@require_tf
class TFGenerationIntegrationTests(unittest.TestCase, GenerationIntegrationTestsMixin):
# setting framework_dependent_parameters needs to be gated, just like its contents' imports
diff --git a/tests/generation/test_utils.py b/tests/generation/test_utils.py
index cb224c3c6a9d74..8f7849ea970b01 100644
--- a/tests/generation/test_utils.py
+++ b/tests/generation/test_utils.py
@@ -52,7 +52,6 @@
GPT2Tokenizer,
ImageGPTForCausalImageModeling,
SpeechEncoderDecoderModel,
- top_k_top_p_filtering,
)
from transformers.cache_utils import DynamicCache
from transformers.generation import (
@@ -2345,133 +2344,6 @@ def _check_sequence_inside_sequence(self, tensor_1, tensor_2):
@require_torch
class UtilsFunctionsTest(unittest.TestCase):
- # tests whether the top_k_top_p function behaves as expected
- def test_top_k_top_p_filtering(self):
- logits = torch.tensor(
- [
- [
- 8.2220991, # 3rd highest value; idx. 0
- -0.5620044,
- 5.23229752,
- 4.0386393,
- -6.8798378,
- -0.54785802,
- -3.2012153,
- 2.92777176,
- 1.88171953,
- 7.35341276,
- 8.43207833, # 2nd highest value; idx. 10
- -9.85711836,
- -5.96209236,
- -1.13039161,
- -7.1115294,
- -0.8369633,
- -5.3186408,
- 7.06427407,
- 0.81369344,
- -0.82023817,
- -5.9179796,
- 0.58813443,
- -6.99778438,
- 4.71551189,
- -0.18771637,
- 7.44020759, # 4th highest value; idx. 25
- 9.38450987, # 1st highest value; idx. 26
- 2.12662941,
- -9.32562038,
- 2.35652522,
- ], # cummulative prob of 4 highest values <= 0.6
- [
- 0.58425518,
- 4.53139238,
- -5.57510464,
- -6.28030699,
- -7.19529503,
- -4.02122551,
- 1.39337037,
- -6.06707057,
- 1.59480517,
- -9.643119,
- 0.03907799,
- 0.67231762,
- -8.88206726,
- 6.27115922, # 4th highest value; idx. 13
- 2.28520723,
- 4.82767506,
- 4.30421368,
- 8.8275313, # 2nd highest value; idx. 17
- 5.44029958,
- -4.4735794,
- 7.38579536, # 3rd highest value; idx. 20
- -2.91051663,
- 2.61946077,
- -2.5674762,
- -9.48959302,
- -4.02922645,
- -1.35416918,
- 9.67702323, # 1st highest value; idx. 27
- -5.89478553,
- 1.85370467,
- ], # cummulative prob of 4 highest values <= 0.6
- ],
- dtype=torch.float,
- device=torch_device,
- )
-
- non_inf_expected_idx = torch.tensor(
- [[0, 0], [0, 10], [0, 25], [0, 26], [1, 13], [1, 17], [1, 20], [1, 27]],
- dtype=torch.long,
- device=torch_device,
- ) # expected non filtered idx as noted above
-
- non_inf_expected_output = torch.tensor(
- [
- 8.2221,
- 8.4321,
- 7.4402,
- 9.3845,
- 6.2712,
- 8.8275,
- 7.3858,
- 9.6770,
- ], # expected non filtered values as noted above
- dtype=torch.float,
- device=torch_device,
- )
-
- output = top_k_top_p_filtering(logits, top_k=10, top_p=0.6, min_tokens_to_keep=4)
- non_inf_output = output[output != -float("inf")].to(device=torch_device)
- non_inf_idx = (output != -float("inf")).nonzero().to(device=torch_device)
-
- self.assertTrue(torch.allclose(non_inf_expected_output, non_inf_output, atol=1e-12))
- self.assertTrue(torch.all(torch.eq(non_inf_expected_idx, non_inf_idx)))
-
- # tests whether the function uses filter_value instead of default -inf
- def test_top_k_top_p_filtering_with_filter_value(self):
- logits = torch.tensor(
- [
- [
- 1,
- 1,
- 1,
- 0.99, # get filtered by top-p filtering
- 0.98, # get filtered by top-k filtering
- ]
- ],
- dtype=torch.float,
- device=torch_device,
- )
-
- expected_output = torch.tensor(
- [[1, 1, 1, 0, 0]],
- dtype=torch.float,
- device=torch_device,
- )
-
- output = top_k_top_p_filtering(logits, top_k=4, top_p=0.5, filter_value=0.0)
-
- self.assertTrue(torch.allclose(expected_output, output, atol=1e-12))
-
def test_speculative_sampling(self):
# assume vocab size 10, input length 5 + 3 generated candidates
candidate_input_ids = torch.tensor([[8, 0, 3, 9, 8, 1, 4, 5]]) # input tokens
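As a standalone sketch (not applied by this patch), the filtering behaviour exercised by the deleted tests can still be reproduced with the logits warpers that remain public; `TopKLogitsWarper` and `TopPLogitsWarper` are assumed to keep their current constructor arguments:

import torch
from transformers import TopKLogitsWarper, TopPLogitsWarper

logits = torch.randn(2, 30)                             # (batch_size, vocab_size)
dummy_input_ids = torch.zeros(2, 1, dtype=torch.long)   # ignored by these warpers

# chain top-k then top-p, mirroring top_k_top_p_filtering(top_k=10, top_p=0.6, min_tokens_to_keep=4)
for warper in (TopKLogitsWarper(top_k=10, min_tokens_to_keep=4),
               TopPLogitsWarper(top_p=0.6, min_tokens_to_keep=4)):
    logits = warper(dummy_input_ids, logits)            # filtered positions are set to -inf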
From f6133d767a6bbc0614bc889cd0624fb0842cd643 Mon Sep 17 00:00:00 2001
From: Lysandre Debut
Date: Thu, 7 Mar 2024 12:12:41 +0100
Subject: [PATCH 109/549] Revert "Automatic safetensors conversion when lacking these files (#2… (#29507)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Revert "Automatic safetensors conversion when lacking these files (#29390)"
This reverts commit a69cbf4e64c7bc054d814d64f6877180f7cd3a25.
---
src/transformers/modeling_utils.py | 37 ++---------------------
tests/test_modeling_utils.py | 48 +-----------------------------
2 files changed, 4 insertions(+), 81 deletions(-)
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 5aa9d0a770cfa1..0e322e0557fdb4 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -29,7 +29,6 @@
from contextlib import contextmanager
from dataclasses import dataclass
from functools import partial, wraps
-from threading import Thread
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from zipfile import is_zipfile
@@ -3208,39 +3207,9 @@ def from_pretrained(
)
if resolved_archive_file is not None:
is_sharded = True
-
- if resolved_archive_file is not None:
- if filename in [WEIGHTS_NAME, WEIGHTS_INDEX_NAME]:
- # If the PyTorch file was found, check if there is a safetensors file on the repository
- # If there is no safetensors file on the repositories, start an auto conversion
- safe_weights_name = SAFE_WEIGHTS_INDEX_NAME if is_sharded else SAFE_WEIGHTS_NAME
- has_file_kwargs = {
- "revision": revision,
- "proxies": proxies,
- "token": token,
- }
- cached_file_kwargs = {
- "cache_dir": cache_dir,
- "force_download": force_download,
- "resume_download": resume_download,
- "local_files_only": local_files_only,
- "user_agent": user_agent,
- "subfolder": subfolder,
- "_raise_exceptions_for_gated_repo": False,
- "_raise_exceptions_for_missing_entries": False,
- "_commit_hash": commit_hash,
- **has_file_kwargs,
- }
- if not has_file(pretrained_model_name_or_path, safe_weights_name, **has_file_kwargs):
- Thread(
- target=auto_conversion,
- args=(pretrained_model_name_or_path,),
- kwargs=cached_file_kwargs,
- name="Thread-autoconversion",
- ).start()
- else:
- # Otherwise, no PyTorch file was found, maybe there is a TF or Flax model file.
- # We try those to give a helpful error message.
+ if resolved_archive_file is None:
+ # Otherwise, maybe there is a TF or Flax model file. We try those to give a helpful error
+ # message.
has_file_kwargs = {
"revision": revision,
"proxies": proxies,
diff --git a/tests/test_modeling_utils.py b/tests/test_modeling_utils.py
index 57f0f11dbb8a06..1f277c7504561f 100755
--- a/tests/test_modeling_utils.py
+++ b/tests/test_modeling_utils.py
@@ -20,7 +20,6 @@
import os.path
import sys
import tempfile
-import threading
import unittest
import unittest.mock as mock
import uuid
@@ -1429,7 +1428,7 @@ def test_safetensors_on_the_fly_wrong_user_opened_pr(self):
bot_opened_pr_title = None
for discussion in discussions:
- if discussion.author == "SFconvertbot":
+ if discussion.author == "SFconvertBot":
bot_opened_pr = True
bot_opened_pr_title = discussion.title
@@ -1452,51 +1451,6 @@ def test_safetensors_on_the_fly_specific_revision(self):
with self.assertRaises(EnvironmentError):
BertModel.from_pretrained(self.repo_name, use_safetensors=True, token=self.token, revision="new-branch")
- def test_absence_of_safetensors_triggers_conversion(self):
- config = BertConfig(
- vocab_size=99, hidden_size=32, num_hidden_layers=5, num_attention_heads=4, intermediate_size=37
- )
- initial_model = BertModel(config)
-
- # Push a model on `main`
- initial_model.push_to_hub(self.repo_name, token=self.token, safe_serialization=False)
-
- # Download the model that doesn't have safetensors
- BertModel.from_pretrained(self.repo_name, token=self.token)
-
- for thread in threading.enumerate():
- if thread.name == "Thread-autoconversion":
- thread.join(timeout=10)
-
- with self.subTest("PR was open with the safetensors account"):
- discussions = self.api.get_repo_discussions(self.repo_name)
-
- bot_opened_pr = None
- bot_opened_pr_title = None
-
- for discussion in discussions:
- if discussion.author == "SFconvertbot":
- bot_opened_pr = True
- bot_opened_pr_title = discussion.title
-
- self.assertTrue(bot_opened_pr)
- self.assertEqual(bot_opened_pr_title, "Adding `safetensors` variant of this model")
-
- @mock.patch("transformers.safetensors_conversion.spawn_conversion")
- def test_absence_of_safetensors_triggers_conversion_failed(self, spawn_conversion_mock):
- spawn_conversion_mock.side_effect = HTTPError()
-
- config = BertConfig(
- vocab_size=99, hidden_size=32, num_hidden_layers=5, num_attention_heads=4, intermediate_size=37
- )
- initial_model = BertModel(config)
-
- # Push a model on `main`
- initial_model.push_to_hub(self.repo_name, token=self.token, safe_serialization=False)
-
- # The auto conversion is mocked to always raise; ensure that it doesn't raise in the main thread
- BertModel.from_pretrained(self.repo_name, token=self.token)
-
@require_torch
@is_staging_test
From 9288e759adb3a0af595d7b167b2cf6a7b558b8e6 Mon Sep 17 00:00:00 2001
From: Ashok Pon Kumar
Date: Thu, 7 Mar 2024 17:14:23 +0530
Subject: [PATCH 110/549] fix: Avoid error when fsdp_config is missing
xla_fsdp_v2 (#29480)
Signed-off-by: Ashok Pon Kumar Sree Prakash
---
src/transformers/trainer.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 056f7a2ca96e34..574363421234b3 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -647,7 +647,7 @@ def __init__(
if args.torch_compile and not is_torch_compile_available():
raise RuntimeError("Using torch.compile requires PyTorch 2.0 or higher.")
- self.is_fsdp_xla_v2_enabled = args.fsdp_config["xla_fsdp_v2"]
+ self.is_fsdp_xla_v2_enabled = args.fsdp_config.get("xla_fsdp_v2", False)
if self.is_fsdp_xla_v2_enabled:
# Prepare the SPMD mesh that is going to be used by the data loader and the FSDPv2 wrapper.
# Tensor axis is just a placeholder where it will not be used in FSDPv2.
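For context, a minimal sketch (separate from the patch) of why indexing the dict directly crashes on configs that omit the key, while `.get` falls back cleanly; the config content is made up:

fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": ["Block"]}   # hypothetical config without "xla_fsdp_v2"
# fsdp_config["xla_fsdp_v2"]                                      # raises KeyError
is_fsdp_xla_v2_enabled = fsdp_config.get("xla_fsdp_v2", False)    # -> False, no error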
From 923733c22bf4d3cc6661c8cd3b730b275e9a938e Mon Sep 17 00:00:00 2001
From: Raushan Turganbay
Date: Thu, 7 Mar 2024 16:45:47 +0500
Subject: [PATCH 111/549] Flava multimodal add attention mask (#29446)
* flava multimodal add attn mask
* make style
* check mask is not None
---
src/transformers/models/flava/modeling_flava.py | 12 +++++++++++-
tests/models/flava/test_modeling_flava.py | 16 ++++++++--------
2 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/src/transformers/models/flava/modeling_flava.py b/src/transformers/models/flava/modeling_flava.py
index f96e4292a1a360..0e5cfe1b68c441 100644
--- a/src/transformers/models/flava/modeling_flava.py
+++ b/src/transformers/models/flava/modeling_flava.py
@@ -1415,8 +1415,18 @@ def forward(
multimodal_embeddings = None
multimodal_output = None
if image_mm_projection is not None and text_mm_projection is not None and not skip_multimodal_encoder:
+ if attention_mask is not None:
+ batch_size, seq_len, _ = image_mm_projection.shape
+ if self.multimodal_model.use_cls_token:
+ seq_len += 1
+ attention_mask_image = torch.ones(batch_size, seq_len, device=image_mm_projection.device)
+ attention_multimodal = torch.cat([attention_mask_image, attention_mask], dim=1)
+ else:
+ attention_multimodal = None
multimodal_input = torch.cat([image_mm_projection, text_mm_projection], dim=1)
- multimodal_output = self.multimodal_model(multimodal_input, return_dict=return_dict)
+ multimodal_output = self.multimodal_model(
+ multimodal_input, attention_mask=attention_multimodal, return_dict=return_dict
+ )
multimodal_embeddings = multimodal_output[0]
if not return_dict:
diff --git a/tests/models/flava/test_modeling_flava.py b/tests/models/flava/test_modeling_flava.py
index 48a070d9fe3137..b17a6f7b543e6a 100644
--- a/tests/models/flava/test_modeling_flava.py
+++ b/tests/models/flava/test_modeling_flava.py
@@ -1287,9 +1287,9 @@ def test_inference(self):
outputs = model(**inputs, return_dict=True)
# verify the embeddings
- self.assertAlmostEqual(outputs.image_embeddings.sum().item(), -1352.53540, places=4)
+ self.assertAlmostEqual(outputs.image_embeddings.sum().item(), -1352.54943, places=4)
self.assertAlmostEqual(outputs.text_embeddings.sum().item(), -198.98225, places=4)
- self.assertAlmostEqual(outputs.multimodal_embeddings.sum().item(), -3988.51367, places=4)
+ self.assertAlmostEqual(outputs.multimodal_embeddings.sum().item(), -4030.466552, places=4)
@require_vision
@@ -1339,9 +1339,9 @@ def test_inference(self):
expected_logits = torch.tensor([[16.1291, 8.4033], [16.1291, 8.4033]], device=torch_device)
self.assertTrue(torch.allclose(outputs.contrastive_logits_per_image, expected_logits, atol=1e-3))
- self.assertAlmostEqual(outputs.loss_info.mmm_text.item(), 1.75533199, places=4)
- self.assertAlmostEqual(outputs.loss_info.mmm_image.item(), 7.0290069, places=4)
- self.assertAlmostEqual(outputs.loss.item(), 11.0626, places=4)
+ self.assertAlmostEqual(outputs.loss_info.mmm_text.item(), 2.0736470, places=4)
+ self.assertAlmostEqual(outputs.loss_info.mmm_image.item(), 7.025580, places=4)
+ self.assertAlmostEqual(outputs.loss.item(), 11.37761, places=4)
@slow
def test_inference_with_itm_labels(self):
@@ -1390,6 +1390,6 @@ def test_inference_with_itm_labels(self):
expected_logits = torch.tensor([[16.1291, 8.4033], [16.1291, 8.4033]], device=torch_device)
self.assertTrue(torch.allclose(outputs.contrastive_logits_per_image, expected_logits, atol=1e-3))
- self.assertAlmostEqual(outputs.loss_info.mmm_text.item(), 1.75533199, places=4)
- self.assertAlmostEqual(outputs.loss_info.mmm_image.item(), 6.89590501, places=4)
- self.assertAlmostEqual(outputs.loss.item(), 9.1995, places=4)
+ self.assertAlmostEqual(outputs.loss_info.mmm_text.item(), 2.0736470, places=4)
+ self.assertAlmostEqual(outputs.loss_info.mmm_image.item(), 6.8962264, places=4)
+ self.assertAlmostEqual(outputs.loss.item(), 9.6090, places=4)
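A minimal sketch (separate from the patch, shapes made up) of the mask construction added above: the image half of the multimodal sequence is never padded, so it gets an all-ones mask, optionally extended by one position for the CLS token, and is concatenated with the text attention mask:

import torch

batch_size, image_seq_len, text_seq_len, hidden = 2, 4, 6, 8
image_mm_projection = torch.randn(batch_size, image_seq_len, hidden)
text_attention_mask = torch.ones(batch_size, text_seq_len)
use_cls_token = True

seq_len = image_seq_len + (1 if use_cls_token else 0)
attention_mask_image = torch.ones(batch_size, seq_len)
attention_multimodal = torch.cat([attention_mask_image, text_attention_mask], dim=1)
# shape: (batch_size, image_seq_len [+ 1 for CLS] + text_seq_len), matching the concatenated embeddings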
From 45c065109074d60c587d3e562f16531d02a422f6 Mon Sep 17 00:00:00 2001
From: Alex Ishida
Date: Thu, 7 Mar 2024 22:51:59 +0900
Subject: [PATCH 112/549] Add support for metadata format MLX (#29335)
Add support for loading safetensors files saved with metadata format mlx.
---
src/transformers/modeling_utils.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 0e322e0557fdb4..505c9cb45950cb 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -504,7 +504,7 @@ def load_state_dict(checkpoint_file: Union[str, os.PathLike]):
# Check format of the archive
with safe_open(checkpoint_file, framework="pt") as f:
metadata = f.metadata()
- if metadata.get("format") not in ["pt", "tf", "flax"]:
+ if metadata.get("format") not in ["pt", "tf", "flax", "mlx"]:
raise OSError(
f"The safetensors archive passed at {checkpoint_file} does not contain the valid metadata. Make sure "
"you save your model with the `save_pretrained` method."
From 4ed9ae623d16876ad84ea89dfdf1c9378e36961b Mon Sep 17 00:00:00 2001
From: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Date: Thu, 7 Mar 2024 17:30:28 +0000
Subject: [PATCH 113/549] test_generation_config_is_loaded_with_model - fall
back to pytorch model for now (#29521)
* Fall back to pytorch model for now
* Fix up
---
tests/test_modeling_utils.py | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tests/test_modeling_utils.py b/tests/test_modeling_utils.py
index 1f277c7504561f..d0db5031e8b7a0 100755
--- a/tests/test_modeling_utils.py
+++ b/tests/test_modeling_utils.py
@@ -1188,12 +1188,14 @@ def test_generation_config_is_loaded_with_model(self):
# `transformers_version` field set to `foo`. If loading the file fails, this test also fails.
# 1. Load without further parameters
- model = AutoModelForCausalLM.from_pretrained("joaogante/tiny-random-gpt2-with-generation-config")
+ model = AutoModelForCausalLM.from_pretrained(
+ "joaogante/tiny-random-gpt2-with-generation-config", use_safetensors=False
+ )
self.assertEqual(model.generation_config.transformers_version, "foo")
# 2. Load with `device_map`
model = AutoModelForCausalLM.from_pretrained(
- "joaogante/tiny-random-gpt2-with-generation-config", device_map="auto"
+ "joaogante/tiny-random-gpt2-with-generation-config", device_map="auto", use_safetensors=False
)
self.assertEqual(model.generation_config.transformers_version, "foo")
From ddf177ee4af89750a086d36e81c472f6aa7fe5bc Mon Sep 17 00:00:00 2001
From: Alvaro Bartolome
Date: Thu, 7 Mar 2024 21:43:57 +0100
Subject: [PATCH 114/549] Set `inputs` as kwarg in `TextClassificationPipeline`
(#29495)
* Set `inputs` as kwarg in `TextClassificationPipeline`
This change aligns the `TextClassificationPipeline` with the rest of the pipelines and makes calls such as `pipeline(**{"inputs": "text"})` possible, which they weren't before because the `*args` signature was used instead (see the usage sketch after this patch).
* Add `noqa: C409` on `tuple([inputs],)`
Even though it is discouraged by the linter, the cast `tuple(list(...),)` is required here, as otherwise the original list in `inputs` would be transformed into a `tuple` and elements 1...N would be ignored by the `Pipeline`.
* Run `ruff format`
* Simplify `tuple` conversion with `(inputs,)`
Co-authored-by: Matt
---------
Co-authored-by: Matt
---
src/transformers/pipelines/text_classification.py | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/src/transformers/pipelines/text_classification.py b/src/transformers/pipelines/text_classification.py
index 0c54fe1706c034..6521da098d4cdf 100644
--- a/src/transformers/pipelines/text_classification.py
+++ b/src/transformers/pipelines/text_classification.py
@@ -118,12 +118,12 @@ def _sanitize_parameters(self, return_all_scores=None, function_to_apply=None, t
postprocess_params["function_to_apply"] = function_to_apply
return preprocess_params, {}, postprocess_params
- def __call__(self, *args, **kwargs):
+ def __call__(self, inputs, **kwargs):
"""
Classify the text(s) given as inputs.
Args:
- args (`str` or `List[str]` or `Dict[str]`, or `List[Dict[str]]`):
+ inputs (`str` or `List[str]` or `Dict[str]`, or `List[Dict[str]]`):
One or several texts to classify. In order to use text pairs for your classification, you can send a
dictionary containing `{"text", "text_pair"}` keys, or a list of those.
top_k (`int`, *optional*, defaults to `1`):
@@ -152,10 +152,11 @@ def __call__(self, *args, **kwargs):
If `top_k` is used, one such dictionary is returned per label.
"""
- result = super().__call__(*args, **kwargs)
+ inputs = (inputs,)
+ result = super().__call__(*inputs, **kwargs)
# TODO try and retrieve it in a nicer way from _sanitize_parameters.
_legacy = "top_k" not in kwargs
- if isinstance(args[0], str) and _legacy:
+ if isinstance(inputs[0], str) and _legacy:
# This pipeline is odd, and return a list when single item is run
return [result]
else:
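A usage sketch of the new signature (separate from the patch; the checkpoint name is only an example):

from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

classifier("This movie was great!")                 # positional, as before
classifier(inputs="This movie was great!")          # now also valid as a keyword
classifier(**{"inputs": "This movie was great!"})   # the call the commit message mentions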
From b338a6c3b8eda29610d4d472cad8cd87cbfdaaed Mon Sep 17 00:00:00 2001
From: Nick DeGroot
Date: Thu, 7 Mar 2024 12:45:51 -0800
Subject: [PATCH 115/549] Fix `VisionEncoderDecoder` Positional Arg (#29497)
* :bug: Fix vision encoder decoder positional arg
* :white_check_mark: Add test for VisionEncoderDecoder with LayoutLMv3 encoder
---------
Co-authored-by: Nick DeGroot <1966472+nickthegroot@users.noreply.github.com>
---
.../modeling_vision_encoder_decoder.py | 2 +-
.../test_modeling_vision_encoder_decoder.py | 124 ++++++++++++++++++
2 files changed, 125 insertions(+), 1 deletion(-)
diff --git a/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py b/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py
index 88b5efd0476086..4b67c1bd3db083 100644
--- a/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py
+++ b/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py
@@ -573,7 +573,7 @@ def forward(
raise ValueError("You have to specify pixel_values")
encoder_outputs = self.encoder(
- pixel_values,
+ pixel_values=pixel_values,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
diff --git a/tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py b/tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py
index 7cc27a34554324..3239b507a8172f 100644
--- a/tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py
+++ b/tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py
@@ -38,6 +38,7 @@
from ..bart.test_modeling_bart import BartModelTester
from ..bert.test_modeling_bert import BertModelTester
from ..deit.test_modeling_deit import DeiTModelTester
+from ..layoutlmv3.test_modeling_layoutlmv3 import LayoutLMv3ModelTester
from ..swin.test_modeling_swin import SwinModelTester
from ..trocr.test_modeling_trocr import TrOCRStandaloneDecoderModelTester
from ..vit.test_modeling_vit import ViTModelTester
@@ -52,6 +53,7 @@
BartForCausalLM,
BertLMHeadModel,
DeiTModel,
+ LayoutLMv3Model,
SwinModel,
TrOCRForCausalLM,
VisionEncoderDecoderConfig,
@@ -680,6 +682,128 @@ def test_real_model_save_load_from_pretrained(self):
pass
+@require_torch
+class LayoutLMv32TrOCR(EncoderDecoderMixin, unittest.TestCase):
+ def get_encoder_decoder_model(self, config, decoder_config):
+ encoder_model = LayoutLMv3Model(config).eval()
+ decoder_model = TrOCRForCausalLM(decoder_config).eval()
+ return encoder_model, decoder_model
+
+ def prepare_config_and_inputs(self):
+ model_tester_encoder = LayoutLMv3ModelTester(self, batch_size=13, image_size=4, patch_size=2)
+ model_tester_decoder = TrOCRStandaloneDecoderModelTester(
+ self, batch_size=13, d_model=32, max_position_embeddings=512
+ )
+ encoder_config_and_inputs = model_tester_encoder.prepare_config_and_inputs()
+ decoder_config_and_inputs = model_tester_decoder.prepare_config_and_inputs()
+ (
+ config,
+ input_ids,
+ bbox,
+ pixel_values,
+ token_type_ids,
+ input_mask,
+ sequence_labels,
+ token_labels,
+ ) = encoder_config_and_inputs
+ (decoder_config, decoder_input_ids, decoder_attention_mask, _) = decoder_config_and_inputs
+
+ # make sure that cross attention layers are added
+ decoder_config.add_cross_attention = True
+ # disable cache for now
+ decoder_config.use_cache = False
+ return {
+ "config": config,
+ "pixel_values": pixel_values,
+ "input_ids": input_ids,
+ "bbox": bbox,
+ "decoder_config": decoder_config,
+ "decoder_input_ids": decoder_input_ids,
+ "decoder_attention_mask": decoder_attention_mask,
+ "labels": decoder_input_ids,
+ }
+
+ def check_encoder_decoder_model_output_attentions(
+ self,
+ config,
+ decoder_config,
+ decoder_input_ids,
+ decoder_attention_mask,
+ input_ids,
+ pixel_values,
+ labels=None,
+ **kwargs,
+ ):
+ # make the decoder inputs a different shape from the encoder inputs to harden the test
+ decoder_input_ids = decoder_input_ids[:, :-1]
+ decoder_attention_mask = decoder_attention_mask[:, :-1]
+ encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config)
+ enc_dec_model = VisionEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model)
+ enc_dec_model.to(torch_device)
+ outputs_encoder_decoder = enc_dec_model(
+ input_ids=input_ids,
+ pixel_values=pixel_values,
+ decoder_input_ids=decoder_input_ids,
+ decoder_attention_mask=decoder_attention_mask,
+ output_attentions=True,
+ **kwargs,
+ )
+
+ encoder_attentions = outputs_encoder_decoder["encoder_attentions"]
+ self.assertEqual(len(encoder_attentions), config.num_hidden_layers)
+
+ # LayoutLMv3's sequence length equals the number of text tokens + number of patches + 1 (we add 1 for the CLS token)
+ text_seq_length = input_ids.shape[-1]
+ image_seq_length = (encoder_model.config.input_size // encoder_model.config.patch_size) ** 2 + 1
+ seq_len = text_seq_length + image_seq_length
+
+ decoder_attentions = outputs_encoder_decoder["decoder_attentions"]
+ num_decoder_layers = (
+ decoder_config.num_decoder_layers
+ if hasattr(decoder_config, "num_decoder_layers")
+ else decoder_config.num_hidden_layers
+ )
+ self.assertEqual(len(decoder_attentions), num_decoder_layers)
+
+ self.assertEqual(
+ decoder_attentions[0].shape[-3:],
+ (decoder_config.num_attention_heads, decoder_input_ids.shape[-1], decoder_input_ids.shape[-1]),
+ )
+
+ cross_attentions = outputs_encoder_decoder["cross_attentions"]
+ self.assertEqual(len(cross_attentions), num_decoder_layers)
+
+ cross_attention_input_seq_len = decoder_input_ids.shape[-1]
+ self.assertEqual(
+ cross_attentions[0].shape[-3:],
+ (decoder_config.num_attention_heads, cross_attention_input_seq_len, seq_len),
+ )
+
+ def check_encoder_decoder_model_generate(self, config, decoder_config, pixel_values=None, **kwargs):
+ encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config)
+ enc_dec_model = VisionEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model)
+
+ # Generate until max length
+ if hasattr(enc_dec_model.config, "eos_token_id"):
+ enc_dec_model.config.eos_token_id = None
+ if hasattr(enc_dec_model.config, "decoder") and hasattr(enc_dec_model.config.decoder, "eos_token_id"):
+ enc_dec_model.config.decoder.eos_token_id = None
+ if hasattr(enc_dec_model.generation_config, "eos_token_id"):
+ enc_dec_model.generation_config.eos_token_id = None
+ enc_dec_model.to(torch_device)
+
+ generated_output = enc_dec_model.generate(
+ pixel_values=pixel_values,
+ decoder_start_token_id=enc_dec_model.config.decoder.bos_token_id,
+ **kwargs,
+ )
+ self.assertEqual(generated_output.shape, (pixel_values.shape[0],) + (decoder_config.max_length,))
+
+ @unittest.skip("There are no published pretrained TrOCR checkpoints for now")
+ def test_real_model_save_load_from_pretrained(self):
+ pass
+
+
@require_vision
@require_torch
class TrOCRModelIntegrationTest(unittest.TestCase):
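For context (separate from the patch): encoders such as ViT take `pixel_values` as their first forward argument, but `LayoutLMv3Model.forward` starts with `input_ids`, so a positional call would bind the pixel values to the wrong parameter. A small sketch of the difference, assuming both models keep their current signatures:

import inspect
from transformers import LayoutLMv3Model, ViTModel

print(list(inspect.signature(ViTModel.forward).parameters)[1])         # expected: 'pixel_values'
print(list(inspect.signature(LayoutLMv3Model.forward).parameters)[1])  # expected: 'input_ids'; pixel_values comes later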
From 631fa7bf6bb049b1d2a671fe90af32a22b2b664d Mon Sep 17 00:00:00 2001
From: Pedro Cuenca
Date: Fri, 8 Mar 2024 10:47:44 +0100
Subject: [PATCH 116/549] Typo in mlx tensor support (#29509)
Potential typo in mlx support
---
src/transformers/utils/generic.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/utils/generic.py b/src/transformers/utils/generic.py
index 28e63ce45b8eae..d5762337b50ac8 100644
--- a/src/transformers/utils/generic.py
+++ b/src/transformers/utils/generic.py
@@ -242,7 +242,7 @@ def is_jax_tensor(x):
def _is_mlx(x):
- import mx.core as mx
+ import mlx.core as mx
return isinstance(x, mx.array)
From bc764f42639d245114eaa077b4712aac5643603b Mon Sep 17 00:00:00 2001
From: Joao Gante
Date: Fri, 8 Mar 2024 10:06:46 +0000
Subject: [PATCH 117/549] Generate: left-padding test, revisited (#29515)
* left-padding test revisited
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---
tests/generation/test_utils.py | 91 +++++++++++--------
tests/models/bart/test_modeling_bart.py | 4 -
.../test_modeling_bigbird_pegasus.py | 4 -
.../blenderbot/test_modeling_blenderbot.py | 4 -
.../test_modeling_blenderbot_small.py | 4 -
tests/models/ctrl/test_modeling_ctrl.py | 4 -
tests/models/marian/test_modeling_marian.py | 4 -
tests/models/mbart/test_modeling_mbart.py | 4 -
tests/models/mvp/test_modeling_mvp.py | 4 -
tests/models/pegasus/test_modeling_pegasus.py | 4 -
tests/models/plbart/test_modeling_plbart.py | 4 -
.../prophetnet/test_modeling_prophetnet.py | 4 -
tests/models/whisper/test_modeling_whisper.py | 4 -
13 files changed, 55 insertions(+), 84 deletions(-)
diff --git a/tests/generation/test_utils.py b/tests/generation/test_utils.py
index 8f7849ea970b01..425db5ecdcf417 100644
--- a/tests/generation/test_utils.py
+++ b/tests/generation/test_utils.py
@@ -1833,49 +1833,68 @@ def test_generate_with_head_masking(self):
self.assertEqual(sum([w.sum().item() for w in attn_weights]), 0.0)
def test_left_padding_compatibility(self):
- # The check done in this test is fairly difficult -- depending on the model architecture, passing the right
- # position index for the position embeddings can still result in a different output, due to numerical masking.
- # On the other hand, for some types of position embeddings, an incorrect position index can have a minimal
- # impact on the output.
- # There are two tricks employed to check whether left-padding compatibility is in place:
- # 1 - To reduce the negative impact of the numerical attention mask on a correct position index, we set the
- # padding size to 1.
- # 2 - To reduce the chance of false positives (i.e. passing when it should be failing), we run the check
- # multiple times with random inputs, and it has to pass with all of them.
- # NOTE: because of 2), there is some chance of false positives in this test.
+ # NOTE: left-padding results in small numerical differences. This is expected.
+ # See https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535
+ # First, filter out models that don't support left padding
+ # - The model must have generative capabilities
+ if len(self.all_generative_model_classes) == 0:
+ self.skipTest(reason="No generative architecture available for this model.")
+
+ # - The model must be a decoder-only architecture (encoder-based architectures use right-padding)
+ decoder_only_classes = []
for model_class in self.all_generative_model_classes:
config, _, _, _ = self._get_input_ids_and_config()
if config.is_encoder_decoder:
- continue # skip for encoder-decoder models -- they don't need left-padding compatibility
+ continue
+ else:
+ decoder_only_classes.append(model_class)
+ if len(decoder_only_classes) == 0:
+ self.skipTest(reason="No decoder-only architecture available for this model.")
+
+ # - Decoder-only architectures derived from encoder-decoder models could support it in theory, but we haven't
+ # added support for it yet. We skip these models for now.
+ has_encoder_attributes = any(
+ attr_name
+ for attr_name in config.to_dict().keys()
+ if attr_name.startswith("encoder") and attr_name != "encoder_no_repeat_ngram_size"
+ )
+ if has_encoder_attributes:
+ self.skipTest(
+ reason="The decoder-only derived from encoder-decoder models are not expected to support left-padding."
+ )
+
+ # Then, test left-padding
+ def _prepare_model_kwargs(input_ids, attention_mask, signature):
+ model_kwargs = {"input_ids": input_ids, "attention_mask": attention_mask}
+ if "position_ids" in signature:
+ position_ids = torch.cumsum(attention_mask, dim=-1) - 1
+ position_ids.masked_fill_(attention_mask == 0, 1)
+ model_kwargs["position_ids"] = position_ids
+ if "cache_position" in signature:
+ cache_position = torch.arange(input_ids.shape[-1], device=torch_device)
+ model_kwargs["cache_position"] = cache_position
+ return model_kwargs
+
+ for model_class in decoder_only_classes:
+ config, input_ids, attention_mask, _ = self._get_input_ids_and_config()
model = model_class(config).to(torch_device).eval()
signature = inspect.signature(model.forward).parameters.keys()
- no_failures = True
- for _ in range(10): # there may be false positives with 10 runs, we rely on the CI to catch the flakiness
- _, input_ids, attention_mask, _ = self._get_input_ids_and_config()
- model_kwargs = {"input_ids": input_ids, "attention_mask": attention_mask}
- if "position_ids" in signature:
- position_ids = torch.cumsum(attention_mask, dim=-1) - 1
- position_ids.masked_fill_(attention_mask == 0, 1)
- model_kwargs["position_ids"] = position_ids
- next_logits_wo_padding = model(**model_kwargs).logits[:, -1, :]
-
- pad_size = (input_ids.shape[0], 1)
- padding = torch.ones(pad_size, dtype=input_ids.dtype, device=torch_device) * config.pad_token_id
- padded_input_ids = torch.cat((padding, input_ids), dim=1)
- padded_attention_mask = torch.cat((torch.zeros_like(padding), attention_mask), dim=1)
- model_kwargs = {"input_ids": padded_input_ids, "attention_mask": padded_attention_mask}
- if "position_ids" in signature:
- position_ids = torch.cumsum(padded_attention_mask, dim=-1) - 1
- position_ids.masked_fill_(padded_attention_mask == 0, 1)
- model_kwargs["position_ids"] = position_ids
- next_logits_with_padding = model(**model_kwargs).logits[:, -1, :]
- if not torch.allclose(next_logits_wo_padding, next_logits_with_padding, atol=1e-7):
- no_failures = False
- break
-
- self.assertTrue(no_failures)
+ # Without padding
+ model_kwargs = _prepare_model_kwargs(input_ids, attention_mask, signature)
+ next_logits_wo_padding = model(**model_kwargs).logits[:, -1, :]
+
+ # With left-padding (length 32)
+ pad_size = (input_ids.shape[0], 32)
+ padding = torch.ones(pad_size, dtype=input_ids.dtype, device=torch_device) * config.pad_token_id
+ padded_input_ids = torch.cat((padding, input_ids), dim=1)
+ padded_attention_mask = torch.cat((torch.zeros_like(padding), attention_mask), dim=1)
+ model_kwargs = _prepare_model_kwargs(padded_input_ids, padded_attention_mask, signature)
+ next_logits_with_padding = model(**model_kwargs).logits[:, -1, :]
+
+ # They should result in very similar logits
+ self.assertTrue(torch.allclose(next_logits_wo_padding, next_logits_with_padding, atol=1e-5))
def test_past_key_values_format(self):
# Test that the KV cache is formatted correctly. Exceptions need to explicitly overwrite this test. Having a
diff --git a/tests/models/bart/test_modeling_bart.py b/tests/models/bart/test_modeling_bart.py
index 5e79de87c4c0a2..38049337357685 100644
--- a/tests/models/bart/test_modeling_bart.py
+++ b/tests/models/bart/test_modeling_bart.py
@@ -1527,7 +1527,3 @@ def test_retain_grad_hidden_states_attentions(self):
def test_save_load_fast_init_from_base(self):
pass
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/bigbird_pegasus/test_modeling_bigbird_pegasus.py b/tests/models/bigbird_pegasus/test_modeling_bigbird_pegasus.py
index 90b71a7b82922b..96e7ce639f9c44 100644
--- a/tests/models/bigbird_pegasus/test_modeling_bigbird_pegasus.py
+++ b/tests/models/bigbird_pegasus/test_modeling_bigbird_pegasus.py
@@ -818,7 +818,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/blenderbot/test_modeling_blenderbot.py b/tests/models/blenderbot/test_modeling_blenderbot.py
index da7d8cc12480b9..64ae71b24b9bc5 100644
--- a/tests/models/blenderbot/test_modeling_blenderbot.py
+++ b/tests/models/blenderbot/test_modeling_blenderbot.py
@@ -569,7 +569,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/blenderbot_small/test_modeling_blenderbot_small.py b/tests/models/blenderbot_small/test_modeling_blenderbot_small.py
index 7bb45bdabd878d..39e953490fae8d 100644
--- a/tests/models/blenderbot_small/test_modeling_blenderbot_small.py
+++ b/tests/models/blenderbot_small/test_modeling_blenderbot_small.py
@@ -568,7 +568,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/ctrl/test_modeling_ctrl.py b/tests/models/ctrl/test_modeling_ctrl.py
index 13b35926117d56..71dcd02ed59f7e 100644
--- a/tests/models/ctrl/test_modeling_ctrl.py
+++ b/tests/models/ctrl/test_modeling_ctrl.py
@@ -249,10 +249,6 @@ def test_model_from_pretrained(self):
model = CTRLModel.from_pretrained(model_name)
self.assertIsNotNone(model)
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
-
@require_torch
class CTRLModelLanguageGenerationTest(unittest.TestCase):
diff --git a/tests/models/marian/test_modeling_marian.py b/tests/models/marian/test_modeling_marian.py
index 53a67c20459f58..593ef8e3405e38 100644
--- a/tests/models/marian/test_modeling_marian.py
+++ b/tests/models/marian/test_modeling_marian.py
@@ -895,7 +895,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/mbart/test_modeling_mbart.py b/tests/models/mbart/test_modeling_mbart.py
index 3cabf7d999aa88..93294d6568b2a2 100644
--- a/tests/models/mbart/test_modeling_mbart.py
+++ b/tests/models/mbart/test_modeling_mbart.py
@@ -736,7 +736,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/mvp/test_modeling_mvp.py b/tests/models/mvp/test_modeling_mvp.py
index 3e0a48023718cf..225ea4a78646e1 100644
--- a/tests/models/mvp/test_modeling_mvp.py
+++ b/tests/models/mvp/test_modeling_mvp.py
@@ -823,7 +823,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/pegasus/test_modeling_pegasus.py b/tests/models/pegasus/test_modeling_pegasus.py
index fbf79650f45e98..dbc7c9de7bafbd 100644
--- a/tests/models/pegasus/test_modeling_pegasus.py
+++ b/tests/models/pegasus/test_modeling_pegasus.py
@@ -596,7 +596,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/plbart/test_modeling_plbart.py b/tests/models/plbart/test_modeling_plbart.py
index 0d5274b0181949..998bd6c84f0481 100644
--- a/tests/models/plbart/test_modeling_plbart.py
+++ b/tests/models/plbart/test_modeling_plbart.py
@@ -669,7 +669,3 @@ def test_decoder_model_attn_mask_past(self):
def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
diff --git a/tests/models/prophetnet/test_modeling_prophetnet.py b/tests/models/prophetnet/test_modeling_prophetnet.py
index eee03134d34ea5..2f43a093200abb 100644
--- a/tests/models/prophetnet/test_modeling_prophetnet.py
+++ b/tests/models/prophetnet/test_modeling_prophetnet.py
@@ -1146,10 +1146,6 @@ def test_retain_grad_hidden_states_attentions(self):
# decoder cannot keep gradients
return
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
-
@require_torch
class ProphetNetStandaloneEncoderModelTest(ModelTesterMixin, unittest.TestCase):
diff --git a/tests/models/whisper/test_modeling_whisper.py b/tests/models/whisper/test_modeling_whisper.py
index dc24a5bc34794b..db7c3ae82a11e0 100644
--- a/tests/models/whisper/test_modeling_whisper.py
+++ b/tests/models/whisper/test_modeling_whisper.py
@@ -3230,7 +3230,3 @@ def test_retain_grad_hidden_states_attentions(self):
@unittest.skip("The model doesn't support fast init from base")
def test_save_load_fast_init_from_base(self):
pass
-
- @unittest.skip("The model doesn't support left padding") # and it's not used enough to be worth fixing :)
- def test_left_padding_compatibility(self):
- pass
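A worked example (separate from the patch) of the position-id trick in `_prepare_model_kwargs` above: real tokens keep 0-based positions derived from the cumulative attention mask, while left-padding slots are overwritten with an arbitrary valid index:

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1]])           # two left-padding tokens
position_ids = torch.cumsum(attention_mask, dim=-1) - 1    # tensor([[-1, -1, 0, 1, 2]])
position_ids.masked_fill_(attention_mask == 0, 1)          # tensor([[ 1,  1, 0, 1, 2]])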
From 8e589c83b607ede06d2c935cab2f2ead7bac17c4 Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Fri, 8 Mar 2024 18:13:54 +0800
Subject: [PATCH 118/549] [tests] add the missing `require_sacremoses`
decorator (#29504)
* add sacremoses check
* fix style
* for FlaubertTokenizer
* HerbertTokenizer fix
* add typeHint
* Update src/transformers/testing_utils.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* make less skipped
* make quality
* remove import
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---
src/transformers/__init__.py | 2 ++
src/transformers/testing_utils.py | 8 ++++++++
tests/models/biogpt/test_modeling_biogpt.py | 4 ++--
tests/models/biogpt/test_tokenization_biogpt.py | 3 ++-
tests/models/flaubert/test_modeling_flaubert.py | 4 ++--
tests/models/herbert/test_tokenization_herbert.py | 3 ++-
6 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index da650cc58ff99b..72bfb4b465c530 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -1078,6 +1078,7 @@
"is_psutil_available",
"is_py3nvml_available",
"is_pyctcdecode_available",
+ "is_sacremoses_available",
"is_safetensors_available",
"is_scipy_available",
"is_sentencepiece_available",
@@ -5882,6 +5883,7 @@
is_psutil_available,
is_py3nvml_available,
is_pyctcdecode_available,
+ is_sacremoses_available,
is_safetensors_available,
is_scipy_available,
is_sentencepiece_available,
diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py
index adcadfc379251e..b333678427a81e 100644
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -90,6 +90,7 @@
is_pytest_available,
is_pytorch_quantization_available,
is_rjieba_available,
+ is_sacremoses_available,
is_safetensors_available,
is_scipy_available,
is_sentencepiece_available,
@@ -562,6 +563,13 @@ def require_sentencepiece(test_case):
return unittest.skipUnless(is_sentencepiece_available(), "test requires SentencePiece")(test_case)
+def require_sacremoses(test_case):
+ """
+ Decorator marking a test that requires Sacremoses. These tests are skipped when Sacremoses isn't installed.
+ """
+ return unittest.skipUnless(is_sacremoses_available(), "test requires Sacremoses")(test_case)
+
+
def require_seqio(test_case):
"""
Decorator marking a test that requires SentencePiece. These tests are skipped when SentencePiece isn't installed.
diff --git a/tests/models/biogpt/test_modeling_biogpt.py b/tests/models/biogpt/test_modeling_biogpt.py
index b7db0bbe28a7b7..b74cbdcb0f5652 100644
--- a/tests/models/biogpt/test_modeling_biogpt.py
+++ b/tests/models/biogpt/test_modeling_biogpt.py
@@ -17,7 +17,7 @@
import math
import unittest
-from transformers import BioGptConfig, is_torch_available
+from transformers import BioGptConfig, is_sacremoses_available, is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from ...generation.test_utils import GenerationTesterMixin
@@ -294,7 +294,7 @@ class BioGptModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMix
"token-classification": BioGptForTokenClassification,
"zero-shot": BioGptForSequenceClassification,
}
- if is_torch_available()
+ if is_torch_available() and is_sacremoses_available()
else {}
)
test_pruning = False
diff --git a/tests/models/biogpt/test_tokenization_biogpt.py b/tests/models/biogpt/test_tokenization_biogpt.py
index 8ec8a248bb6dfe..c350f5de0ea555 100644
--- a/tests/models/biogpt/test_tokenization_biogpt.py
+++ b/tests/models/biogpt/test_tokenization_biogpt.py
@@ -19,11 +19,12 @@
import unittest
from transformers.models.biogpt.tokenization_biogpt import VOCAB_FILES_NAMES, BioGptTokenizer
-from transformers.testing_utils import slow
+from transformers.testing_utils import require_sacremoses, slow
from ...test_tokenization_common import TokenizerTesterMixin
+@require_sacremoses
class BioGptTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = BioGptTokenizer
test_rust_tokenizer = False
diff --git a/tests/models/flaubert/test_modeling_flaubert.py b/tests/models/flaubert/test_modeling_flaubert.py
index f21695e39c5626..fc275bdd8a02ad 100644
--- a/tests/models/flaubert/test_modeling_flaubert.py
+++ b/tests/models/flaubert/test_modeling_flaubert.py
@@ -16,7 +16,7 @@
import tempfile
import unittest
-from transformers import FlaubertConfig, is_torch_available
+from transformers import FlaubertConfig, is_sacremoses_available, is_torch_available
from transformers.testing_utils import require_torch, require_torch_accelerator, slow, torch_device
from ...test_configuration_common import ConfigTester
@@ -386,7 +386,7 @@ class FlaubertModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase
"token-classification": FlaubertForTokenClassification,
"zero-shot": FlaubertForSequenceClassification,
}
- if is_torch_available()
+ if is_torch_available() and is_sacremoses_available()
else {}
)
diff --git a/tests/models/herbert/test_tokenization_herbert.py b/tests/models/herbert/test_tokenization_herbert.py
index c7e1a7ce7fab96..d035348b739f42 100644
--- a/tests/models/herbert/test_tokenization_herbert.py
+++ b/tests/models/herbert/test_tokenization_herbert.py
@@ -20,11 +20,12 @@
from transformers import HerbertTokenizer, HerbertTokenizerFast
from transformers.models.herbert.tokenization_herbert import VOCAB_FILES_NAMES
-from transformers.testing_utils import get_tests_dir, require_tokenizers, slow
+from transformers.testing_utils import get_tests_dir, require_sacremoses, require_tokenizers, slow
from ...test_tokenization_common import TokenizerTesterMixin
+@require_sacremoses
@require_tokenizers
class HerbertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = HerbertTokenizer
From 8ee1d472033c3443d1f212e66122cc7f5ef6aa20 Mon Sep 17 00:00:00 2001
From: "Wang, Yi"
Date: Fri, 8 Mar 2024 19:11:10 +0800
Subject: [PATCH 119/549] fix image-to-text batch incorrect output issue
(#29342)
* fix image-to-text batch incorrect output issue
Signed-off-by: Wang, Yi A
* add ci test
Signed-off-by: Wang, Yi
* update ci test
Signed-off-by: Wang, Yi
---------
Signed-off-by: Wang, Yi A
Signed-off-by: Wang, Yi
---
src/transformers/pipelines/pt_utils.py | 2 +-
.../pipelines/test_pipelines_image_to_text.py | 29 +++++++++++++++++++
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/src/transformers/pipelines/pt_utils.py b/src/transformers/pipelines/pt_utils.py
index 4a95d050ec8c3c..c39f906f641ea6 100644
--- a/src/transformers/pipelines/pt_utils.py
+++ b/src/transformers/pipelines/pt_utils.py
@@ -73,7 +73,7 @@ def loader_batch_item(self):
"""
if isinstance(self._loader_batch_data, torch.Tensor):
# Batch data is simple tensor, just fetch the slice
- result = self._loader_batch_data[self._loader_batch_index]
+ result = self._loader_batch_data[self._loader_batch_index].unsqueeze(0)
else:
# Batch data is assumed to be BaseModelOutput (or dict)
loader_batched = {}
diff --git a/tests/pipelines/test_pipelines_image_to_text.py b/tests/pipelines/test_pipelines_image_to_text.py
index 21b297b1e1586f..e2d59968ebf4a6 100644
--- a/tests/pipelines/test_pipelines_image_to_text.py
+++ b/tests/pipelines/test_pipelines_image_to_text.py
@@ -142,6 +142,35 @@ def test_small_model_pt_conditional(self):
outputs = pipe(image, prompt=prompt)
self.assertTrue(outputs[0]["generated_text"].startswith(prompt))
+ @require_torch
+ def test_consistent_batching_behaviour(self):
+ pipe = pipeline("image-to-text", model="hf-internal-testing/tiny-random-BlipForConditionalGeneration")
+ image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
+ prompt = "a photo of"
+
+ outputs = pipe([image, image], prompt=prompt)
+ self.assertTrue(outputs[0][0]["generated_text"].startswith(prompt))
+ self.assertTrue(outputs[1][0]["generated_text"].startswith(prompt))
+
+ outputs = pipe([image, image], prompt=prompt, batch_size=2)
+ self.assertTrue(outputs[0][0]["generated_text"].startswith(prompt))
+ self.assertTrue(outputs[1][0]["generated_text"].startswith(prompt))
+
+ from torch.utils.data import Dataset
+
+ class MyDataset(Dataset):
+ def __len__(self):
+ return 5
+
+ def __getitem__(self, i):
+ return "./tests/fixtures/tests_samples/COCO/000000039769.png"
+
+ dataset = MyDataset()
+ for batch_size in (1, 2, 4):
+ outputs = pipe(dataset, prompt=prompt, batch_size=batch_size if batch_size > 1 else None)
+ self.assertTrue(list(outputs)[0][0]["generated_text"].startswith(prompt))
+ self.assertTrue(list(outputs)[1][0]["generated_text"].startswith(prompt))
+
@slow
@require_torch
def test_large_model_pt(self):
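A minimal sketch (separate from the patch) of the shape issue the added `unsqueeze(0)` addresses: indexing a batched tensor drops the batch dimension, so each per-item slice is restored to batch size 1 before post-processing:

import torch

loader_batch_data = torch.randn(2, 5, 7)         # (batch, seq, hidden) from a batched forward pass
item = loader_batch_data[0]                      # shape (5, 7)  -- batch dimension lost
item = loader_batch_data[0].unsqueeze(0)         # shape (1, 5, 7) -- keeps a batch axis of 1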
From 14536c339a7f28190bde70a39d0c6815d65fd9bc Mon Sep 17 00:00:00 2001
From: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Date: Fri, 8 Mar 2024 12:20:31 +0100
Subject: [PATCH 120/549] Typo fix in error message (#29535)
---
src/transformers/tokenization_utils_base.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
index 054146ad637481..7e5c21a1bdfb2d 100644
--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -1594,7 +1594,7 @@ def __init__(self, **kwargs):
self.truncation_side = kwargs.pop("truncation_side", self.truncation_side)
if self.truncation_side not in ["right", "left"]:
raise ValueError(
- f"Padding side should be selected between 'right' and 'left', current value: {self.truncation_side}"
+ f"Truncation side should be selected between 'right' and 'left', current value: {self.truncation_side}"
)
self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)
From 1ea3ad1aeca1ffc1726d8a4fab2e4393a72be56f Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Fri, 8 Mar 2024 19:21:43 +0800
Subject: [PATCH 121/549] [tests] use `torch_device` instead of `auto` for
model testing (#29531)
* use torch_device
* skip for XPU
* Update tests/generation/test_utils.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---
tests/generation/test_utils.py | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tests/generation/test_utils.py b/tests/generation/test_utils.py
index 425db5ecdcf417..bd3bbe7c60c470 100644
--- a/tests/generation/test_utils.py
+++ b/tests/generation/test_utils.py
@@ -1073,6 +1073,9 @@ def test_beam_search_generate_dict_outputs_use_cache(self):
@require_torch_multi_accelerator
def test_model_parallel_beam_search(self):
for model_class in self.all_generative_model_classes:
+ if "xpu" in torch_device:
+ return unittest.skip("device_map='auto' does not work with XPU devices")
+
if model_class._no_split_modules is None:
continue
From f386c51ad9034bae4d3ce6b5820b97796ff7f8a1 Mon Sep 17 00:00:00 2001
From: liangjs <761232680@qq.com>
Date: Fri, 8 Mar 2024 19:58:25 +0800
Subject: [PATCH 122/549] StableLM: Fix dropout argument type error (#29236)
* fix stablelm dropout argument type error
* fix docs of _flash_attention_forward
* fix all docs of _flash_attention_forward
* fix docs of _flash_attention_forward in starcoder2
---------
Co-authored-by: oliang
---
src/transformers/models/bark/modeling_bark.py | 2 +-
src/transformers/models/bart/modeling_bart.py | 2 +-
src/transformers/models/distilbert/modeling_distilbert.py | 2 +-
src/transformers/models/falcon/modeling_falcon.py | 2 +-
src/transformers/models/gemma/modeling_gemma.py | 2 +-
src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py | 2 +-
src/transformers/models/gpt_neo/modeling_gpt_neo.py | 2 +-
src/transformers/models/gpt_neox/modeling_gpt_neox.py | 2 +-
src/transformers/models/llama/modeling_llama.py | 2 +-
src/transformers/models/mbart/modeling_mbart.py | 2 +-
src/transformers/models/mistral/modeling_mistral.py | 2 +-
src/transformers/models/mixtral/modeling_mixtral.py | 2 +-
src/transformers/models/opt/modeling_opt.py | 2 +-
src/transformers/models/phi/modeling_phi.py | 2 +-
src/transformers/models/qwen2/modeling_qwen2.py | 2 +-
src/transformers/models/stablelm/modeling_stablelm.py | 4 ++--
src/transformers/models/starcoder2/modeling_starcoder2.py | 2 +-
src/transformers/models/whisper/modeling_whisper.py | 2 +-
18 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/src/transformers/models/bark/modeling_bark.py b/src/transformers/models/bark/modeling_bark.py
index 57cccd43127fa8..c4da5a2fce0032 100644
--- a/src/transformers/models/bark/modeling_bark.py
+++ b/src/transformers/models/bark/modeling_bark.py
@@ -306,7 +306,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/bart/modeling_bart.py b/src/transformers/models/bart/modeling_bart.py
index ca5f724b08a917..1f90b82a104d42 100755
--- a/src/transformers/models/bart/modeling_bart.py
+++ b/src/transformers/models/bart/modeling_bart.py
@@ -430,7 +430,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/distilbert/modeling_distilbert.py b/src/transformers/models/distilbert/modeling_distilbert.py
index 481e4c427119c1..023b4dc13ade1c 100755
--- a/src/transformers/models/distilbert/modeling_distilbert.py
+++ b/src/transformers/models/distilbert/modeling_distilbert.py
@@ -370,7 +370,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/falcon/modeling_falcon.py b/src/transformers/models/falcon/modeling_falcon.py
index 2dde8d1cac67f6..d2c9125ddcffde 100644
--- a/src/transformers/models/falcon/modeling_falcon.py
+++ b/src/transformers/models/falcon/modeling_falcon.py
@@ -657,7 +657,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/gemma/modeling_gemma.py b/src/transformers/models/gemma/modeling_gemma.py
index 479f46825c149a..cbb074fcc1d2f2 100644
--- a/src/transformers/models/gemma/modeling_gemma.py
+++ b/src/transformers/models/gemma/modeling_gemma.py
@@ -410,7 +410,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py b/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
index 2ef46eaa9f7322..25938342c2efb2 100644
--- a/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
+++ b/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
@@ -425,7 +425,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/gpt_neo/modeling_gpt_neo.py b/src/transformers/models/gpt_neo/modeling_gpt_neo.py
index 03e209f9d170e4..5e1ca2f1915fd9 100755
--- a/src/transformers/models/gpt_neo/modeling_gpt_neo.py
+++ b/src/transformers/models/gpt_neo/modeling_gpt_neo.py
@@ -407,7 +407,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/gpt_neox/modeling_gpt_neox.py b/src/transformers/models/gpt_neox/modeling_gpt_neox.py
index 882b4fc9ecc322..2ab552f118c120 100755
--- a/src/transformers/models/gpt_neox/modeling_gpt_neox.py
+++ b/src/transformers/models/gpt_neox/modeling_gpt_neox.py
@@ -439,7 +439,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 262db548a1a34e..3752a92c836743 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -518,7 +518,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/mbart/modeling_mbart.py b/src/transformers/models/mbart/modeling_mbart.py
index 2fc1ef12e78069..2f1d031d1a6d9c 100755
--- a/src/transformers/models/mbart/modeling_mbart.py
+++ b/src/transformers/models/mbart/modeling_mbart.py
@@ -420,7 +420,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/mistral/modeling_mistral.py b/src/transformers/models/mistral/modeling_mistral.py
index fbba155f19d57c..e219271e8ee5c3 100644
--- a/src/transformers/models/mistral/modeling_mistral.py
+++ b/src/transformers/models/mistral/modeling_mistral.py
@@ -496,7 +496,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/mixtral/modeling_mixtral.py b/src/transformers/models/mixtral/modeling_mixtral.py
index 12733dfdd90497..4c4c44bd2297d8 100644
--- a/src/transformers/models/mixtral/modeling_mixtral.py
+++ b/src/transformers/models/mixtral/modeling_mixtral.py
@@ -574,7 +574,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/opt/modeling_opt.py b/src/transformers/models/opt/modeling_opt.py
index 3af18947fac93f..a350c9019d7af0 100644
--- a/src/transformers/models/opt/modeling_opt.py
+++ b/src/transformers/models/opt/modeling_opt.py
@@ -394,7 +394,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/phi/modeling_phi.py b/src/transformers/models/phi/modeling_phi.py
index 9704d4ccf520ad..c3cb119f0aa043 100644
--- a/src/transformers/models/phi/modeling_phi.py
+++ b/src/transformers/models/phi/modeling_phi.py
@@ -540,7 +540,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/qwen2/modeling_qwen2.py b/src/transformers/models/qwen2/modeling_qwen2.py
index da0c9b8567752a..bfba4a45324818 100644
--- a/src/transformers/models/qwen2/modeling_qwen2.py
+++ b/src/transformers/models/qwen2/modeling_qwen2.py
@@ -502,7 +502,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/stablelm/modeling_stablelm.py b/src/transformers/models/stablelm/modeling_stablelm.py
index e7ee3b1462b2f9..76aca7bae91d18 100755
--- a/src/transformers/models/stablelm/modeling_stablelm.py
+++ b/src/transformers/models/stablelm/modeling_stablelm.py
@@ -549,7 +549,7 @@ def forward(
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
- dropout_rate = self.attention_dropout if self.training else 0.0
+ dropout_rate = self.attention_dropout.p if self.training else 0.0
attn_output = self._flash_attention_forward(
query_states,
@@ -586,7 +586,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/starcoder2/modeling_starcoder2.py b/src/transformers/models/starcoder2/modeling_starcoder2.py
index ac0c8fac9c007c..85a76f87b8d6e5 100644
--- a/src/transformers/models/starcoder2/modeling_starcoder2.py
+++ b/src/transformers/models/starcoder2/modeling_starcoder2.py
@@ -481,7 +481,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
diff --git a/src/transformers/models/whisper/modeling_whisper.py b/src/transformers/models/whisper/modeling_whisper.py
index 94c5758236741c..45f2d9fc5ccca6 100644
--- a/src/transformers/models/whisper/modeling_whisper.py
+++ b/src/transformers/models/whisper/modeling_whisper.py
@@ -536,7 +536,7 @@ def _flash_attention_forward(
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
- dropout (`int`, *optional*):
+ dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
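
As a quick reference for the StableLM change above, a small self-contained sketch (not the modeling code itself) of why `.p` is needed: StableLM stores its attention dropout as an `nn.Dropout` module, while the flash-attention path expects a plain float probability.

```python
import torch.nn as nn

# StableLM-style attribute: the dropout is a module rather than a float
attention_dropout = nn.Dropout(0.1)
training = True  # stand-in for `self.training`

# The flash-attention path needs the float probability stored on the module, hence `.p`
dropout_rate = attention_dropout.p if training else 0.0
print(type(attention_dropout).__name__, dropout_rate)  # Dropout 0.1
```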
From 608fa5496cdb3199c8c12523f01cdb73fe1765b4 Mon Sep 17 00:00:00 2001
From: Jonatan Kłosko
Date: Fri, 8 Mar 2024 13:53:17 +0100
Subject: [PATCH 123/549] Make sliding window size inclusive in eager attention
(#29519)
* Make sliding window size inclusive in eager attention
* Fix tests
---
src/transformers/modeling_attn_mask_utils.py | 6 +++---
tests/test_modeling_utils.py | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/transformers/modeling_attn_mask_utils.py b/src/transformers/modeling_attn_mask_utils.py
index faae0d763f4e59..8ad68f39db9134 100755
--- a/src/transformers/modeling_attn_mask_utils.py
+++ b/src/transformers/modeling_attn_mask_utils.py
@@ -164,10 +164,10 @@ def _make_causal_mask(
# add lower triangular sliding window mask if necessary
if sliding_window is not None:
- diagonal = past_key_values_length - sliding_window + 1
+ diagonal = past_key_values_length - sliding_window - 1
- context_mask = 1 - torch.triu(torch.ones_like(mask, dtype=torch.int), diagonal=diagonal)
- mask.masked_fill_(context_mask.bool(), torch.finfo(dtype).min)
+ context_mask = torch.tril(torch.ones_like(mask, dtype=torch.bool), diagonal=diagonal)
+ mask.masked_fill_(context_mask, torch.finfo(dtype).min)
return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
diff --git a/tests/test_modeling_utils.py b/tests/test_modeling_utils.py
index d0db5031e8b7a0..87b933425fb751 100755
--- a/tests/test_modeling_utils.py
+++ b/tests/test_modeling_utils.py
@@ -1673,7 +1673,7 @@ def check_to_causal(self, mask_converter, q_len, kv_len, bsz=3):
def compute_num_context_mask(self, kv_len, context, q_len):
# This function computes the # of attention tokens that are added for
# the sliding window
- c_mask_len = kv_len - context
+ c_mask_len = kv_len - context - 1
num_mask_triangle = c_mask_len * (c_mask_len + 1) // 2
cut_mask_len = max(c_mask_len - q_len, 0)
num_cut_mask = cut_mask_len * (cut_mask_len + 1) // 2
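
To see what "inclusive" means here, a small standalone sketch (toy sizes, not the library helper itself) that rebuilds the mask the way the patched `_make_causal_mask` does; with `sliding_window=2` every query attends to itself plus the two previous tokens, matching the flash-attention kernels:

```python
import torch

tgt_len, past_key_values_length, sliding_window = 6, 0, 2
dtype = torch.float32

# Causal part: position i may look at positions j <= i
mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min)
mask_cond = torch.arange(tgt_len)
mask.masked_fill_(mask_cond < (mask_cond + 1).view(tgt_len, 1), 0)

# Sliding-window part, as in the patched code: mask everything at or below the diagonal
diagonal = past_key_values_length - sliding_window - 1
context_mask = torch.tril(torch.ones_like(mask, dtype=torch.bool), diagonal=diagonal)
mask.masked_fill_(context_mask, torch.finfo(dtype).min)

# Number of visible positions per query: self + `sliding_window` previous tokens
print((mask == 0).sum(dim=-1))  # tensor([1, 2, 3, 3, 3, 3])
```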
From 697f05bab39ce067a7984ed98da2743898352e47 Mon Sep 17 00:00:00 2001
From: Yun Dai
Date: Fri, 8 Mar 2024 05:36:30 -0800
Subject: [PATCH 124/549] fix typos in FSDP config parsing logic in
`TrainingArguments` (#29189)
fix FSDP config
---
src/transformers/training_args.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/transformers/training_args.py b/src/transformers/training_args.py
index 5baa3e1b51f366..884ea8ad6fb8c8 100644
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -1732,9 +1732,9 @@ def __post_init__(self):
os.environ[f"{prefix}TRANSFORMER_CLS_TO_WRAP"] = ",".join(
self.fsdp_config["transformer_layer_cls_to_wrap"]
)
- prefetch_policy = self.fsdp_config.get("fsdp_backward_prefetch", "NO_PREFETCH")
+ prefetch_policy = self.fsdp_config.get("backward_prefetch", "NO_PREFETCH")
os.environ[f"{prefix}BACKWARD_PREFETCH"] = prefetch_policy.upper()
- os.environ[f"{prefix}FORWARD_PREFETCH"] = self.fsdp_config.get("forward_prefect", "false")
+ os.environ[f"{prefix}FORWARD_PREFETCH"] = self.fsdp_config.get("forward_prefetch", "false")
os.environ[f"{prefix}SYNC_MODULE_STATES"] = self.fsdp_config.get("sync_module_states", "true")
os.environ[f"{prefix}USE_ORIG_PARAMS"] = self.fsdp_config.get("use_orig_params", "true")
From 1ba89dc2d2b5423ccbb2c52ac5e3b4282e54794b Mon Sep 17 00:00:00 2001
From: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>
Date: Fri, 8 Mar 2024 14:31:05 +0000
Subject: [PATCH 125/549] Fix WhisperNoSpeechDetection when input is full
silence (#29065)
fix total silence input with no_speech_threshold
---
src/transformers/models/whisper/generation_whisper.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/src/transformers/models/whisper/generation_whisper.py b/src/transformers/models/whisper/generation_whisper.py
index 5b5957d53478ec..9b4b2f81b65a7a 100644
--- a/src/transformers/models/whisper/generation_whisper.py
+++ b/src/transformers/models/whisper/generation_whisper.py
@@ -141,8 +141,10 @@ def _pad_to_max_length(current_segments, pad_token_id, padding="right", bos_toke
sequences.append(sequence)
max_total_length = max(max_total_length, len(sequences[-1]))
- else:
+ elif bos_token_tensor is not None:
sequences.append(bos_token_tensor)
+ else:
+ sequences.append(torch.tensor([]))
for i in range(len(current_segments)):
pad_length = max_total_length - len(sequences[i])
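
A minimal sketch (assumed pad token id and right padding, outside the Whisper code) of why the empty tensor matters: a fully silent input contributes no generated tokens and no prompt, but it still needs an entry in `sequences` so the padding loop produces an aligned row for it.

```python
import torch

pad_token_id = 50257  # hypothetical id, for illustration only
sequences = [
    torch.tensor([50258, 440, 7], dtype=torch.long),  # a segment with generated tokens
    torch.tensor([], dtype=torch.long),               # a fully silent input: nothing generated, no prompt
]

max_total_length = max(len(seq) for seq in sequences)
padded = [
    torch.nn.functional.pad(seq, (0, max_total_length - len(seq)), value=pad_token_id)
    for seq in sequences
]
print(torch.stack(padded))
# tensor([[50258,   440,     7],
#         [50257, 50257, 50257]])
```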
From 3f6973db06d0149ee94a71a8f7cf4c374c675cd4 Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Fri, 8 Mar 2024 23:52:25 +0800
Subject: [PATCH 126/549] [tests] use the correct `n_gpu` in
`TrainerIntegrationTest::test_train_and_eval_dataloaders` for XPU (#29307)
* fix n_gpu
* fix style
---
tests/trainer/test_trainer.py | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py
index 98f3c96b4ea890..bd704bc8b59ee3 100644
--- a/tests/trainer/test_trainer.py
+++ b/tests/trainer/test_trainer.py
@@ -1029,7 +1029,10 @@ def is_any_loss_nan_or_inf(log_history):
self.assertFalse(is_any_loss_nan_or_inf(log_history_filter))
def test_train_and_eval_dataloaders(self):
- n_gpu = max(1, backend_device_count(torch_device))
+ if torch_device == "cuda":
+ n_gpu = max(1, backend_device_count(torch_device))
+ else:
+ n_gpu = 1
trainer = get_regression_trainer(learning_rate=0.1, per_device_train_batch_size=16)
self.assertEqual(trainer.get_train_dataloader().total_batch_size, 16 * n_gpu)
trainer = get_regression_trainer(learning_rate=0.1, per_device_eval_batch_size=16)
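
An illustrative sketch of the expectation the test checks: on CUDA the dataloader's total batch size is the per-device batch size times the visible device count, while for other accelerators the test now assumes a single device.

```python
def expected_total_batch_size(per_device_batch_size: int, torch_device: str, device_count: int) -> int:
    n_gpu = max(1, device_count) if torch_device == "cuda" else 1
    return per_device_batch_size * n_gpu


print(expected_total_batch_size(16, "cuda", 2))  # 32
print(expected_total_batch_size(16, "xpu", 2))   # 16
```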
From 469c13280d77a75be626da4f8e918e9f24e4f80f Mon Sep 17 00:00:00 2001
From: Zach Mueller
Date: Fri, 8 Mar 2024 11:04:18 -0500
Subject: [PATCH 127/549] Fix eval thread fork bomb (#29538)
* Fix eval thread fork bomb
* Keep eval dl persistent and prepare after so free_memory doesn't destroy it
* Add note
* Quality
---
src/transformers/trainer.py | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 574363421234b3..f32be25e5326b4 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -888,6 +888,11 @@ def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoa
"""
if eval_dataset is None and self.eval_dataset is None:
raise ValueError("Trainer: evaluation requires an eval_dataset.")
+
+ # If we have persistent workers, don't do a fork bomb especially as eval datasets
+ # don't change during training
+ if hasattr(self, "_eval_dataloader") and self.args.dataloader_persistent_workers:
+ return self.accelerator.prepare(self._eval_dataloader)
eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
data_collator = self.data_collator
@@ -909,7 +914,13 @@ def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoa
dataloader_params["drop_last"] = self.args.dataloader_drop_last
dataloader_params["prefetch_factor"] = self.args.dataloader_prefetch_factor
- return self.accelerator.prepare(DataLoader(eval_dataset, **dataloader_params))
+ # accelerator.free_memory() will destroy the references, so
+ # we need to store the non-prepared version
+ eval_dataloader = DataLoader(eval_dataset, **dataloader_params)
+ if self.args.dataloader_persistent_workers:
+ self._eval_dataloader = eval_dataloader
+
+ return self.accelerator.prepare(eval_dataloader)
def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:
"""
From 0290ec19c901adc0f1230ebdccad11c40af026f5 Mon Sep 17 00:00:00 2001
From: Winston H <56998716+winstxnhdw@users.noreply.github.com>
Date: Sat, 9 Mar 2024 01:27:30 +0800
Subject: [PATCH 128/549] feat: use `warning_advice` for tensorflow warning
(#29540)
feat: use `warning_advice` instead of tensorflow warning
---
src/transformers/__init__.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index 72bfb4b465c530..7297c1359e4ce0 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -9084,7 +9084,7 @@
if not is_tf_available() and not is_torch_available() and not is_flax_available():
- logger.warning(
+ logger.warning_advice(
"None of PyTorch, TensorFlow >= 2.0, or Flax have been found. "
"Models won't be available and only tokenizers, configuration "
"and file/data utilities can be used."
From 4f27ee936a861f56f32ea6db138978b274008006 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Mon, 11 Mar 2024 19:46:24 +1100
Subject: [PATCH 129/549] [`Mamba doc`] Post merge updates (#29472)
* post merge update
* nit
* oups
---
docs/source/en/model_doc/mamba.md | 13 +++++--------
src/transformers/models/mamba/modeling_mamba.py | 4 ++--
tests/models/mamba/test_modeling_mamba.py | 14 +++++++-------
3 files changed, 14 insertions(+), 17 deletions(-)
diff --git a/docs/source/en/model_doc/mamba.md b/docs/source/en/model_doc/mamba.md
index 7378f79f94df7f..94eb2e2c2d528d 100644
--- a/docs/source/en/model_doc/mamba.md
+++ b/docs/source/en/model_doc/mamba.md
@@ -44,11 +44,8 @@ The original code can be found [here](https://github.com/state-spaces/mamba).
from transformers import MambaConfig, MambaForCausalLM, AutoTokenizer
import torch
-tokenizer = AutoTokenizer.from_pretrained("ArthurZ/mamba-130m")
-tokenizer.pad_token = tokenizer.eos_token
-
-model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", vocab_size=50280, num_hidden_layers=24, torch_dtype=torch.float32)
-model.config.use_cache = True
+tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
+model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
out = model.generate(input_ids, max_new_tokens=10)
@@ -63,8 +60,8 @@ from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
-model_id = "ArthurZ/mamba-2.8b"
-tokenizer = AutoTokenizer.from_pretrained(model_id, pad_token ="")
+model_id = "state-spaces/mamba-130m-hf"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
dataset = load_dataset("Abirate/english_quotes", split="train")
training_args = TrainingArguments(
@@ -77,7 +74,7 @@ training_args = TrainingArguments(
)
lora_config = LoraConfig(
r=8,
- target_modules="all-linear",
+ target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
task_type="CAUSAL_LM",
bias="none"
)
diff --git a/src/transformers/models/mamba/modeling_mamba.py b/src/transformers/models/mamba/modeling_mamba.py
index 4870c0281fdc34..54d51d31930445 100644
--- a/src/transformers/models/mamba/modeling_mamba.py
+++ b/src/transformers/models/mamba/modeling_mamba.py
@@ -53,7 +53,7 @@
(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)
)
-_CHECKPOINT_FOR_DOC = "ArthurZ/mamba-130m"
+_CHECKPOINT_FOR_DOC = "state-spaces/mamba-130m-hf"
_CONFIG_FOR_DOC = "MambaConfig"
MAMBA_PRETRAINED_MODEL_ARCHIVE_LIST = [] # See all Mamba models at https://huggingface.co/models?filter=mamba
@@ -605,7 +605,7 @@ def set_input_embeddings(self, new_embeddings):
def _update_model_kwargs_for_generation(
self, outputs: ModelOutput, model_kwargs: Dict[str, Any], **kwargs
) -> Dict[str, Any]:
- model_kwargs["cache_params"] = outputs["cache_params"]
+ model_kwargs["cache_params"] = outputs.get("cache_params", None)
return model_kwargs
def prepare_inputs_for_generation(
diff --git a/tests/models/mamba/test_modeling_mamba.py b/tests/models/mamba/test_modeling_mamba.py
index 503ffa0acd07a7..8bd121933b8052 100644
--- a/tests/models/mamba/test_modeling_mamba.py
+++ b/tests/models/mamba/test_modeling_mamba.py
@@ -406,15 +406,15 @@ def recursive_check(tuple_object, dict_object):
@require_torch
class MambaIntegrationTests(unittest.TestCase):
def setUp(self):
- self.model_id = "ArthurZ/mamba-2.8b"
+ self.model_id = "state-spaces/mamba-2.8b-hf"
self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
@parameterized.expand([(torch_device,), ("cpu",)])
def test_simple_generate(self, device):
- tokenizer = AutoTokenizer.from_pretrained("ArthurZ/mamba-130m")
+ tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
tokenizer.pad_token = tokenizer.eos_token
- model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", torch_dtype=torch.float16)
+ model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf", torch_dtype=torch.float16)
model.to(device)
model.config.use_cache = True
input_ids = tokenizer("Hey how are you doing?", return_tensors="pt")["input_ids"].to(device)
@@ -444,7 +444,7 @@ def test_simple_generate_cuda_kernels_tiny(self, device):
expected_output = "Hello my name is John and I am a newbie to the world"
input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
- model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-130m", torch_dtype=torch.float16).to(device)
+ model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf", torch_dtype=torch.float16).to(device)
output = model.generate(input_ids, max_new_tokens=10)
output_sentence = self.tokenizer.decode(output[0].tolist())
@@ -457,7 +457,7 @@ def test_simple_generate_cuda_kernels_small(self, device):
expected_output = "Hello my name is\n\nI am a\n\nI am a"
input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
- model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-790m", torch_dtype=torch.float16).to(device)
+ model = MambaForCausalLM.from_pretrained("state-spaces/mamba-790m-hf", torch_dtype=torch.float16).to(device)
output = model.generate(input_ids, max_new_tokens=10)
output_sentence = self.tokenizer.decode(output[0].tolist())
@@ -470,7 +470,7 @@ def test_simple_generate_cuda_kernels_mid(self, device):
expected_output = "Hello my name is John and I am a\n\nI am a single father of a beautiful daughter. I am a"
input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
- model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-1.4b", torch_dtype=torch.float16).to(device)
+ model = MambaForCausalLM.from_pretrained("state-spaces/mamba-1.4b-hf", torch_dtype=torch.float16).to(device)
output = model.generate(input_ids, max_new_tokens=20)
output_sentence = self.tokenizer.decode(output[0].tolist())
@@ -483,7 +483,7 @@ def test_simple_generate_cuda_kernels_big(self, device):
expected_output = "Hello my name is John and I am a new member of this forum. I am a retired Marine and I am a member of the Marine Corps League. I am a"
input_ids = self.tokenizer("Hello my name is", return_tensors="pt").input_ids.to(device)
- model = MambaForCausalLM.from_pretrained("ArthurZ/mamba-2.8b", torch_dtype=torch.float16).to(device)
+ model = MambaForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf", torch_dtype=torch.float16).to(device)
output = model.generate(input_ids, max_new_tokens=30)
output_sentence = self.tokenizer.decode(output[0].tolist())
From d80c9a349709b3db888b3976b660ef4ea2e29646 Mon Sep 17 00:00:00 2001
From: j-gc <102429286+j-gc@users.noreply.github.com>
Date: Mon, 11 Mar 2024 16:35:16 +0530
Subject: [PATCH 130/549] [`Docs`] fixed minor typo (#29555)
---
docs/source/en/quantization.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/quantization.md b/docs/source/en/quantization.md
index ef5a544bc4de47..1c24ca04a131ef 100644
--- a/docs/source/en/quantization.md
+++ b/docs/source/en/quantization.md
@@ -49,7 +49,7 @@ Starting with version `aqlm 1.0.2`, AQLM supports Parameter-Efficient Fine-Tunin
### AQLM configurations
-AQLM quantization setpus vary mainly on the number of codebooks used as well as codebook sizes in bits. The most popular setups, as well as inference kernels they support are:
+AQLM quantization setups vary mainly on the number of codebooks used as well as codebook sizes in bits. The most popular setups, as well as inference kernels they support are:
| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
From 6d67837f06fb8e3155a5c5b0dd57cd09841bc9f9 Mon Sep 17 00:00:00 2001
From: Tanay Mehta
Date: Mon, 11 Mar 2024 12:14:02 +0000
Subject: [PATCH 131/549] Add Fill-in-the-middle training objective example -
PyTorch (#27464)
* add: initial script to train clm fim
* fix: if training model from scratch, new tokens will be added and embeddings resized
* fix: fixed attention_mask errors when generating FIM data
* fix: file formatted using black
* add: run_fim_no_trainer.py and fixed some comments in run_fim.py
* add: added fim examples to the README.md and ran code fixup
* fix: little bug in both fim training scripts
* fix: remove comment from notebook and added a note on fim related params
* fix: minor typo in README
* add: suggested minor changes to README and run_fim.py
* add: gradient_accumulation_steps and gradient_checkpointing args
* add: improved model embedding resizing
* add: pad_to_multiple_of and attn_implementation params
* add: requested minor changes
* add: deepspeed zero compatibility
* add: resize embeddings layer with zero3 support for fim model initialization
---
examples/pytorch/language-modeling/README.md | 57 +-
examples/pytorch/language-modeling/run_fim.py | 861 +++++++++++++++++
.../language-modeling/run_fim_no_trainer.py | 913 ++++++++++++++++++
3 files changed, 1828 insertions(+), 3 deletions(-)
create mode 100644 examples/pytorch/language-modeling/run_fim.py
create mode 100644 examples/pytorch/language-modeling/run_fim_no_trainer.py
diff --git a/examples/pytorch/language-modeling/README.md b/examples/pytorch/language-modeling/README.md
index 23c0bc2c79aeb4..3a209584acc522 100644
--- a/examples/pytorch/language-modeling/README.md
+++ b/examples/pytorch/language-modeling/README.md
@@ -73,6 +73,57 @@ python run_clm_no_trainer.py \
--output_dir /tmp/test-clm
```
+### GPT-2/GPT and causal language modeling with fill-in-the-middle objective
+
+The following example fine-tunes GPT-2 on WikiText-2, but using the fill-in-the-middle (FIM) training objective. The FIM objective was proposed in [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255). The authors showed that autoregressive language models can learn to infill text after applying a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end.
+
+We're using the raw WikiText-2 (no tokens were replaced before the tokenization). The loss here is that of causal language modeling.
+
+```bash
+python run_fim.py \
+ --model_name_or_path gpt2 \
+ --dataset_name wikitext \
+ --dataset_config_name wikitext-2-raw-v1 \
+ --per_device_train_batch_size 8 \
+ --per_device_eval_batch_size 8 \
+ --fim_rate 0.5 \
+ --fim_spm_rate 0.2 \
+ --do_train \
+ --do_eval \
+ --output_dir /tmp/test-clm
+```
+
+To run on your own training and validation files, use the following command:
+
+```bash
+python run_fim.py \
+ --model_name_or_path gpt2 \
+ --train_file path_to_train_file \
+ --validation_file path_to_validation_file \
+ --per_device_train_batch_size 8 \
+ --per_device_eval_batch_size 8 \
+ --fim_rate 0.5 \
+ --fim_spm_rate 0.2 \
+ --do_train \
+ --do_eval \
+ --output_dir /tmp/test-clm
+```
+
+This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can utilize or adapt the `run_fim_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:
+
+```bash
+python run_fim_no_trainer.py \
+ --model_name_or_path gpt2 \
+ --dataset_name wikitext \
+ --dataset_config_name wikitext-2-raw-v1 \
+ --fim_rate 0.5 \
+ --fim_spm_rate 0.2 \
+ --output_dir /tmp/test-clm
+```
+
+**Note**: Passing a FIM rate of `0.5` means that FIM transformations are applied to each example with a probability of 50%, while a FIM SPM rate of `0.2` means that 20% of the transformed examples use SPM (Suffix-Prefix-Middle) and the remaining 80% use PSM (Prefix-Suffix-Middle) mode of transformation.
+
### RoBERTa/BERT/DistilBERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
@@ -176,11 +227,11 @@ sure all your batches have the same length.
## Streaming
-To use the streaming dataset mode which can be very useful for large datasets, add `--streaming` to the command line. This is currently supported by `run_mlm.py` and `run_clm.py`.
+To use the streaming dataset mode which can be very useful for large datasets, add `--streaming` to the command line. This is supported by `run_mlm.py`, `run_clm.py` and `run_fim.py`. Make sure to adapt the other scripts to your use case by taking inspiration from them.
## Low Cpu Memory Usage
-To use low cpu memory mode which can be very useful for LLM, add `--low_cpu_mem_usage` to the command line. This is currently supported by `run_clm.py`,`run_mlm.py`, `run_plm.py`,`run_mlm_no_trainer.py` and `run_clm_no_trainer.py`.
+To use low cpu memory mode which can be very useful for LLM, add `--low_cpu_mem_usage` to the command line. This is currently supported by `run_clm.py`,`run_mlm.py`, `run_plm.py`, `run_fim.py`, `run_mlm_no_trainer.py`, `run_clm_no_trainer.py` and `run_fim_no_trainer.py`.
## Creating a model on the fly
@@ -192,4 +243,4 @@ python run_clm.py --model_type openai-community/gpt2 --tokenizer_name openai-com
[...]
```
-This feature is only available in `run_clm.py`, `run_plm.py` and `run_mlm.py`.
+This feature is only available in `run_clm.py`, `run_plm.py`, `run_mlm.py` and `run_fim.py`.
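
To make the PSM/SPM modes mentioned in the README text above concrete, here is a toy sketch (made-up sentinel ids and a ten-token document) of the per-example transformation that `run_fim.py` applies:

```python
import numpy as np

prefix_tok_id, middle_tok_id, suffix_tok_id = 50253, 50254, 50255  # illustrative ids
example = list(range(10))  # a toy tokenized document

np_rng = np.random.RandomState(seed=0)
boundaries = sorted(np_rng.randint(low=0, high=len(example) + 1, size=2))
prefix = example[: boundaries[0]]
middle = example[boundaries[0] : boundaries[1]]
suffix = example[boundaries[1] :]

# PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) orderings
psm = [prefix_tok_id] + prefix + [suffix_tok_id] + suffix + [middle_tok_id] + middle
spm = [prefix_tok_id, suffix_tok_id] + suffix + [middle_tok_id] + prefix + middle
print(psm)
print(spm)
```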
diff --git a/examples/pytorch/language-modeling/run_fim.py b/examples/pytorch/language-modeling/run_fim.py
new file mode 100644
index 00000000000000..0fd3833a9f2283
--- /dev/null
+++ b/examples/pytorch/language-modeling/run_fim.py
@@ -0,0 +1,861 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for causal language modeling using
+the fill-in-the-middle (FIM) objective on a text file or a dataset.
+
+Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
+https://huggingface.co/models?filter=text-generation
+"""
+# You should adapt this script on your own causal language modeling task. Pointers for this are left as comments.
+
+import logging
+import math
+import os
+import sys
+from dataclasses import dataclass, field
+from itertools import chain
+from typing import Optional
+
+import datasets
+import evaluate
+import numpy as np
+import torch
+from datasets import load_dataset
+
+import transformers
+from transformers import (
+ CONFIG_MAPPING,
+ MODEL_FOR_CAUSAL_LM_MAPPING,
+ AutoConfig,
+ AutoModelForCausalLM,
+ AutoTokenizer,
+ HfArgumentParser,
+ Trainer,
+ TrainingArguments,
+ default_data_collator,
+ is_deepspeed_zero3_enabled,
+ is_torch_tpu_available,
+ set_seed,
+)
+from transformers.testing_utils import CaptureLogger
+from transformers.trainer_utils import get_last_checkpoint
+from transformers.utils import check_min_version, send_example_telemetry
+from transformers.utils.versions import require_version
+
+
+# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
+check_min_version("4.36.0.dev0")
+
+require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
+
+logger = logging.getLogger(__name__)
+
+
+MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
+MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
+
+
+@dataclass
+class ModelArguments:
+ """
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
+ """
+
+ model_name_or_path: Optional[str] = field(
+ default=None,
+ metadata={
+ "help": (
+ "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
+ )
+ },
+ )
+ model_type: Optional[str] = field(
+ default=None,
+ metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
+ )
+ config_overrides: Optional[str] = field(
+ default=None,
+ metadata={
+ "help": (
+ "Override some existing default config settings when a model is trained from scratch. Example: "
+ "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
+ )
+ },
+ )
+ config_name: Optional[str] = field(
+ default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+ )
+ tokenizer_name: Optional[str] = field(
+ default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+ )
+ cache_dir: Optional[str] = field(
+ default=None,
+ metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
+ )
+ use_fast_tokenizer: bool = field(
+ default=True,
+ metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
+ )
+ model_revision: str = field(
+ default="main",
+ metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
+ )
+ token: str = field(
+ default=None,
+ metadata={
+ "help": (
+ "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
+ "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
+ )
+ },
+ )
+ trust_remote_code: bool = field(
+ default=False,
+ metadata={
+ "help": (
+ "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option"
+ "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
+ "execute code present on the Hub on your local machine."
+ )
+ },
+ )
+ torch_dtype: Optional[str] = field(
+ default=None,
+ metadata={
+ "help": (
+ "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
+ "dtype will be automatically derived from the model's weights."
+ ),
+ "choices": ["auto", "bfloat16", "float16", "float32"],
+ },
+ )
+ low_cpu_mem_usage: bool = field(
+ default=False,
+ metadata={
+ "help": (
+ "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. "
+ "set True will benefit LLM loading time and RAM consumption."
+ )
+ },
+ )
+ pad_to_multiple_of: bool = field(
+ default=False,
+ metadata={
+ "help": (
+ "Whether to pad the embedding layer to a multiple depending on the device. ",
+ "For NVIDIA GPUs, this will be a multiple of 8, for TPUs a multiple of 128.",
+ )
+ },
+ )
+ attn_implementation: Optional[str] = field(
+ default="sdpa", metadata={"help": ("The attention implementation to use. ")}
+ )
+
+ def __post_init__(self):
+ if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None):
+ raise ValueError(
+ "--config_overrides can't be used in combination with --config_name or --model_name_or_path"
+ )
+
+
+@dataclass
+class DataTrainingArguments:
+ """
+ Arguments pertaining to what data we are going to input our model for training and eval.
+ """
+
+ dataset_name: Optional[str] = field(
+ default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
+ )
+ dataset_config_name: Optional[str] = field(
+ default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
+ )
+ train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
+ validation_file: Optional[str] = field(
+ default=None,
+ metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
+ )
+ max_train_samples: Optional[int] = field(
+ default=None,
+ metadata={
+ "help": (
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
+ "value if set."
+ )
+ },
+ )
+ max_eval_samples: Optional[int] = field(
+ default=None,
+ metadata={
+ "help": (
+ "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
+ "value if set."
+ )
+ },
+ )
+ streaming: bool = field(default=False, metadata={"help": "Enable streaming mode"})
+ block_size: Optional[int] = field(
+ default=None,
+ metadata={
+ "help": (
+ "Optional input sequence length after tokenization. "
+ "The training dataset will be truncated in block of this size for training. "
+ "Default to the model max input length for single sentence inputs (take into account special tokens)."
+ )
+ },
+ )
+ fim_rate: Optional[float] = field(
+ default=0.5,
+ metadata={
+ "help": (
+ "Optional probability with which the FIM transformation is applied to the example. "
+ "Default is 0.5. A rate of 1.0 means every example will undergo FIM transformation, "
+ "while a rate of 0.0 means no example will."
+ )
+ },
+ )
+ fim_spm_rate: Optional[float] = field(
+ default=0.5,
+ metadata={
+ "help": (
+ "Within the examples undergoing FIM transformation, this rate determines the probability "
+ "of applying the Sentence Permutation Mode (SPM). "
+ "Default is 0.5. A rate of 1.0 means all FIM transformations will use SPM, "
+ "while a rate of 0.0 means none will."
+ )
+ },
+ )
+ truncate_or_pad: Optional[bool] = field(
+ default=True,
+ metadata={
+ "help": (
+ "Indicates whether the transformed example should be truncated or padded to maintain "
+ "the same length as the original example. "
+ "Default is True. If False, the function will not truncate or pad the examples."
+ )
+ },
+ )
+    fim_prefix_token: Optional[str] = field(
+        default="<fim_prefix>",
+        metadata={"help": ("Fill-in-Middle Prefix token. Defaults to '<fim_prefix>'.")},
+    )
+    fim_middle_token: Optional[str] = field(
+        default="<fim_middle>",
+        metadata={"help": ("Fill-in-Middle Middle token. Defaults to '<fim_middle>'.")},
+    )
+    fim_suffix_token: Optional[str] = field(
+        default="<fim_suffix>",
+        metadata={"help": ("Fill-in-Middle Suffix token. Defaults to '<fim_suffix>'.")},
+    )
+    pad_token: Optional[str] = field(
+        default="<fim_pad>",
+        metadata={
+            "help": (
+                "Fill-in-Middle Pad token. Used only when 'truncate_or_pad' is set to True. "
+                "Defaults to '<fim_pad>'."
+            )
+        },
+    )
+ overwrite_cache: bool = field(
+ default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+ )
+ validation_split_percentage: Optional[int] = field(
+ default=5,
+ metadata={
+ "help": "The percentage of the train set used as validation set in case there's no validation split"
+ },
+ )
+ preprocessing_num_workers: Optional[int] = field(
+ default=None,
+ metadata={"help": "The number of processes to use for the preprocessing."},
+ )
+ keep_linebreaks: bool = field(
+ default=True, metadata={"help": "Whether to keep line breaks when using TXT files or not."}
+ )
+
+ def __post_init__(self):
+ if self.streaming:
+ require_version("datasets>=2.0.0", "The streaming feature requires `datasets>=2.0.0`")
+
+ if self.dataset_name is None and self.train_file is None and self.validation_file is None:
+ raise ValueError("Need either a dataset name or a training/validation file.")
+ else:
+ if self.train_file is not None:
+ extension = self.train_file.split(".")[-1]
+ assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file."
+ if self.validation_file is not None:
+ extension = self.validation_file.split(".")[-1]
+ assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file."
+
+
+def main():
+ # See all possible arguments in src/transformers/training_args.py
+ # or by passing the --help flag to this script.
+ # We now keep distinct sets of args, for a cleaner separation of concerns.
+
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+ # If we pass only one argument to the script and it's the path to a json file,
+ # let's parse it to get our arguments.
+ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
+ else:
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+ # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
+ # information sent is the one passed as arguments along with your Python/PyTorch versions.
+ send_example_telemetry("run_fim", model_args, data_args)
+
+ # Setup logging
+ logging.basicConfig(
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+ datefmt="%m/%d/%Y %H:%M:%S",
+ handlers=[logging.StreamHandler(sys.stdout)],
+ )
+
+ if training_args.should_log:
+ # The default of training_args.log_level is passive, so we set log level at info here to have that default.
+ transformers.utils.logging.set_verbosity_info()
+
+ log_level = training_args.get_process_log_level()
+ logger.setLevel(log_level)
+ datasets.utils.logging.set_verbosity(log_level)
+ transformers.utils.logging.set_verbosity(log_level)
+ transformers.utils.logging.enable_default_handler()
+ transformers.utils.logging.enable_explicit_format()
+
+ # Log on each process the small summary:
+ logger.warning(
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
+ + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}"
+ )
+ logger.info(f"Training/evaluation parameters {training_args}")
+
+ # Detecting last checkpoint.
+ last_checkpoint = None
+ if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+ last_checkpoint = get_last_checkpoint(training_args.output_dir)
+ if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+ raise ValueError(
+ f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+ "Use --overwrite_output_dir to overcome."
+ )
+ elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
+ logger.info(
+ f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+ "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+ )
+
+ # Set seed before initializing model.
+ set_seed(training_args.seed)
+
+ # Set a numpy random state for FIM transformations
+ np_rng = np.random.RandomState(seed=training_args.seed)
+
+ # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
+ # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
+ # (the dataset will be downloaded automatically from the datasets Hub).
+ #
+ # For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
+ # 'text' is found. You can easily tweak this behavior (see below).
+ #
+ # In distributed training, the load_dataset function guarantee that only one local process can concurrently
+ # download the dataset.
+ if data_args.dataset_name is not None:
+ # Downloading and loading a dataset from the hub.
+ raw_datasets = load_dataset(
+ data_args.dataset_name,
+ data_args.dataset_config_name,
+ cache_dir=model_args.cache_dir,
+ token=model_args.token,
+ streaming=data_args.streaming,
+ )
+ if "validation" not in raw_datasets.keys():
+ raw_datasets["validation"] = load_dataset(
+ data_args.dataset_name,
+ data_args.dataset_config_name,
+ split=f"train[:{data_args.validation_split_percentage}%]",
+ cache_dir=model_args.cache_dir,
+ token=model_args.token,
+ streaming=data_args.streaming,
+ )
+ raw_datasets["train"] = load_dataset(
+ data_args.dataset_name,
+ data_args.dataset_config_name,
+ split=f"train[{data_args.validation_split_percentage}%:]",
+ cache_dir=model_args.cache_dir,
+ token=model_args.token,
+ streaming=data_args.streaming,
+ )
+ else:
+ data_files = {}
+ dataset_args = {}
+ if data_args.train_file is not None:
+ data_files["train"] = data_args.train_file
+ if data_args.validation_file is not None:
+ data_files["validation"] = data_args.validation_file
+ extension = (
+ data_args.train_file.split(".")[-1]
+ if data_args.train_file is not None
+ else data_args.validation_file.split(".")[-1]
+ )
+ if extension == "txt":
+ extension = "text"
+ dataset_args["keep_linebreaks"] = data_args.keep_linebreaks
+ raw_datasets = load_dataset(
+ extension,
+ data_files=data_files,
+ cache_dir=model_args.cache_dir,
+ token=model_args.token,
+ **dataset_args,
+ )
+ # If no validation data is there, validation_split_percentage will be used to divide the dataset.
+ if "validation" not in raw_datasets.keys():
+ raw_datasets["validation"] = load_dataset(
+ extension,
+ data_files=data_files,
+ split=f"train[:{data_args.validation_split_percentage}%]",
+ cache_dir=model_args.cache_dir,
+ token=model_args.token,
+ **dataset_args,
+ )
+ raw_datasets["train"] = load_dataset(
+ extension,
+ data_files=data_files,
+ split=f"train[{data_args.validation_split_percentage}%:]",
+ cache_dir=model_args.cache_dir,
+ token=model_args.token,
+ **dataset_args,
+ )
+
+ # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
+ # https://huggingface.co/docs/datasets/loading_datasets.html.
+
+ # Load pretrained model and tokenizer
+ #
+ # Distributed training:
+ # The .from_pretrained methods guarantee that only one local process can concurrently
+ # download model & vocab.
+
+ config_kwargs = {
+ "cache_dir": model_args.cache_dir,
+ "revision": model_args.model_revision,
+ "token": model_args.token,
+ "trust_remote_code": model_args.trust_remote_code,
+ }
+ if model_args.config_name:
+ config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
+ elif model_args.model_name_or_path:
+ config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
+ else:
+ config = CONFIG_MAPPING[model_args.model_type]()
+ logger.warning("You are instantiating a new config instance from scratch.")
+ if model_args.config_overrides is not None:
+ logger.info(f"Overriding config: {model_args.config_overrides}")
+ config.update_from_string(model_args.config_overrides)
+ logger.info(f"New config: {config}")
+
+ tokenizer_kwargs = {
+ "cache_dir": model_args.cache_dir,
+ "use_fast": model_args.use_fast_tokenizer,
+ "revision": model_args.model_revision,
+ "token": model_args.token,
+ "trust_remote_code": model_args.trust_remote_code,
+ }
+ if model_args.tokenizer_name:
+ tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
+ elif model_args.model_name_or_path:
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
+ else:
+ raise ValueError(
+ "You are instantiating a new tokenizer from scratch. This is not supported by this script. "
+ "You can do it from another script, save it, and load it from here, using --tokenizer_name."
+ )
+
+ if model_args.model_name_or_path:
+ torch_dtype = (
+ model_args.torch_dtype
+ if model_args.torch_dtype in ["auto", None]
+ else getattr(torch, model_args.torch_dtype)
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+ model_args.model_name_or_path,
+ from_tf=bool(".ckpt" in model_args.model_name_or_path),
+ config=config,
+ cache_dir=model_args.cache_dir,
+ revision=model_args.model_revision,
+ token=model_args.token,
+ trust_remote_code=model_args.trust_remote_code,
+ torch_dtype=torch_dtype,
+ low_cpu_mem_usage=model_args.low_cpu_mem_usage,
+ attn_implementation=model_args.attn_implementation,
+ )
+
+ else:
+ model = AutoModelForCausalLM.from_config(
+ config,
+ trust_remote_code=model_args.trust_remote_code,
+ attn_implementation=model_args.attn_implementation,
+ )
+ n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values())
+ logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")
+
+ # Add the new FIM tokens to the tokenizer and resize model's vocab embeddings
+ special_tokens = [data_args.fim_prefix_token, data_args.fim_middle_token, data_args.fim_suffix_token]
+ if data_args.truncate_or_pad:
+ special_tokens.append(data_args.pad_token)
+
+ # Get the factor by which the embedding layer should be padded based on the device
+ pad_factor = 1
+    if torch.cuda.is_available():
+ pad_factor = 8
+
+ elif is_torch_tpu_available():
+ pad_factor = 128
+
+ # Add the new tokens to the tokenizer
+ tokenizer.add_tokens(special_tokens)
+ original_embeddings = model.get_input_embeddings()
+
+ if is_deepspeed_zero3_enabled():
+ import deepspeed
+
+ with deepspeed.zero.GatheredParameters(original_embeddings.weight, modifier_rank=0):
+ # Get the pre-expansion embeddings of the model and resize the embedding layer
+ model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=pad_factor)
+ embeddings = model.get_input_embeddings()
+
+ # Sample the embeddings for the new tokens from a multivariate normal distribution
+ # We do this so that the new embeddings are close to the original embeddings and not necessarily zero
+ # More on this: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
+ mean = original_embeddings.mean(dim=0)
+ n = original_embeddings.size()[0]
+ sigma = ((original_embeddings - mean).T @ (original_embeddings - mean)) / n
+ dist = torch.distributions.multivariate_normal.MultivariateNormal(
+ mean,
+ covariance_matrix=1e-5 * sigma,
+ )
+ new_token_embeddings = torch.stack(
+ tuple((dist.sample() for _ in range(len(special_tokens)))),
+ dim=0,
+ )
+ else:
+ original_embeddings = model.get_input_embeddings()
+ # Get the pre-expansion embeddings of the model and resize the embedding layer
+ model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=pad_factor)
+ embeddings = model.get_input_embeddings()
+
+ # Sample the embeddings for the new tokens from a multivariate normal distribution
+ # We do this so that the new embeddings are close to the original embeddings and not necessarily zero
+ # More on this: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
+ mean = original_embeddings.mean(dim=0)
+ n = original_embeddings.size()[0]
+ sigma = ((original_embeddings - mean).T @ (original_embeddings - mean)) / n
+ dist = torch.distributions.multivariate_normal.MultivariateNormal(
+ mean,
+ covariance_matrix=1e-5 * sigma,
+ )
+ new_token_embeddings = torch.stack(
+ tuple((dist.sample() for _ in range(len(special_tokens)))),
+ dim=0,
+ )
+
+ if is_deepspeed_zero3_enabled():
+ import deepspeed
+
+ with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=0):
+ # Set the new tokens' embeddings to the newly sampled embeddings
+ embeddings.weight.data[-len(special_tokens) :] = new_token_embeddings
+ else:
+ # Set the new tokens' embeddings to the newly sampled embeddings
+ embeddings.weight.data[-len(special_tokens) :] = new_token_embeddings
+
+ # Update the model's embeddings with the new embeddings
+ model.set_input_embeddings(embeddings)
+
+ logger.info("Added special tokens to the tokenizer and resized model's embedding layer")
+
+ # Preprocessing the datasets.
+ # First we tokenize all the texts.
+ if training_args.do_train:
+ column_names = list(raw_datasets["train"].features)
+ else:
+ column_names = list(raw_datasets["validation"].features)
+ text_column_name = "text" if "text" in column_names else column_names[0]
+
+ # since this will be pickled to avoid _LazyModule error in Hasher force logger loading before tokenize_function
+ tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")
+
+ def tokenize_function(examples):
+ with CaptureLogger(tok_logger) as cl:
+ output = tokenizer(examples[text_column_name])
+ # clm-fim input could be much much longer than block_size
+ if "Token indices sequence length is longer than the" in cl.out:
+ tok_logger.warning(
+ "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits"
+ " before being passed to the model."
+ )
+ return output
+
+ with training_args.main_process_first(desc="dataset map tokenization"):
+ if not data_args.streaming:
+ tokenized_datasets = raw_datasets.map(
+ tokenize_function,
+ batched=True,
+ num_proc=data_args.preprocessing_num_workers,
+ remove_columns=column_names,
+ load_from_cache_file=not data_args.overwrite_cache,
+ desc="Running tokenizer on dataset",
+ )
+ else:
+ tokenized_datasets = raw_datasets.map(
+ tokenize_function,
+ batched=True,
+ remove_columns=column_names,
+ )
+
+ if data_args.block_size is None:
+ block_size = tokenizer.model_max_length
+ if block_size > config.max_position_embeddings:
+ logger.warning(
+ f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
+ f"Using block_size={min(1024, config.max_position_embeddings)} instead. You can change that default value by passing --block_size xxx."
+ )
+ block_size = min(1024, config.max_position_embeddings)
+ else:
+ if data_args.block_size > tokenizer.model_max_length:
+ logger.warning(
+ f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model "
+ f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
+ )
+ block_size = min(data_args.block_size, tokenizer.model_max_length)
+
+ # Data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
+ def group_texts(examples):
+ # Concatenate all texts.
+ concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
+ total_length = len(concatenated_examples[list(examples.keys())[0]])
+ # We drop the small remainder, and if the total_length < block_size we exclude this batch and return an empty dict.
+ # We could add padding if the model supported it instead of this drop, you can customize this part to your needs.
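+        # For example, with block_size=4 and 10 concatenated tokens, total_length becomes 8 and
+        # the last 2 tokens are dropped.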
+ total_length = (total_length // block_size) * block_size
+ # Split by chunks of max_len.
+ result = {
+ k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
+ for k, t in concatenated_examples.items()
+ }
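+        # For causal LM / FIM training, labels are simply a copy of input_ids; the model shifts
+        # them internally when computing the loss.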
+ result["labels"] = result["input_ids"].copy()
+ return result
+
+ # Get the FIM-specific token ids
+ prefix_tok_id = tokenizer.convert_tokens_to_ids(data_args.fim_prefix_token)
+ middle_tok_id = tokenizer.convert_tokens_to_ids(data_args.fim_middle_token)
+ suffix_tok_id = tokenizer.convert_tokens_to_ids(data_args.fim_suffix_token)
+ pad_tok_id = None
+
+ # If truncate_or_pad is on, also get pad token id
+ if data_args.truncate_or_pad:
+ pad_tok_id = tokenizer.convert_tokens_to_ids(data_args.pad_token)
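+        # pad_tok_id is used by fim_transform below to pad the suffix whenever the transformed
+        # example would otherwise end up shorter than the original.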
+
+    # The two functions below perform the FIM transformation on the data (PSM, SPM, or a mix of both).
+    # Don't call fim_transform directly in .map(); use apply_fim below instead.
+ # Adapted from https://github.com/loubnabnl/santacoder-finetuning/blob/main/fim.py#L22C13-L83
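+    # For illustration: if example = [t0, t1, t2, t3, t4, t5] and the two sampled boundaries are
+    # (2, 4), then prefix = [t0, t1], middle = [t2, t3] and suffix = [t4, t5]. PSM produces
+    # [PRE] + prefix + [SUF] + suffix + [MID] + middle, while SPM produces
+    # [PRE, SUF] + suffix + [MID] + prefix + middle (PRE/SUF/MID being the sentinel token ids).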
+ def fim_transform(example):
+        """
+        Apply the FIM transformation to a single example (a list of token ids) with probability
+        `fim_rate`, producing a PSM- or SPM-ordered sequence with the FIM sentinel tokens
+        inserted; otherwise return the example unchanged.
+        """
+ if np_rng.binomial(1, data_args.fim_rate):
+ boundaries = sorted(np_rng.randint(low=0, high=len(example) + 1, size=2))
+
+ prefix = example[: boundaries[0]]
+ middle = example[boundaries[0] : boundaries[1]]
+ suffix = example[boundaries[1] :]
+
+ if data_args.truncate_or_pad:
+ total_length = len(prefix) + len(middle) + len(suffix) + 3
+ diff = total_length - len(example)
+ if diff > 0:
+ suffix = suffix[: max(0, len(suffix) - diff)]
+ elif diff < 0:
+ suffix.extend([pad_tok_id] * (-diff))
+
+ if np_rng.binomial(1, data_args.fim_spm_rate):
+ # Apply Suffix-Prefix-Middle (SPM) transformation
+ transformed_example = [prefix_tok_id, suffix_tok_id] + suffix + [middle_tok_id] + prefix + middle
+ else:
+ # Apply Prefix-Suffix-Middle (PSM) transformation
+ transformed_example = [prefix_tok_id] + prefix + [suffix_tok_id] + suffix + [middle_tok_id] + middle
+ else:
+ transformed_example = example
+
+ return transformed_example
+
+    # The function below is the one to pass to .map()
+ def apply_fim(examples):
+ """
+ Apply FIM transformation to a batch of examples
+ """
+ fim_transform_ids = [fim_transform(ids) for ids in examples["input_ids"]]
+ examples["input_ids"] = fim_transform_ids
+ examples["labels"] = fim_transform_ids
+        # The FIM transformation can change the number of tokens in input_ids and labels while
+        # leaving attention_mask untouched, which would cause a length mismatch, so the attention
+        # mask is rebuilt here. If your application requires a custom attention mask, adjust the
+        # line below.
+        examples["attention_mask"] = [[1] * len(ids) for ids in examples["input_ids"]]
+ return examples
+
+ # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a remainder
+ # for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value might be slower
+ # to preprocess.
+ #
+ # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
+ # https://huggingface.co/docs/datasets/process#map
+
+    # The FIM transformation must be applied before group_texts; otherwise, the probabilistic
+    # insertion of FIM-specific tokens would leave some sequences 3-4 tokens longer than others,
+    # which would raise errors downstream.
+ with training_args.main_process_first(desc="processing texts together"):
+ if not data_args.streaming:
+ fim_datasets = tokenized_datasets.map(
+ apply_fim,
+ batched=True,
+ num_proc=data_args.preprocessing_num_workers,
+ load_from_cache_file=not data_args.overwrite_cache,
+ desc="Performing FIM transformation",
+ )
+ lm_datasets = fim_datasets.map(
+ group_texts,
+ batched=True,
+ num_proc=data_args.preprocessing_num_workers,
+ load_from_cache_file=not data_args.overwrite_cache,
+ desc=f"Grouping texts in chunks of {block_size}",
+ )
+ else:
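+            # In streaming mode, examples are processed on the fly, so num_proc and
+            # load_from_cache_file do not apply.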
+ fim_datasets = tokenized_datasets.map(
+ apply_fim,
+ batched=True,
+ )
+ lm_datasets = fim_datasets.map(
+ group_texts,
+ batched=True,
+ )
+
+ if training_args.do_train:
+ if "train" not in tokenized_datasets:
+ raise ValueError("--do_train requires a train dataset")
+ train_dataset = lm_datasets["train"]
+ if data_args.max_train_samples is not None:
+ max_train_samples = min(len(train_dataset), data_args.max_train_samples)
+ train_dataset = train_dataset.select(range(max_train_samples))
+
+ if training_args.do_eval:
+ if "validation" not in tokenized_datasets:
+ raise ValueError("--do_eval requires a validation dataset")
+ eval_dataset = lm_datasets["validation"]
+ if data_args.max_eval_samples is not None:
+ max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)
+ eval_dataset = eval_dataset.select(range(max_eval_samples))
+
+ def preprocess_logits_for_metrics(logits, labels):
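+        # Keep only the predicted token ids (argmax) instead of the full vocabulary-sized logits,
+        # which drastically reduces the memory needed to accumulate predictions during evaluation.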
+ if isinstance(logits, tuple):
+ # Depending on the model and config, logits may contain extra tensors,
+ # like past_key_values, but logits always come first
+ logits = logits[0]
+ return logits.argmax(dim=-1)
+
+ metric = evaluate.load("accuracy")
+
+ def compute_metrics(eval_preds):
+ preds, labels = eval_preds
+        # preds have the same shape as the labels after the argmax(-1) computed in
+        # preprocess_logits_for_metrics, but we still need to shift them so that
+        # predictions and reference tokens line up
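+        # The prediction at position i is the model's guess for the token at position i + 1,
+        # so preds[:, :-1] is compared against labels[:, 1:].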
+ labels = labels[:, 1:].reshape(-1)
+ preds = preds[:, :-1].reshape(-1)
+ return metric.compute(predictions=preds, references=labels)
+
+ # Initialize our Trainer
+ trainer = Trainer(
+ model=model,
+ args=training_args,
+ train_dataset=train_dataset if training_args.do_train else None,
+ eval_dataset=eval_dataset if training_args.do_eval else None,
+ tokenizer=tokenizer,
+ # Data collator will default to DataCollatorWithPadding, so we change it.
+ data_collator=default_data_collator,
+ compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
+ preprocess_logits_for_metrics=(
+ preprocess_logits_for_metrics if training_args.do_eval and not is_torch_tpu_available() else None
+ ),
+ )
+
+ # Training
+ if training_args.do_train:
+ checkpoint = None
+ if training_args.resume_from_checkpoint is not None:
+ checkpoint = training_args.resume_from_checkpoint
+ elif last_checkpoint is not None:
+ checkpoint = last_checkpoint
+ train_result = trainer.train(resume_from_checkpoint=checkpoint)
+ trainer.save_model() # Saves the tokenizer too for easy upload
+
+ metrics = train_result.metrics
+
+ max_train_samples = (
+ data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
+ )
+ metrics["train_samples"] = min(max_train_samples, len(train_dataset))
+
+ trainer.log_metrics("train", metrics)
+ trainer.save_metrics("train", metrics)
+ trainer.save_state()
+
+ # Evaluation
+ if training_args.do_eval:
+ logger.info("*** Evaluate ***")
+
+ metrics = trainer.evaluate()
+
+ max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
+ metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
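+        # Perplexity is exp(mean eval cross-entropy loss); guard against overflow when the loss
+        # is very large.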
+ try:
+ perplexity = math.exp(metrics["eval_loss"])
+ except OverflowError:
+ perplexity = float("inf")
+ metrics["perplexity"] = perplexity
+
+ trainer.log_metrics("eval", metrics)
+ trainer.save_metrics("eval", metrics)
+
+ kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
+ if data_args.dataset_name is not None:
+ kwargs["dataset_tags"] = data_args.dataset_name
+ if data_args.dataset_config_name is not None:
+ kwargs["dataset_args"] = data_args.dataset_config_name
+ kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}"
+ else:
+ kwargs["dataset"] = data_args.dataset_name
+
+ if training_args.push_to_hub:
+ trainer.push_to_hub(**kwargs)
+ else:
+ trainer.create_model_card(**kwargs)
+
+
+def _mp_fn(index):
+ # For xla_spawn (TPUs)
+ main()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/examples/pytorch/language-modeling/run_fim_no_trainer.py b/examples/pytorch/language-modeling/run_fim_no_trainer.py
new file mode 100644
index 00000000000000..9f95c92ebf1ea7
--- /dev/null
+++ b/examples/pytorch/language-modeling/run_fim_no_trainer.py
@@ -0,0 +1,913 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for causal language modeling using the
+Fill-in-the-Middle (FIM) objective on a text file or a dataset, without using the HuggingFace Trainer.
+
+Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
+https://huggingface.co/models?filter=text-generation
+"""
+# You can also adapt this script to your own FIM causal language modeling task. Pointers for this are left as comments.
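+#
+# Example invocation (illustrative values; any causal LM checkpoint and text dataset should work):
+#
+# python run_fim_no_trainer.py \
+#     --model_name_or_path gpt2 \
+#     --dataset_name wikitext \
+#     --dataset_config_name wikitext-2-raw-v1 \
+#     --output_dir /tmp/test-fim-no-trainer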
+
+import argparse
+import json
+import logging
+import math
+import os
+import random
+from itertools import chain
+from pathlib import Path
+
+import datasets
+import numpy as np
+import torch
+from accelerate import Accelerator, DistributedType
+from accelerate.logging import get_logger
+from accelerate.utils import set_seed
+from datasets import load_dataset
+from huggingface_hub import Repository, create_repo
+from torch.utils.data import DataLoader
+from tqdm.auto import tqdm
+
+import transformers
+from transformers import (
+ CONFIG_MAPPING,
+ MODEL_MAPPING,
+ AutoConfig,
+ AutoModelForCausalLM,
+ AutoTokenizer,
+ SchedulerType,
+ default_data_collator,
+ get_scheduler,
+ is_deepspeed_zero3_enabled,
+ is_torch_tpu_available,
+)
+from transformers.utils import check_min_version, send_example_telemetry
+from transformers.utils.versions import require_version
+
+
+# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
+check_min_version("4.36.0.dev0")
+
+logger = get_logger(__name__)
+
+require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
+
+MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
+MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(
+        description="Finetune a transformers model on a causal language modeling task using the fill-in-the-middle objective"
+ )
+ parser.add_argument(
+ "--dataset_name",
+ type=str,
+ default=None,
+ help="The name of the dataset to use (via the datasets library).",
+ )
+ parser.add_argument(
+ "--dataset_config_name",
+ type=str,
+ default=None,
+ help="The configuration name of the dataset to use (via the datasets library).",
+ )
+ parser.add_argument(
+ "--train_file", type=str, default=None, help="A csv, txt or a json file containing the training data."
+ )
+ parser.add_argument(
+ "--validation_file", type=str, default=None, help="A csv, txt or a json file containing the validation data."
+ )
+ parser.add_argument(
+ "--validation_split_percentage",
+ default=5,
+ help="The percentage of the train set used as validation set in case there's no validation split",
+ )
+ parser.add_argument(
+ "--model_name_or_path",
+ type=str,
+ help="Path to pretrained model or model identifier from huggingface.co/models.",
+ required=False,
+ )
+ parser.add_argument(
+ "--config_name",
+ type=str,
+ default=None,
+ help="Pretrained config name or path if not the same as model_name",
+ )
+ parser.add_argument(
+ "--tokenizer_name",
+ type=str,
+ default=None,
+ help="Pretrained tokenizer name or path if not the same as model_name",
+ )
+ parser.add_argument(
+ "--use_slow_tokenizer",
+ action="store_true",
+ help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+ )
+ parser.add_argument(
+ "--per_device_train_batch_size",
+ type=int,
+ default=8,
+ help="Batch size (per device) for the training dataloader.",
+ )
+ parser.add_argument(
+ "--per_device_eval_batch_size",
+ type=int,
+ default=8,
+ help="Batch size (per device) for the evaluation dataloader.",
+ )
+ parser.add_argument(
+ "--learning_rate",
+ type=float,
+ default=5e-5,
+ help="Initial learning rate (after the potential warmup period) to use.",
+ )
+ parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
+ parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.")
+ parser.add_argument(
+ "--max_train_steps",
+ type=int,
+ default=None,
+ help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
+ )
+ parser.add_argument(
+ "--gradient_accumulation_steps",
+ type=int,
+ default=1,
+ help="Number of updates steps to accumulate before performing a backward/update pass.",
+ )
+ parser.add_argument(
+ "--lr_scheduler_type",
+ type=SchedulerType,
+ default="linear",
+ help="The scheduler type to use.",
+ choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"],
+ )
+ parser.add_argument(
+ "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler."
+ )
+ parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
+ parser.add_argument("--seed", type=int, default=42, help="A seed for reproducible training.")
+ parser.add_argument(
+ "--model_type",
+ type=str,
+ default=None,
+ help="Model type to use if training from scratch.",
+ choices=MODEL_TYPES,
+ )
+ parser.add_argument(
+ "--block_size",
+ type=int,
+ default=None,
+ help=(
+            "Optional input sequence length after tokenization. The training dataset will be truncated in blocks of"
+ " this size for training. Default to the model max input length for single sentence inputs (take into"
+ " account special tokens)."
+ ),
+ )
+ parser.add_argument(
+ "--fim_rate",
+ type=float,
+ default=0.5,
+ help=(
+            "Optional probability with which the FIM transformation is applied to each example."
+ " Default is 0.5. A rate of 1.0 means every example will undergo FIM transformation,"
+ " while a rate of 0.0 means no example will."
+ ),
+ )
+ parser.add_argument(
+ "--fim_spm_rate",
+ type=float,
+ default=0.5,
+ help=(
+ "Within the examples undergoing FIM transformation, this rate determines the probability"
+            " of applying the Suffix-Prefix-Middle (SPM) transformation."
+ " Default is 0.5. A rate of 1.0 means all FIM transformations will use SPM,"
+ " while a rate of 0.0 means none will."
+ ),
+ )
+ parser.add_argument(
+ "--truncate_or_pad",
+        type=lambda v: str(v).lower() not in ("false", "0", "no"),  # type=bool would treat any non-empty string (even "False") as True
+ default=True,
+ help=(
+ "Indicates whether the transformed example should be truncated or padded to maintain"
+ " the same length as the original example."
+ " Default is True. If False, the function will not truncate or pad the examples."
+ ),
+ )
+ parser.add_argument(
+ "--fim_prefix_token",
+ type=str,
+        default="<fim_prefix>",
+        help="Fill-in-Middle Prefix token. Defaults to '<fim_prefix>'.",
+ )
+ parser.add_argument(
+ "--fim_middle_token",
+ type=str,
+        default="<fim_middle>",
+        help="Fill-in-Middle Middle token. Defaults to '<fim_middle>'.",
+ )
+ parser.add_argument(
+ "--fim_suffix_token",
+ type=str,
+        default="<fim_suffix>",
+        help="Fill-in-Middle Suffix token. Defaults to '<fim_suffix>'.",
+ )
+ parser.add_argument(
+ "--fim_pad_token",
+ type=str,
+        default="<fim_pad>",
+ help=(
+ "Fill-in-Middle Pad token. Used only when 'truncate_or_pad' is set to True." " Defaults to '