Add FlashAttention2 for XLM-RoBERTa #28713

37 changes: 37 additions & 0 deletions docs/source/en/model_doc/xlm-roberta.md
@@ -55,6 +55,14 @@ This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The
language from the input ids.
- Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses masked language modeling on sentences coming from one language.

### Expected speedups

Below is an expected speedup diagram comparing pure inference time between the native implementation in Transformers, using the `FacebookAI/xlm-roberta-base` checkpoint, and the Flash Attention 2 version of the model.

<div style="text-align: center">
<img src="https://private-user-images.githubusercontent.com/74915610/302696384-d93c45c1-3518-4e18-a316-5efb7b70ebb7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDcyMzU4NzYsIm5iZiI6MTcwNzIzNTU3NiwicGF0aCI6Ii83NDkxNTYxMC8zMDI2OTYzODQtZDkzYzQ1YzEtMzUxOC00ZTE4LWEzMTYtNWVmYjdiNzBlYmI3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAyMDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMjA2VDE2MDYxNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI4YTI4NDA4ZDYwZjI4ODQ4Mjc2MjY5MjM4M2FmYjhmZmZmODFkODkyMmQ5ZGY1NDBjMjZjZDgxYTY2NTU3NzAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.a59nTa9CDX5NK9oQ9uLHNNuWZkRePPtQFIo_NemOnZs">
</div>
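
The exact numbers depend on the GPU, batch size, and sequence length. The sketch below shows one way such a comparison could be run; the batch size, sequence length, and number of timing iterations are illustrative assumptions, not the setup used for the diagram above.

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda"
checkpoint = "FacebookAI/xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Dummy batch; batch size and sequence length are arbitrary choices for illustration.
texts = ["Hello world! " * 64] * 8
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)


def time_forward(attn_implementation):
    model = AutoModel.from_pretrained(
        checkpoint, torch_dtype=torch.float16, attn_implementation=attn_implementation
    ).to(device)
    model.eval()
    with torch.no_grad():
        for _ in range(3):  # warmup
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / 10


print("eager:", time_forward("eager"))
print("flash_attention_2:", time_forward("flash_attention_2"))
```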

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with XLM-RoBERTa. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
@@ -113,6 +121,35 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well as the information relative to the inputs and outputs.
</Tip>

## Combining XLM-RoBERTa and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the [flash-attn](https://github.com/Dao-AILab/flash-attention) repository. Also make sure to load your model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
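
As an optional pre-flight check, the sketch below verifies that a CUDA GPU of a supported generation (Ampere or newer, i.e. compute capability 8.0+) is present and that the `flash_attn` package is importable. These checks are an illustrative assumption layered on top of the official guidance, not an API provided by Transformers or flash-attn.

```python
import importlib.util

import torch

# Rough pre-flight check (illustrative only): FlashAttention-2 targets
# Ampere or newer GPUs (compute capability >= 8.0).
assert torch.cuda.is_available(), "FlashAttention-2 requires a CUDA GPU."
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 0), f"Compute capability {major}.{minor} is likely too old for FlashAttention-2."
assert importlib.util.find_spec("flash_attn") is not None, "Install it with: pip install -U flash-attn --no-build-isolation"
print("Looks good:", torch.cuda.get_device_name())
```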

To load and run a model using Flash Attention 2, refer to the snippet below:

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModel

>>> device = "cuda" # the device to load the model onto

>>> tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
>>> model = AutoModel.from_pretrained("FacebookAI/xlm-roberta-base", torch_dtype=torch.float16, attn_implementation="flash_attention_2")

>>> text = "Replace me by any text you'd like."

>>> encoded_input = tokenizer(text, return_tensors='pt').to(device)
>>> model.to(device)

>>> output = model(**encoded_input)
```
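
Flash Attention 2 can be combined with the task-specific heads in the same way. Below is a sketch that fills a masked token with `AutoModelForMaskedLM`; the example sentence is arbitrary.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> device = "cuda"

>>> tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
>>> model = AutoModelForMaskedLM.from_pretrained(
...     "FacebookAI/xlm-roberta-base", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
... ).to(device)

>>> text = f"The capital of France is {tokenizer.mask_token}."
>>> inputs = tokenizer(text, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # Pick the highest-scoring token at the mask position
>>> mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
>>> predicted_id = logits[0, mask_index].argmax(dim=-1)
>>> tokenizer.decode(predicted_id)
```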

## XLMRobertaConfig

[[autodoc]] XLMRobertaConfig
3 changes: 2 additions & 1 deletion docs/source/en/perf_infer_gpu_one.md
@@ -54,6 +54,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel)
* [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model)
* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
* [XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaModel)

You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request.
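
Any model in the list above is enabled the same way, by passing `attn_implementation="flash_attention_2"` to `from_pretrained`. The sketch below additionally falls back to the library's default attention implementation when `flash-attn` is not installed; the fallback logic is an illustration, not part of the Transformers API.

```python
import importlib.util

import torch
from transformers import AutoModel

checkpoint = "FacebookAI/xlm-roberta-base"  # any checkpoint of a supported architecture

# Use Flash Attention 2 only when the flash-attn package is importable;
# otherwise let Transformers pick its default attention implementation.
attn_implementation = (
    "flash_attention_2" if importlib.util.find_spec("flash_attn") is not None else None
)

model = AutoModel.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # FlashAttention-2 requires half precision
    attn_implementation=attn_implementation,
)
```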

@@ -393,4 +394,4 @@ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```