From 9e133c9cfdca64cd8be7d1d5445883ad7eb806b1 Mon Sep 17 00:00:00 2001
From: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Date: Thu, 7 Dec 2023 12:52:09 +0000
Subject: [PATCH] add tip about torch.jit.trace and move bt doc below sdpa

---
 docs/source/en/perf_infer_gpu_one.md | 84 +++++++++++++++-------------
 1 file changed, 45 insertions(+), 39 deletions(-)

diff --git a/docs/source/en/perf_infer_gpu_one.md b/docs/source/en/perf_infer_gpu_one.md
index 6a31c41b6fa5d2..ef58b879ef5f71 100644
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@@ -142,46 +142,9 @@ FlashAttention is more memory efficient, meaning you can train on much larger se
-## BetterTransformer
-
-<Tip>
-
-Part of BetterTransformer features are being upstreamed in Transformers, with native `torch.nn.scaled_dot_product_attention` default support. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to support natively SDPA in Transformers.
-
-</Tip>
-
-<Tip>
-
-Check out our benchmarks with BetterTransformer and scaled dot product attention in the [Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0](https://pytorch.org/blog/out-of-the-box-acceleration/) and learn more about the fastpath execution in the [BetterTransformer](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) blog post.
-
-</Tip>
-
-BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. The two optimizations in the fastpath execution are:
-
-1. fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps
-2. skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors
-
-BetterTransformer also converts all attention operations to use the more memory-efficient [scaled dot product attention (SDPA)](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), and it calls optimized kernels like [FlashAttention](https://huggingface.co/papers/2205.14135) under the hood.
-
-Before you start, make sure you have 🤗 Optimum [installed](https://huggingface.co/docs/optimum/installation).
-
-Then you can enable BetterTransformer with the [`PreTrainedModel.to_bettertransformer`] method:
-
-```python
-model = model.to_bettertransformer()
-```
-
-You can return the original Transformers model with the [`~PreTrainedModel.reverse_bettertransformer`] method. You should use this before saving your model to use the canonical Transformers modeling:
-
-```py
-model = model.reverse_bettertransformer()
-model.save_pretrained("saved_model")
-```
-
-### FlashAttention and memory-efficient attention through PyTorch's scaled_dot_product_attention
+## FlashAttention and memory-efficient attention through PyTorch's scaled_dot_product_attention

-PyTorch's `torch.nn.functional.scaled_dot_product_attention` (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers, and is used by default for `torch>=2.1.1` when an implementation is available.
+PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers, and is used by default for `torch>=2.1.1` when an implementation is available.
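
For reference, here is a minimal sketch of requesting SDPA explicitly when loading a model. The `facebook/opt-350m` checkpoint, the prompt, and the generation settings are only illustrative, and it assumes Transformers 4.36 or later plus a CUDA GPU:

```python
# Minimal sketch: load a model with SDPA explicitly requested and run generation.
# The checkpoint name and prompt are only examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # also the default when an SDPA implementation is available
).to("cuda")

inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to("cuda")

# Optionally restrict SDPA to the FlashAttention backend only.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```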

For now, Transformers supports inference and training through SDPA for the following architectures:
* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel)
@@ -222,6 +185,49 @@ RuntimeError: No available kernel. Aborting execution.
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
```
+
+<Tip>
+
+As of Transformers 4.36, attention modules using `torch.nn.functional.scaled_dot_product_attention` do not support tracing through [`torch.jit.trace`](https://pytorch.org/docs/stable/generated/torch.jit.trace.html). Please load your model with the argument `attn_implementation="eager"` in [`~PreTrainedModel.from_pretrained`] in order to export to TorchScript through `torch.jit.trace`.
+
+</Tip>
+
+## BetterTransformer
+
+<Tip>
+
+Some BetterTransformer features are being upstreamed to Transformers, with native `torch.nn.functional.scaled_dot_product_attention` support enabled by default. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers.
+
+</Tip>
+
+<Tip>
+
+Check out our benchmarks with BetterTransformer and scaled dot product attention in the [Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0](https://pytorch.org/blog/out-of-the-box-acceleration/) and learn more about the fastpath execution in the [BetterTransformer](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) blog post.
+
+</Tip>
+
+BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. The two optimizations in the fastpath execution are:
+
+1. fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps
+2. skipping padding tokens to avoid unnecessary computation, exploiting their inherent sparsity with nested tensors
+
+BetterTransformer also converts all attention operations to use the more memory-efficient [scaled dot product attention (SDPA)](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), and it calls optimized kernels like [FlashAttention](https://huggingface.co/papers/2205.14135) under the hood.
+
+Before you start, make sure you have 🤗 Optimum [installed](https://huggingface.co/docs/optimum/installation).
+
+Then you can enable BetterTransformer with the [`PreTrainedModel.to_bettertransformer`] method:
+
+```python
+model = model.to_bettertransformer()
+```
+
+You can return the original Transformers model with the [`~PreTrainedModel.reverse_bettertransformer`] method. You should use this before saving your model so that it is saved with the canonical Transformers modeling code:
+
+```py
+model = model.reverse_bettertransformer()
+model.save_pretrained("saved_model")
+```
+
## bitsandbytes

bitsandbytes is a quantization library that includes support for 4-bit and 8-bit quantization. Quantization reduces your model size compared to its native full precision version, making it easier to fit large models onto GPUs with limited memory.
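
As a rough illustration of that workflow, the sketch below loads a model in 4-bit with a `BitsAndBytesConfig`. The checkpoint name is only an example, and it assumes the `bitsandbytes` and `accelerate` packages are installed:

```python
# Minimal sketch: 4-bit quantized loading with bitsandbytes.
# The checkpoint name is only an example; bitsandbytes and accelerate must be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
    device_map="auto",  # dispatches the quantized weights across the available devices
)

inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```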