docs: Flash Attention Conceptual Guide (#892)
PR for conceptual guide on flash attention. I will add more info unless
I'm told otherwise.

---------

Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: Omar Sanseviero <[email protected]>
3 people authored Sep 6, 2023
1 parent 059bb5c commit f260eb7
Showing 2 changed files with 14 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -21,4 +21,6 @@
- sections:
  - local: conceptual/streaming
    title: Streaming
  - local: conceptual/flash_attention
    title: Flash Attention
  title: Conceptual Guides
12 changes: 12 additions & 0 deletions docs/source/conceptual/flash_attention.md
@@ -0,0 +1,12 @@
# Flash Attention

Scaling the transformer architecture is heavily bottlenecked by the self-attention mechanism, which has quadratic time and memory complexity in the sequence length. Recent developments in accelerator hardware have mainly focused on increasing compute capacity rather than on memory and on transferring data between hardware components. This leaves the attention operation bottlenecked by memory. **Flash Attention** is an attention algorithm that reduces this problem and scales transformer-based models more efficiently, enabling faster training and inference.
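To make the quadratic memory cost concrete, here is a small back-of-the-envelope sketch. It assumes a single attention head whose full score matrix is stored in half precision (2 bytes per element); the helper name `attention_scores_bytes` is made up for this illustration and is not part of any library.

```python
def attention_scores_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to store one seq_len x seq_len attention score matrix.

    Assumption for illustration: one head, half precision (2 bytes per element).
    """
    return seq_len * seq_len * bytes_per_element


for n in (1024, 2048, 4096):
    mib = attention_scores_bytes(n) / 2**20
    print(f"seq_len={n}: {mib:.0f} MiB per head")
# seq_len=1024: 2 MiB per head
# seq_len=2048: 8 MiB per head
# seq_len=4096: 32 MiB per head
```

Doubling the sequence length quadruples the size of the score matrix, and this is per head and per sequence in the batch.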

The standard attention mechanism uses High Bandwidth Memory (HBM) to store, read, and write keys, queries, and values. HBM has a large capacity but is slow to access, whereas on-chip SRAM is much smaller but much faster. In the standard attention implementation, the cost of moving keys, queries, and values to and from HBM is high: it loads them from HBM into GPU on-chip SRAM, performs a single step of the attention mechanism, writes the result back to HBM, and repeats this for every attention step. Flash Attention instead loads keys, queries, and values once, fuses the operations of the attention mechanism, and writes the result back to HBM.

![Flash Attention](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png)
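The fusion described above relies on processing keys and values in blocks while keeping a running softmax, so the full score matrix never has to be written out. The following is a minimal pure-Python sketch of that idea, not the actual GPU kernel: the function names, the list-of-lists layout, and `block_size` are assumptions made for illustration, and the blocks only stand in for tiles that would fit in SRAM.

```python
import math


def standard_attention(q, k, v):
    """Reference version: builds the full row of scores for each query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Processes keys/values one block at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator so far
        acc = [0.0] * len(v[0])   # unnormalized output so far
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                for c, a in enumerate(acc)
            ]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [2.0, 2.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

reference = standard_attention(q, k, v)
tiled = tiled_attention(q, k, v, block_size=2)
max_diff = max(
    abs(a - b) for row_r, row_t in zip(reference, tiled) for a, b in zip(row_r, row_t)
)
print(max_diff < 1e-9)
```

Both functions compute the same result up to floating-point rounding. The difference is that the tiled version only ever holds one block of scores at a time, which is what lets a fused kernel keep intermediate values on chip instead of round-tripping through HBM.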

In text-generation-inference, Flash Attention is implemented for supported models. You can find the complete list in the [models directory](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models): the models that support Flash Attention are those with the `flash` prefix.

You can learn more about Flash Attention by reading the [paper](https://arxiv.org/abs/2205.14135).
