Modified TidalDecode-GIF #53

Open
wants to merge 5 commits into base: main
45 changes: 43 additions & 2 deletions _projects/tidaldecode.md
@@ -32,7 +32,7 @@ sparse attention without sacrificing the quality of the generated results.
set of tokens, reducing the overhead of token selection.

{:center: style="text-align: center"}
![image](/img/tidaldecode/TidalDecode-GIF.gif){: width="80%"}
![image](/img/tidaldecode/TidalDecode-GIF.gif){: width="90%"}
{:center}

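The position-persistent selection described above can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the paper's actual kernel: it is single-head, the function names (`topk_token_indices`, `sparse_attention`, `decode_step`) and the choice of selection layers are hypothetical, and the per-layer KV caches are passed in as plain arrays.

```python
import numpy as np

def topk_token_indices(q, K, k):
    # Score every cached token against the current query and keep the
    # k highest-scoring positions.
    scores = K @ q
    return np.sort(np.argsort(scores)[-k:])

def sparse_attention(q, K, V, idx):
    # Standard softmax attention restricted to the selected positions.
    s = (K[idx] @ q) / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[idx]

def decode_step(qs, Ks, Vs, k, selection_layers=(0, 1)):
    # One decoding step across layers: token selection runs only at the
    # designated selection layers, and all later layers reuse the same
    # positions (position persistence), avoiding per-layer re-selection.
    idx = np.arange(Ks[0].shape[0])   # full attention before first selection
    outs = []
    for layer, (q, K, V) in enumerate(zip(qs, Ks, Vs)):
        if layer in selection_layers:
            idx = topk_token_indices(q, K, k)
        outs.append(sparse_attention(q, K, V, idx))
    return outs
```

The point of the sketch is the control flow: `idx` is computed at a selection layer and then simply carried forward, so the top-k search cost is paid once per decoding step rather than once per layer.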
- **KV Cache Correction**. For tokens decoded with sparse attention, the cached key/value representations can deviate from those that full attention would have produced,
@@ -41,8 +41,49 @@ sparse attention without sacrificing the quality of the generated results.
the polluted tokens in the KV cache.

{:center: style="text-align: center"}
![image](/img/tidaldecode/Cache-Correction.jpg){: width="60%"}
![image](/img/tidaldecode/Cache-Correction.jpg){: width="40%"}
{:center}
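The cache-correction idea can be sketched as follows. This is a hedged illustration, not the paper's implementation: `correct_kv_cache` and the `polluted` flag array are hypothetical names, and `recompute_kv` is a caller-supplied stand-in for re-encoding a token position with full attention.

```python
import numpy as np

def correct_kv_cache(cache_K, cache_V, polluted, recompute_kv):
    # Overwrite the cached key/value rows of every position flagged as
    # polluted (i.e., produced under sparse attention and drifted from
    # the full-attention representation), then clear the flags.
    # `recompute_kv(pos)` returns the corrected (key, value) pair.
    for pos in np.flatnonzero(polluted):
        cache_K[pos], cache_V[pos] = recompute_kv(pos)
        polluted[pos] = False
    return cache_K, cache_V
```

Correction can then be triggered periodically during decoding, so the cost of recomputation is amortized over many sparse steps while bounding how far cached entries drift.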

## Evaluations
For all evaluations, we enable only Position Persistent Sparse Attention (with KV Cache Correction disabled) for a fair comparison. Experiments are conducted on a single NVIDIA A100 GPU (80 GB HBM, SXM4) with CUDA 12.2.
- **End-to-end Latency**

{:center: style="text-align: center"}
![image](/img/tidaldecode/llama_e2e_eval.png){: width="90%"}
{:center}

*Figure 1: End-to-end latency on the LLaMA-2-7B model for the full-attention baseline (Full), the state-of-the-art Quest, and TidalDecode (TD) at context lengths of 10K, 32K, and 100K, respectively.*

- **Attention Latency**

{:center: style="text-align: center"}
![image](/img/tidaldecode/llama_latency_eval.png){: width="90%"}
{:center}

*Figure 2: Overall attention latency for different methods on the LLaMA model with (a) 32 and (b) 64 layers. The full-attention model serves as the reference against which TidalDecode's and Quest's attention-latency ratios are shown. In each group, the left/middle/right bars denote the full-attention baseline, Quest, and TidalDecode, respectively.*

- **Accuracy**

{:center: style="text-align: center"}
![image](/img/tidaldecode/llama3_needle_eval.png){: width="90%"}
{:center}

*Figure 3: Needle-in-the-Haystack results at 10K and 100K context lengths for TD+Lx (x denotes recomputing at layer x) and Quest on Llama-3-8B-Instruct-Gradient-1048k. TidalDecode consistently outperforms Quest and achieves full accuracy with 128 tokens in both the 10K- and 100K-context-length tests, which is only 1% and 0.1% of the total input length, respectively.*

## Reference
If you are interested in TidalDecode and want to use it in your project, please consider citing:
```
@misc{yang2024tidaldecodefastaccuratellm,
title={TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention},
author={Lijie Yang and Zhihao Zhang and Zhuofu Chen and Zikun Li and Zhihao Jia},
year={2024},
eprint={2410.05076},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.05076},
}
```

## Resources
- [Codebase](https://github.com/DerrickYLJ/TidalDecode) for reproducing all the results in the paper.
Binary file modified img/tidaldecode/TidalDecode-GIF.gif
Binary file added img/tidaldecode/llama3_needle_eval.png
Binary file added img/tidaldecode/llama_e2e_eval.png
Binary file added img/tidaldecode/llama_latency_eval.png
Binary file removed img/tidaldecode/tmp.pdf