update doc
mht-sharma committed Jun 24, 2024
1 parent 001ec09 commit 3cc2f4e
Showing 1 changed file with 1 addition and 6 deletions.
7 changes: 1 addition & 6 deletions docs/source/basic_tutorials/fp8_kv_cache.md
@@ -1,11 +1,6 @@
 # Accelerating Inference with FP8 KV Cache

-Text Generation Inference (TGI) now supports FP8 KV Cache, enhancing inference speed on both Nvidia and AMD GPUs. This feature significantly boosts performance and memory efficiency, enabling faster and more scalable text generation.
-
-By quantizing the KV cache to 8-bit floating point (FP8) formats, we can greatly reduce the memory footprint. This reduction allows for:
-* Increased token storage capacity in the cache
-* Improved throughput in text generation tasks
-* More efficient GPU memory utilization
+Text Generation Inference (TGI) now supports FP8 KV Cache, enhancing inference speed on both Nvidia and AMD GPUs. This feature significantly boosts performance and memory efficiency, enabling faster and more scalable text generation. By quantizing the KV cache to 8-bit floating point (FP8) formats, we can greatly reduce the memory footprint. This reduction allows for improved throughput in text generation tasks.

 ## FP8 Formats: E4M3 and E5M2
 The Open Compute Project (OCP) defines two common 8-bit floating point data formats:
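
For a sense of the savings the doc describes, here is a minimal sketch, not part of the commit: the model shapes are illustrative (a GQA model with 32 layers, 8 KV heads, head dimension 128), and it assumes PyTorch's `float8_e4m3fn`/`float8_e5m2` dtypes (available since PyTorch 2.1). It compares the per-token KV cache footprint in FP16 vs FP8 and shows the range/precision trade-off between the two OCP formats:

```python
import torch

# Per-token KV cache footprint: 2 tensors (K and V) per layer,
# each of shape [num_kv_heads, head_dim].
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype: torch.dtype) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * torch.finfo(dtype).bits // 8

# Illustrative shapes only (hypothetical model, not from the doc).
print(kv_cache_bytes_per_token(32, 8, 128, torch.float16))        # 131072 bytes
print(kv_cache_bytes_per_token(32, 8, 128, torch.float8_e4m3fn))  # 65536 bytes -> half the footprint

# The two OCP FP8 formats trade precision for dynamic range:
# E4M3 (4 exponent, 3 mantissa bits) has finer precision but a smaller range;
# E5M2 (5 exponent, 2 mantissa bits) covers a wider range with coarser precision.
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
```

Halving the bytes per cached token means the same GPU memory budget holds roughly twice as many tokens, which is where the throughput gain in the paragraph above comes from.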
