From 3cc2f4e9fabe043b1d4896e04e5fc30ec401a38f Mon Sep 17 00:00:00 2001
From: Mohit Sharma
Date: Mon, 24 Jun 2024 14:50:16 +0000
Subject: [PATCH] update doc

---
 docs/source/basic_tutorials/fp8_kv_cache.md | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/docs/source/basic_tutorials/fp8_kv_cache.md b/docs/source/basic_tutorials/fp8_kv_cache.md
index 64e4539f0b3..012471d0a1a 100644
--- a/docs/source/basic_tutorials/fp8_kv_cache.md
+++ b/docs/source/basic_tutorials/fp8_kv_cache.md
@@ -1,11 +1,6 @@
 # Accelerating Inference with FP8 KV Cache
 
-Text Generation Inference (TGI) now supports FP8 KV Cache, enhancing inference speed on both Nvidia and AMD GPUs. This feature significantly boosts performance and memory efficiency, enabling faster and more scalable text generation.
-
-By quantizing the KV cache to 8-bit floating point (FP8) formats, we can greatly reduce the memory footprint. This reduction allows for:
-* Increased token storage capacity in the cache
-* Improved throughput in text generation tasks
-* More efficient GPU memory utilization
+Text Generation Inference (TGI) now supports FP8 KV Cache, enhancing inference speed on both Nvidia and AMD GPUs. This feature significantly boosts performance and memory efficiency, enabling faster and more scalable text generation. By quantizing the KV cache to 8-bit floating point (FP8) formats, we can greatly reduce the memory footprint. This reduction allows for improved throughput in text generation tasks
 
 ## FP8 Formats: E4M3 and E5M2
 The Open Compute Project (OCP) defines two common 8-bit floating point data formats: