From 6d42f1a2835ce27c3caaf04da07b3f2c2e8c1a23 Mon Sep 17 00:00:00 2001
From: Marut Pandya
Date: Tue, 17 Dec 2024 12:04:50 -0800
Subject: [PATCH] Update README.md

---
 README.md | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4f34b60..5f926f7 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,42 @@
-
+
 
 # OpenAI Compatible TensorRT-LLM Worker
 
+A high-performance inference server that combines TensorRT-LLM's optimized model inference with RunPod's serverless infrastructure. It exposes an OpenAI-compatible API for easy integration with existing applications (see the usage sketch below).
+
+## Features
+
+- TensorRT-LLM optimization for faster inference
+- OpenAI-compatible API endpoints
+- Flexible configuration through environment variables
+- Support for model parallelism (tensor and pipeline)
+- Hugging Face model integration
+- Streaming response support
+- RunPod serverless deployment ready
+
+### Runtime Constraints
+- Batch size and sequence length must be fixed at engine build time
+- Dynamic shape support is limited and may impact performance
+- KV-cache size is fixed at build time and affects memory usage
+- Changing model parameters requires rebuilding the TensorRT engine
+
+### Build Time Impact
+- Engine building can take significant time (hours for large models)
+- Each combination of parameters requires a separate engine
+- Changes to maximum sequence length or batch size require rebuilding the engine (see the build sketch below)
+
+## Environment Variables
+
+The server can be configured using the following environment variables:
+```plaintext
+TRTLLM_MODEL                              # Required: Path or name of the model to load
+TRTLLM_TOKENIZER                          # Optional: Path or name of the tokenizer (defaults to the model path)
+TRTLLM_MAX_BEAM_WIDTH                     # Optional: Maximum beam width for beam search
+TRTLLM_MAX_BATCH_SIZE                     # Optional: Maximum batch size for inference
+TRTLLM_MAX_NUM_TOKENS                     # Optional: Maximum number of tokens to generate
+TRTLLM_MAX_SEQ_LEN                        # Optional: Maximum sequence length
+TRTLLM_TP_SIZE                            # Optional: Tensor parallelism size (default: 1)
+TRTLLM_PP_SIZE                            # Optional: Pipeline parallelism size (default: 1)
+TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION  # Optional: GPU memory fraction reserved for the KV cache (default: 0.9)
+TRTLLM_TRUST_REMOTE_CODE                  # Optional: Whether to trust remote code (default: false)
+HF_TOKEN                                  # Optional: Hugging Face API token for gated/protected models
+```
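+
+For example, a single-GPU Llama deployment might be configured as shown below. The values are purely illustrative placeholders, not defaults shipped with this worker:
+
+```plaintext
+TRTLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct   # placeholder model id
+TRTLLM_MAX_BATCH_SIZE=8
+TRTLLM_MAX_SEQ_LEN=4096
+TRTLLM_TP_SIZE=1
+TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION=0.9
+HF_TOKEN=hf_xxxxxxxx                            # only needed for gated models
+```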
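+
+Most of these limits are compiled into the TensorRT engine (see Runtime Constraints and Build Time Impact above), so changing them means building a new engine. The sketch below illustrates the idea with TensorRT-LLM's Python LLM API; it is not this worker's actual build path, and the exact class and argument names vary between TensorRT-LLM releases.
+
+```python
+# Illustrative only: shows that batch size, sequence length, and parallelism
+# are baked into the engine at build time. API names may differ by release.
+from tensorrt_llm import LLM, BuildConfig
+
+build_config = BuildConfig(
+    max_batch_size=8,    # must match TRTLLM_MAX_BATCH_SIZE at serve time
+    max_seq_len=4096,    # changing this later requires a rebuild
+    max_num_tokens=8192,
+)
+
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
+    tensor_parallel_size=1,                    # analogous to TRTLLM_TP_SIZE
+    build_config=build_config,
+)
+
+# Persisting the built engine lets later starts skip the expensive build step.
+llm.save("/models/llama-3.1-8b-engine")
+```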
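+
+## Usage
+
+Because the endpoints are OpenAI-compatible, the standard OpenAI Python client can talk to a deployed worker. The sketch below assumes RunPod's OpenAI-compatible route (`https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1`) and a placeholder model name; substitute your own endpoint ID, API key, and the model set via `TRTLLM_MODEL`.
+
+```python
+# Minimal sketch using the official openai client against a RunPod endpoint.
+# Base URL pattern, endpoint ID, and model name are assumptions/placeholders.
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
+    api_key="<RUNPOD_API_KEY>",
+)
+
+# Regular (non-streaming) chat completion
+response = client.chat.completions.create(
+    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever TRTLLM_MODEL points at
+    messages=[{"role": "user", "content": "Summarize TensorRT-LLM in one sentence."}],
+    max_tokens=128,
+)
+print(response.choices[0].message.content)
+
+# Streaming variant, relying on the worker's streaming support
+stream = client.chat.completions.create(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    messages=[{"role": "user", "content": "Say hello."}],
+    stream=True,
+)
+for chunk in stream:
+    delta = chunk.choices[0].delta.content
+    if delta:
+        print(delta, end="", flush=True)
+```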