Latency and Throughput Inquiry #20
Comments
g5.12xlarge is recommended
Could you please share your config?
I am using g5.12xlarge and it's deployed within a VPC.
With this, I am getting 5-6 seconds of latency for a prompt + question, whereas with the DJL-FasterTransformer container it is sub-second. Is this the expected latency?
Hi @ctandrewtran, do you also do bitsandbytes quantization for FasterTransformer? Wondering if the latency differences are due to quantization!
If you're looking to maximize LLM throughput, LiteLLM now has a router to load-balance requests. Here's the quick start:

Step 1: Create a config.yaml:

```yaml
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
```

Step 2: Start the litellm proxy:

Step 3: Make a request to the LiteLLM proxy:
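The core idea of the config above is that three replicas of the same model (same `model_name`, different `api_base`) get requests spread across them. A minimal sketch of that behavior, assuming simple round-robin selection (this is an illustration only, not LiteLLM's actual routing code; the endpoint URLs are taken from the config above):

```python
from itertools import cycle

# Replica endpoints serving the same model, as listed in config.yaml above.
ENDPOINTS = [
    "http://0.0.0.0:8001",
    "http://0.0.0.0:8002",
    "http://0.0.0.0:8003",
]

_rr = cycle(ENDPOINTS)

def pick_endpoint() -> str:
    """Return the next replica endpoint in round-robin order."""
    return next(_rr)

# Six consecutive requests cycle through each replica twice.
picks = [pick_endpoint() for _ in range(6)]
print(picks)
```

Because each replica holds a full copy of the model, throughput scales roughly with the number of replicas as long as requests are spread evenly, which is what the round-robin selection buys you.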
Hello-
I've been looking into hosting an LLM on AWS Infrastructure. I am mainly looking to host Flan T5 XXL. My question is below
Inquiry: what is the recommended container for hosting Flan T5 XXL?
Context: I've hosted Flan T5 XXL using the TGI container and the DJL-FasterTransformer container. Using the same prompt, TGI takes around 5-6 seconds whereas the DJL-FasterTransformer container takes 0.5-1.5 seconds. The DJL-FasterTransformer container has the tensor-parallel-degree set to 4. SM_NUM_GPUS for TGI was set to 4. Both were hosted on ml.g5.12xlarge.
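When comparing two serving stacks like this, a single request can be misleading (cold starts, warm-up compilation), so it helps to time repeated calls and report percentiles. A generic sketch, where `send_request` is a hypothetical stand-in for whatever client call invokes your endpoint:

```python
import time
import statistics

def measure_latency(send_request, n=20):
    """Time n calls to send_request; return (p50, p95) latency in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[min(n - 1, int(round(0.95 * n)) - 1)]
    return p50, p95

# Example with a stub that sleeps ~10 ms in place of a real model call.
p50, p95 = measure_latency(lambda: time.sleep(0.01), n=10)
print(f"p50={p50:.3f}s p95={p95:.3f}s")
```

Running the same loop against both the TGI and DJL-FasterTransformer endpoints, with identical prompts and a warm-up call discarded, gives a fairer latency comparison than one-off timings.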