Add sglang example #92
base: main
Conversation
Almost the same suggestions as on the lmdeploy PR.
import subprocess
import sys
import threading
These dependencies are unused in model.py and can be removed.
orjson
python-multipart

--extra-index-url https://flashinfer.ai/whl/cu121/torch2.4/
In prod and dev we have CUDA 12.4; I'm not sure whether this cu121 wheel works with it, needs to be verified.
But I tested on q22, which also has CUDA 12.4, and prediction was successful there, so I don't think this will be an issue.
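If someone wants to double-check on prod, a minimal sketch like the one below would confirm which CUDA version the installed torch wheel targets and whether the flashinfer wheel from the cu121 index at least imports (this assumes torch and flashinfer are already installed in the image; it is an import-level check only, not a full kernel run):

```python
# Hedged sketch: print the CUDA version torch was built against and
# verify that the cu121 flashinfer wheel imports on this machine.
import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)   # e.g. "12.1" for a cu121 build
print("CUDA available at runtime:", torch.cuda.is_available())

try:
    import flashinfer
    print("flashinfer imported OK:", getattr(flashinfer, "__version__", "unknown"))
except ImportError as exc:
    print("flashinfer failed to import:", exc)
```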
I used the requirements below, with pinned dependency versions, to test locally and it worked. I think it's better to include the requirements with their versions here, because previously (I'm not sure why) I was getting an error when I didn't pin the dependency versions:
torch==2.4.0
tokenizers==0.20.2
transformers==4.46.2
accelerate==0.34.2
scipy==1.10.1
optimum==1.23.3
xformers==0.0.27.post2
einops==0.8.0
requests==2.32.2
packaging
ninja
protobuf==3.20.0
sglang[all]==0.3.5.post2
orjson==3.10.11
python-multipart==0.0.17
--extra-index-url https://flashinfer.ai/whl/cu121/torch2.4/
flashinfer
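As a local smoke test for this pinned set, something along these lines could be used (a sketch only: the model path and port are placeholders, and the /v1/models endpoint of sglang's OpenAI-compatible server is assumed):

```python
# Hedged sketch of a local smoke test for the pinned requirements above.
# MODEL_PATH and PORT are placeholders, not values from this PR.
import subprocess
import sys
import time

import requests

MODEL_PATH = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
PORT = 30000

# Launch sglang's server entrypoint in a child process.
proc = subprocess.Popen(
    [sys.executable, "-m", "sglang.launch_server",
     "--model-path", MODEL_PATH, "--port", str(PORT)]
)

try:
    # Poll until the OpenAI-compatible surface answers (endpoint name assumed).
    for _ in range(60):
        try:
            resp = requests.get(f"http://127.0.0.1:{PORT}/v1/models", timeout=2)
            if resp.ok:
                print("server is up:", resp.json())
                break
        except requests.ConnectionError:
            pass
        time.sleep(5)
finally:
    proc.terminate()
```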
@phatvo9 I uploaded the model on prod; the upload was successful but predictions are failing. Looking at the prod logs, I got the error below.
inference_compute_info:
  cpu_limit: "4"
  cpu_memory: "24Gi"
Need to reduce cpu_memory, because at most 16Gi is available.
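For example, capping it at the stated limit would look roughly like this (the exact value is a judgment call, not something specified in this PR):

```yaml
inference_compute_info:
  cpu_limit: "4"
  cpu_memory: "16Gi"  # reduced from 24Gi; at most 16Gi is available
```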