# Qwen2-VL failed to infer multiple images (Server error: upper bound and larger bound inconsistent with step sign) #2888
Hi @AHEADer, thanks for opening this issue. I just attempted to reproduce on a machine with L4s. With a single L4 I was unable to run it; with the command below (sharded across two GPUs), the server starts:

```shell
text-generation-launcher \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --max-input-tokens 20000 \
    --max-batch-prefill-tokens 20000 \
    --max-total-tokens 20001 \
    --num-shard 2 \
    --cuda-graphs 0
```

Once started, the server responds to this script:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="hf_xxxxxx")

img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe these two images in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": img_url},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": img_url},
                },
            ],
        }
    ],
    top_p=None,
    temperature=None,
    max_tokens=500,
    stream=True,
    seed=None,
    stop=None,
    frequency_penalty=None,
    presence_penalty=None,
)

# Stream the generated tokens to stdout as they arrive.
for message in chat_completion:
    print(message.choices[0].delta.content, end="")
# Output: The image showcases the iconic Statue of Liberty and downtown New York City skyline over the vast ocean backdrop, conveying the majesty of the city and its significant landmark amid the expansive water body.
```

Would it be possible to try running the endpoint on a machine with more than one GPU? Additionally, I'd like to note that CUDA graphs must be set to 0 at the moment to avoid a separate bug that should be resolved soon with: #2802
I'm confused. One L4 has 24GB of memory and one L40S has 48GB. One L40S should behave the same as two L4s... Anyway, I'll try to use multiple GPUs on the server and give you feedback soon.
I tried to deploy the model with four L4s in Inference Endpoints, but still got the same error:
Hi @AHEADer, apologies for my mistake above; I misread the L40S as an L4 🤦♂️. Fortunately, I believe this issue has actually been resolved in a recent PR (related issue: #2839). Would you kindly try the latest Docker container and let me know if this resolves your issue? Thanks!
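For reference, a minimal sketch of running the latest container with the same launcher flags as above. The image tag, port mapping, and volume path are illustrative assumptions, not taken from this thread:

```shell
# Sketch: run the latest TGI container with the flags used earlier.
# Port 3000 is mapped to the container's port 80 to match the client
# script above; the cache path is an assumption for illustration.
docker run --gpus all --shm-size 1g -p 3000:80 \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --max-input-tokens 20000 \
    --max-batch-prefill-tokens 20000 \
    --max-total-tokens 20001 \
    --num-shard 2 \
    --cuda-graphs 0
```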
Is there a way to test TGI locally myself without the image? Due to some policy reasons, I cannot use your image. I compiled TGI locally, and it ran for a very long time until I got an OOM error. My local machine has 1400G of memory. I just run
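If the OOM happens when launching a locally built TGI, one option is to constrain memory use through launcher flags. A sketch, assuming the flags behave as documented for recent TGI versions; the specific values are illustrative, not from this thread:

```shell
# Sketch: reduce memory pressure when launching a local build.
# Token limits and the memory fraction are illustrative values.
text-generation-launcher \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --max-input-tokens 8192 \
    --max-batch-prefill-tokens 8192 \
    --max-total-tokens 8193 \
    --cuda-graphs 0 \
    --cuda-memory-fraction 0.8
```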
Hi drbh, I've successfully run inference on multiple images using the image you provided, thanks!
### System Info

### Information

### Tasks

### Reproduction
Use the code below to query the model; it fails:
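The original snippet did not survive in this copy of the issue. Based on the title and the maintainer's reproduction above, a request of roughly this shape (two images in one user message) triggers the error; the endpoint, API key, and image URL here are illustrative:

```python
# Sketch of the failing request: two images in a single chat message.
# Endpoint, API key, and image URL are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")
img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe these two images in one sentence."},
                {"type": "image_url", "image_url": {"url": img_url}},
                {"type": "image_url", "image_url": {"url": img_url}},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```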
You will see an error like this:
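The full traceback was not preserved in this copy; only the message quoted in the issue title is recoverable:

```
Server error: upper bound and larger bound inconsistent with step sign
```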
### Expected behavior
The request should return successfully.