Qwen2-VL failed to infer multiple images (Server error: upper bound and larger bound inconsistent with step sign) #2888

Open
2 of 4 tasks
AHEADer opened this issue Jan 7, 2025 · 6 comments

Comments

@AHEADer

AHEADer commented Jan 7, 2025

System Info

[screenshot of system info attached]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Use the code below to query the model; it fails:

# If necessary, install the openai Python library by running 
# pip install openai

from openai import OpenAI
client = OpenAI(
    base_url="https://xxxxx.us-east-1.aws.endpoints.huggingface.cloud/v1/", 
    api_key="hf_xxxxxx"
)
img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe these two images in one sentence."
            },
            {
                "type": "image_url",
                "image_url": {"url": img_url},
            },
            {
                "type": "image_url",
                "image_url": {"url": img_url},
            }
        ]
    }
],
    top_p=None,
    temperature=None,
    max_tokens=500,
    stream=True,
    seed=None,
    stop=None,
    frequency_penalty=None,
    presence_penalty=None
)
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

You will see an error like this:
[screenshot: Server error: upper bound and larger bound inconsistent with step sign]

Expected behavior

The request should complete successfully and return a description of the two images.

@drbh
Collaborator

drbh commented Jan 7, 2025

Hi @AHEADer, thanks for opening this issue. I just attempted to reproduce on a machine with L4s.

With a single L4 I was unable to run Qwen/Qwen2-VL-7B-Instruct with a 20K context; however, with two L4s I was able to start the server at that context length using the following command:

text-generation-launcher \
--model-id Qwen/Qwen2-VL-7B-Instruct \
--max-input-tokens 20000 \
--max-batch-prefill-tokens 20000 \
--max-total-tokens 20001 \
--num-shard 2 \
--cuda-graphs 0

Once started, the server responds to the following script:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="hf_xxxxxx")
img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe these two images in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": img_url},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": img_url},
                },
            ],
        }
    ],
    top_p=None,
    temperature=None,
    max_tokens=500,
    stream=True,
    seed=None,
    stop=None,
    frequency_penalty=None,
    presence_penalty=None,
)
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

# The image showcases the iconic Statue of Liberty and downtown New York City skyline over the vast ocean backdrop, conveying the majesty of the city and its significant landmark amid the expansive water body.#

Would it be possible to try running the endpoint on a machine with more than one GPU?

Additionally, I'd like to note that CUDA graphs must be set to 0 at the moment to avoid a separate bug, which should be resolved soon by #2802.
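For reference, here is a rough sketch of the same launch through the Docker container rather than the bare launcher; the port, shared-memory size, and volume path are illustrative assumptions, and the launcher flags simply mirror the command above:

# Sketch: pass the same launcher flags through the container entrypoint.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:3.0.1 \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --max-input-tokens 20000 \
    --max-batch-prefill-tokens 20000 \
    --max-total-tokens 20001 \
    --num-shard 2 \
    --cuda-graphs 0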

@AHEADer
Author

AHEADer commented Jan 8, 2025

I'm confused. One L4 has 24 GB of memory and one L40S has 48 GB, so one L40S should behave the same as two L4s... Anyway, I'll try using multiple GPUs on the server and give you feedback soon.

@AHEADer
Author

AHEADer commented Jan 8, 2025

I tried to deploy the model with 4 L4s in Inference Endpoints, but I still got the same error: "Server error: upper bound and larger bound inconsistent with step sign". Is this related to the transformers version? The container URI used online is ghcr.io/huggingface/text-generation-inference:3.0.1.

@drbh
Collaborator

drbh commented Jan 8, 2025

Hi @AHEADer, apologies for my mistake above; I misread the L40S as an L4 🤦‍♂️. Fortunately, I believe this issue has actually been resolved by a recent PR merged after v3.0.1.

Related issue: #2839
PR: #2840

Would you kindly try the latest Docker container ghcr.io/huggingface/text-generation-inference:sha-23bc38b, which should include this patch?
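Roughly, that means substituting the patched tag into the earlier launch, for example (a sketch only; the port and mount path are illustrative, and on Inference Endpoints the tag would instead go into the endpoint's container URI setting):

# Sketch: pull the patched image and pass the same launcher flags as before.
docker pull ghcr.io/huggingface/text-generation-inference:sha-23bc38b
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:sha-23bc38b \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --num-shard 2 \
    --cuda-graphs 0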

Please let me know if this resolves your issue! Thanks!

@AHEADer
Author

AHEADer commented Jan 9, 2025

Is there a way for me to test TGI locally without the image? Due to some policy reasons, I cannot use your image. I compiled TGI locally, and it ran for an extremely long time until I got an OOM error; my local machine has 1400 GB of memory. I just ran BUILD_EXTENSIONS=True make install, and it seems to be stuck building the extension modules forever...
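One thing I plan to try is capping the build parallelism, since I understand the CUDA extension compilation can exhaust memory when many jobs run at once; MAX_JOBS here is my assumption about what the underlying torch/ninja extension build honors, not a documented TGI option:

# Assumption: limit parallel compile jobs during the extension build to avoid OOM.
MAX_JOBS=4 BUILD_EXTENSIONS=True make install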

@AHEADer
Author

AHEADer commented Jan 9, 2025

Hi @drbh, I've successfully run inference with multiple images using the container image you provided, thanks!
