Qwen2-VL failed to infer multiple images (Server error: upper bound and larger bound inconsistent with step sign) #2888

Open
2 of 4 tasks
AHEADer opened this issue Jan 7, 2025 · 6 comments

Comments

@AHEADer

AHEADer commented Jan 7, 2025

System Info

[screenshot of system info attached]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Use the code below to query the model; it fails:

# If necessary, install the openai Python library by running 
# pip install openai

from openai import OpenAI
client = OpenAI(
    base_url="https://xxxxx.us-east-1.aws.endpoints.huggingface.cloud/v1/", 
    api_key="hf_xxxxxx"
)
img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe these two images in one sentence."
            },
            {
                "type": "image_url",
                "image_url": {"url": img_url},
            },
            {
                "type": "image_url",
                "image_url": {"url": img_url},
            }
        ]
    }
],
    top_p=None,
    temperature=None,
    max_tokens=500,
    stream=True,
    seed=None,
    stop=None,
    frequency_penalty=None,
    presence_penalty=None
)
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

You will see an error like this:
[screenshot: Server error: upper bound and larger bound inconsistent with step sign]

Expected behavior

The request should complete successfully and return a description of the two images.

@drbh
Collaborator

drbh commented Jan 7, 2025

Hi @AHEADer, thanks for opening this issue. I just attempted to reproduce on a machine with L4s.

With a single L4 I was unable to run Qwen/Qwen2-VL-7B-Instruct with a 20K context; however, with two L4s I was able to start the server at that context length using the following command:

text-generation-launcher \
--model-id Qwen/Qwen2-VL-7B-Instruct \
--max-input-tokens 20000 \
--max-batch-prefill-tokens 20000 \
--max-total-tokens 20001 \
--num-shard 2 \
--cuda-graphs 0

Once started, the server responds to the following script:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="hf_xxxxxx")
img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe these two images in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": img_url},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": img_url},
                },
            ],
        }
    ],
    top_p=None,
    temperature=None,
    max_tokens=500,
    stream=True,
    seed=None,
    stop=None,
    frequency_penalty=None,
    presence_penalty=None,
)
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

# The image showcases the iconic Statue of Liberty and downtown New York City skyline over the vast ocean backdrop, conveying the majesty of the city and its significant landmark amid the expansive water body.#

Would it be possible to try running the endpoint on a machine with more than one GPU?

Additionally, I'd like to note that CUDA graphs must be set to 0 at the moment to avoid a separate bug, which should be resolved soon by #2802.
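For reference, here is a rough sketch of the same launch through the Docker container rather than the bare launcher; the port, shared-memory size, and volume path are illustrative assumptions, and the launcher flags simply mirror the command above:

# Sketch: pass the same launcher flags through the container entrypoint.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:3.0.1 \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --max-input-tokens 20000 \
    --max-batch-prefill-tokens 20000 \
    --max-total-tokens 20001 \
    --num-shard 2 \
    --cuda-graphs 0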

@AHEADer
Author

AHEADer commented Jan 8, 2025

I'm confused. One L4 has 24 GB of memory and one L40S has 48 GB, so one L40S should behave the same as two L4s... Anyway, I'll try using multiple GPUs on the server and give you feedback soon.

@AHEADer
Author

AHEADer commented Jan 8, 2025

I tried to deploy the model with 4 L4s in Inference Endpoints, but I still got the same error: "Server error: upper bound and larger bound inconsistent with step sign". Is this related to the transformers version? The container URI used online is ghcr.io/huggingface/text-generation-inference:3.0.1.

@drbh
Collaborator

drbh commented Jan 8, 2025

Hi @AHEADer, apologies for my mistake above; I misread the L40S as an L4 🤦‍♂️. Fortunately, I believe this issue has actually been resolved by a recent PR merged after v3.0.1.

Related issue: #2839
PR: #2840

Would you kindly try the latest Docker container ghcr.io/huggingface/text-generation-inference:sha-23bc38b, which should include this patch?
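Roughly, that means substituting the patched tag into the earlier launch, for example (a sketch only; the port and mount path are illustrative, and on Inference Endpoints the tag would instead go into the endpoint's container URI setting):

# Sketch: pull the patched image and pass the same launcher flags as before.
docker pull ghcr.io/huggingface/text-generation-inference:sha-23bc38b
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:sha-23bc38b \
    --model-id Qwen/Qwen2-VL-7B-Instruct \
    --num-shard 2 \
    --cuda-graphs 0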

Please let me know if this resolves your issue! Thanks!

@AHEADer
Author

AHEADer commented Jan 9, 2025

Is there a way for me to test TGI locally without the image? Due to some policy reasons, I cannot use your image. I compiled TGI locally, and it ran for an extremely long time until I got an OOM error; my local machine has 1400 GB of memory. I just ran BUILD_EXTENSIONS=True make install, and it seems to be stuck building the extension modules forever...
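One thing I plan to try is capping the build parallelism, since I understand the CUDA extension compilation can exhaust memory when many jobs run at once; MAX_JOBS here is my assumption about what the underlying torch/ninja extension build honors, not a documented TGI option:

# Assumption: limit parallel compile jobs during the extension build to avoid OOM.
MAX_JOBS=4 BUILD_EXTENSIONS=True make install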

@AHEADer
Author

AHEADer commented Jan 9, 2025

Hi @drbh, I've successfully run inference with multiple images using the container image you provided, thanks!
