Querying PaliGemma VLMs #123

Open

kanishkamisra opened this issue Nov 21, 2024 · 2 comments

@kanishkamisra

My collaborators and I are trying to use your very useful containers to deploy and use Google's PaliGemma models on GCS/Vertex. What is the best way to query the model with images, especially when the images are stored locally? I see that there is an example showing this for Llama Vision, but it seems you have to pass in the images as URLs, which may not be feasible for us.

We're getting some success by doing something like this, but unsure if that's the right way:

import base64

image_path = "/PATH/rabbit.png"

# Read the local image and encode it as a base64 data URI
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"

# Embed the data URI in the prompt using markdown image syntax
output = deployed_model.predict(
    instances=[
        {
            "inputs": f"![]({image})What is the animal wearing?",
            "parameters": {"max_new_tokens": 100, "do_sample": False},
        }
    ]
)
#> space suit
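
(For reference, `deployed_model` above is the Vertex AI endpoint handle; a minimal sketch of how it might be obtained, assuming the `google-cloud-aiplatform` SDK with placeholder project, region, and endpoint ID values:)

from google.cloud import aiplatform

# Placeholder project, region, and endpoint ID -- substitute your own
aiplatform.init(project="my-project", location="us-central1")
deployed_model = aiplatform.Endpoint("1234567890123456789")

# deployed_model.predict(instances=[...]) can then be called as above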

Please let me know if you need more details! Any assistance would be much appreciated!

@alvarobartt
Member

Hi @kanishkamisra, sorry I only got around to replying just now!

But you are right, that's how it's supposed to be done programmatically if you don't have the URL of the image but just the image itself; see the Text Generation Inference documentation on Visual Language Models.

Here's the full example as shown in the documentation linked above:

import base64
from huggingface_hub import InferenceClient

# Point the client at the locally running TGI container
client = InferenceClient("http://127.0.0.1:8080")

# Read the local image and encode it as a base64 data URI
image_path = "rabbit.png"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"

# Stream the generated tokens one by one
for token in client.text_generation(prompt, max_new_tokens=10, stream=True):
    print(token)
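
If streaming isn't needed, the same method also returns the full generated text at once (a minimal variant of the block above, using the same client and prompt):

# Non-streaming: text_generation returns the generated text as a single string
generated = client.text_generation(prompt, max_new_tokens=10)
print(generated)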

I'll also try to add an example to this repository so that we have a working example covering the different alternatives!
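
In the meantime, one such alternative is to call the TGI REST API directly; a minimal sketch, assuming a TGI container running locally at http://127.0.0.1:8080 (on Vertex AI the same payload is instead forwarded through the endpoint's predict route):

import requests

# TGI's /generate route accepts the same inputs/parameters payload
# that each Vertex AI instance carries
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": prompt,  # the same data-URI prompt built above
        "parameters": {"max_new_tokens": 100, "do_sample": False},
    },
)
print(response.json()["generated_text"])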

alvarobartt self-assigned this Nov 28, 2024
@alvarobartt
Member

P.S. I just realised that you are missing the two trailing line breaks (i.e. \n\n), and PaliGemma is known to be quite sensitive to prompt formatting, so your code should look like the following instead:

import base64

image_path = "/PATH/rabbit.png"

with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"

# Note the trailing "\n\n" -- PaliGemma expects it after the question
output = deployed_model.predict(
    instances=[
        {
            "inputs": f"![]({image})What is the animal wearing?\n\n",
            "parameters": {"max_new_tokens": 100, "do_sample": False},
        }
    ]
)
#> space suit
