System Info

transformers version: 4.46.3

Who can help?

@ArthurZucker @itazap @amyeroberts @qubvel

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n What is the content of the image? ASSISTANT:"

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Load a demo image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the prompt + image, then decode the input ids back to text
batch = processor(text=[PROMPT], images=[image], padding=True, truncation=True, return_tensors="pt")
decoded_input = processor.decode(batch["input_ids"][0])
print(decoded_input)
```
Decoding the input shows that the single <image> token has been expanded into 576 identical tokens. This seems strange to me for two reasons:

1. At some point in October 2024, I was running this same kind of code with the then-current version of transformers, and based on my logs from that time, the <image> token was not expanded at all.
2. When I compare my inference results now against those from my old code (where <image> was not expanded), the output is very strange. About 50% of the time the generated text is the same as before, but other times it is just a series of numbers, or simply different and worse.

I guessed my previous transformers version was either git+https://github.com/huggingface/transformers@454a0f2efdf9f0d94b77ef08efabbdc6418c868d or 4.46.1. When I tried the former, AutoProcessor could not load llava-hf/llava-1.5-7b-hf; meanwhile, 4.46.1 produces the same behavior as my current environment.
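For reference, here is a quick way to confirm the expansion (a minimal sketch, continuing from the reproduction code above; it assumes the llava-hf checkpoint registers <image> as a single token in the tokenizer):

```python
# Count how many <image> placeholder ids appear in the processed input
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
n_image_tokens = (batch["input_ids"][0] == image_token_id).sum().item()
print(n_image_tokens)  # prints 576 on my current environment
```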
Expected behavior
Is it intended that <image> is replaced with 576 identical tokens? Can it be left unreplaced, as in the earlier version I used? If someone recognizes this change and can point me to how to bring my code back to the previous state (a single <image> only), that would be greatly appreciated.
Thanks!
@npnkhoi hey! Yes, this is the new behavior: the placeholder is meant to be replaced by image tokens when calling the model. Compared to the previous versions, this method lets us simply replace each placeholder token with its actual embedding, so we end up with as many tokens as there will be image embeddings (for llava-1.5, the CLIP ViT-L/14-336 backbone yields a 24 × 24 patch grid, i.e. 576 embeddings per image).
If you do not want to see them when decoding, pass skip_special_tokens=True.
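For example (this assumes the expanded <image> placeholders are registered as special tokens, which is the case for the llava-hf checkpoints):

```python
# Decoding with skip_special_tokens=True drops the 576 expanded <image> ids
decoded_input = processor.decode(batch["input_ids"][0], skip_special_tokens=True)
print(decoded_input)  # the prompt text without the repeated image placeholders
```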
Thanks for the answer, and sorry for the wrong label.

One more question: what is the latest transformers version before this behavior was introduced? I think this behavior, combined with my specific hyperparameters, somehow affected the performance of my models.
@npnkhoi for that you would need to load the previous commit of the model from the Hub by specifying the revision, because the change is not only in the transformers code. Also, we will stop supporting the old version in transformers in the next v4.48 release, so I'd recommend using the new version.

For the revision, the code looks as below; use the correct commit hash from the Hub model repo.
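Something along these lines (a sketch; the revision value is a placeholder for the actual commit hash from the llava-hf/llava-1.5-7b-hf repo history):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# "<commit-hash>" is a placeholder; pick the real hash from the model repo's commit history
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", revision="<commit-hash>")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", revision="<commit-hash>")
```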
> I think this behavior, combined with my specific hyperparameters, somehow affected the performance of my models.
Hmm, this is interesting. If you can share more details, I might be able to help you find the reason. It might also be because we haven't yet removed the old code, which will happen in #34502. After that PR it should be identical to the old llava, but for now there are minor differences due to having to choose between the old and new code paths.