System Info

transformers version: 4.46.3

Who can help?

@ArthurZucker @itazap @amyeroberts @qubvel

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n What is the content of the image? ASSISTANT:"

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Load a demo image from the Hub
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the prompt + image, then decode the input ids back to text
batch = processor(text=[PROMPT], images=[image], padding=True, truncation=True, return_tensors="pt")
decoded_input = processor.decode(batch["input_ids"][0])
print(decoded_input)
```
Decoding the input shows that the single <image> token has been expanded into 576 identical tokens. This seems strange to me for two reasons:

1. At some point in October 2024, I was running this same kind of code with the then-current version of transformers, and based on my logs from that time, the <image> token was not expanded at all.
2. When I compare my inference results now against those from my old code (where <image> was not expanded), the output is very strange. About 50% of the time the generated text is the same as before, but other times it is just a series of numbers, or simply different and worse.

I guessed my previous transformers version was either git+https://github.com/huggingface/transformers@454a0f2efdf9f0d94b77ef08efabbdc6418c868d or 4.46.1. When I tried the former, AutoProcessor could not load llava-hf/llava-1.5-7b-hf; meanwhile, 4.46.1 produces the same behavior as my current environment.
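For reference, here is a quick way to confirm the expansion (a minimal sketch, continuing from the reproduction code above; it assumes the llava-hf checkpoint registers <image> as a single token in the tokenizer):

```python
# Count how many <image> placeholder ids appear in the processed input
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
n_image_tokens = (batch["input_ids"][0] == image_token_id).sum().item()
print(n_image_tokens)  # prints 576 on my current environment
```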
Expected behavior
Is it intended that <image> is replaced with 576 identical tokens? Can it be left unreplaced, as in the earlier version I used? If someone recognizes this change and can point me to how to bring my code back to the previous state (a single <image> only), that would be greatly appreciated.
Thanks!
@npnkhoi hey! Yes, this is the new behavior: the placeholder is meant to be replaced by image tokens when calling the model. Compared to the previous versions, this method lets us simply replace each placeholder token with its actual embedding, so we end up with as many tokens as there will be image embeddings (for llava-1.5, the CLIP ViT-L/14-336 backbone yields a 24 × 24 patch grid, i.e. 576 embeddings per image).
If you do not want to see them when decoding, pass skip_special_tokens=True.
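For example (this assumes the expanded <image> placeholders are registered as special tokens, which is the case for the llava-hf checkpoints):

```python
# Decoding with skip_special_tokens=True drops the 576 expanded <image> ids
decoded_input = processor.decode(batch["input_ids"][0], skip_special_tokens=True)
print(decoded_input)  # the prompt text without the repeated image placeholders
```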
Thanks for the answer, and sorry for the wrong label.

One more question: what is the latest transformers version before this behavior was introduced? I think this behavior, combined with my specific hyperparameters, somehow affected the performance of my models.
@npnkhoi for that you would need to load the previous commit of the model from the Hub by specifying the revision, because the change is not only in the transformers code. Also, we will stop supporting the old version in transformers in the next v4.48 release, so I'd recommend using the new version.

For the revision, the code looks as below; use the correct commit hash from the Hub model repo.
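Something along these lines (a sketch; the revision value is a placeholder for the actual commit hash from the llava-hf/llava-1.5-7b-hf repo history):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# "<commit-hash>" is a placeholder; pick the real hash from the model repo's commit history
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", revision="<commit-hash>")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", revision="<commit-hash>")
```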
> I think this behavior, combined with my specific hyperparameters, somehow affected the performance of my models.
Hmm, this is interesting. If you can share more details, I might be able to help you find the reason. It might also be because we haven't yet removed the old code, which will happen in #34502. After that PR it should be identical to the old llava, but for now there are minor differences due to having to choose between the old and new code paths.