
LlavaProcessor replaces <image> with 576 <image> tokens. Is this normal? #34934

Closed
npnkhoi opened this issue Nov 26, 2024 · 3 comments
Labels: bug, Usage (General questions about the library)

Comments

npnkhoi commented Nov 26, 2024

System Info

  • transformers version: 4.46.3
  • Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
  • Python version: 3.11.4
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: Quadro RTX 6000

Who can help?

@ArthurZucker @itazap @amyeroberts @qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n What is the content of the image? ASSISTANT:"
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Fetch a sample image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the prompt and image, then decode the input ids back
# to text to inspect what the model actually receives.
batch = processor(text=[PROMPT], images=[image], padding=True, truncation=True, return_tensors="pt")
decoded_input = processor.decode(batch['input_ids'][0])
print(decoded_input)

This code decodes the processed input and shows that the single <image> token has been expanded into 576 <image> tokens. This seems strange to me for two reasons:

  • At some point in October 2024, I ran this same kind of code with the then-current version of transformers. Based on my logs from that time, the <image> token was not expanded at all.
  • When I compare my inference results now against those from my old runs (where <image> was not expanded), the current output is very strange. About 50% of the time the generated text is the same as before, but other times it is just a series of numbers, or simply different and worse.

I guessed my previous transformers version was either git+https://github.com/huggingface/transformers@454a0f2efdf9f0d94b77ef08efabbdc6418c868d or 4.46.1. When I tried the former, AutoProcessor could not load llava-hf/llava-1.5-7b-hf. Meanwhile, 4.46.1 produces the same phenomenon as my current environment.

Expected behavior

Is it intended that <image> is replaced with 576 identical tokens? Is it possible for it not to be replaced, as in the earlier version I was using? If someone recognizes this change and can point me to how to restore the previous behavior (a single <image> token only), that would be greatly appreciated.

Thanks!

@npnkhoi npnkhoi added the bug label Nov 26, 2024
zucchini-nlp (Member) commented:

@npnkhoi hey! Yes, this is the new behavior: the placeholder is meant to be expanded into image tokens when calling the model. Compared to previous versions, this method lets us simply replace each placeholder token with its actual image embedding, so the prompt holds exactly as many tokens as there will be image embeddings.
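
As a quick sanity check, a minimal sketch reusing processor and batch from the reproduction above counts the inserted placeholders; the 576 corresponds to LLaVA-1.5's 24×24 grid of vision patches (CLIP ViT-L/14 at 336px resolution):

# Count how many <image> placeholder tokens the processor inserted.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
n_image_tokens = (batch["input_ids"][0] == image_token_id).sum().item()
print(n_image_tokens)  # expected: 576, one per vision patch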

If you do not want to see them when decoding, please pass skip_special_tokens=True.
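
For example, a minimal sketch reusing batch from the reproduction above (skip_special_tokens is a standard argument of decode):

# Decode without special tokens; the repeated <image> placeholders
# (and tokens such as <s>) are dropped from the printed string.
clean_text = processor.decode(batch["input_ids"][0], skip_special_tokens=True)
print(clean_text)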

Closing as resolved, as this is not a bug

@ArthurZucker ArthurZucker added the Usage General questions about the library label Nov 26, 2024
npnkhoi (Author) commented Nov 26, 2024

Thanks for the answer and sorry for the wrong label.

One more question: what is the latest transformers version before this behavior was introduced? I think this behavior, combined with my specific hyperparameters, somehow affected the performance of my models.

zucchini-nlp (Member) commented:

@npnkhoi for that you would need to load a previous commit of the model from the Hub by specifying the revision, because the change does not live in the transformers code alone. Also, we will stop supporting the old version in transformers in the upcoming v4.48 release, so I'd recommend using the new version.

To pin a revision, the code looks as below; use the correct commit hash from the Hub model repo:

Model.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", revision="2f7f20bda2e7af8e54438fec01ac5214e9ac6f92")
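
Applied to the model from this issue, a sketch could look like the following; the revision string below is a placeholder, not a real commit hash, so substitute one from the commit history of llava-hf/llava-1.5-7b-hf on the Hub, and pin the processor to the same revision:

from transformers import LlavaForConditionalGeneration, AutoProcessor

REVISION = "<commit-hash-from-hub>"  # placeholder: copy a real hash from the repo's commit history

# Pin both the processor and the model weights/config to the same old revision.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", revision=REVISION)
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", revision=REVISION)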

> I think this behavior, combined with my specific hyperparams, somehow affected the performance of my models.

Hmm, this is interesting; if you can share more details, I might be able to help you find the reason. It might also be because we haven't yet removed the old code path, which will be done in #34502. After that PR the behavior should be identical to the old LLaVA, but for now there are minor differences due to having to choose between the old and new versions.
