VLMs: fix number of image tokens #34332

Merged · 10 commits into huggingface:main on Oct 30, 2024

Conversation

@zucchini-nlp (Member) commented Oct 23, 2024

What does this PR do?

Fixes #34284 (comment). Our tests didn't catch this because the tester always uses one image and one text input. We could try building a mixture of different images/texts, but that would open a whole new set of issues: some tests slice on batch size, and we would lose the match between the number of images and the number of image tokens.

Also fixes #34379
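
For context, a minimal sketch of the kind of mixed batch that the current tester never builds (hypothetical checkpoint, prompts, and dummy images; not part of this PR):

```python
from PIL import Image
from transformers import AutoProcessor

# Hypothetical checkpoint and dummy images, only to illustrate a mixed batch.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
img_a, img_b, img_c = (Image.new("RGB", (336, 336)) for _ in range(3))

prompts = [
    "USER: <image>\nWhat is shown here? ASSISTANT:",
    "USER: <image>\n<image>\nCompare these two images. ASSISTANT:",
]
# Three images across a batch of two prompts: counting image tokens only in the
# first sample (1 image) no longer matches the total number of image features (3 images).
inputs = processor(text=prompts, images=[img_a, img_b, img_c], padding=True, return_tensors="pt")
```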

@ArthurZucker (Collaborator) left a comment:

Thanks, makes sense, we might want a small vlm test for these (to ensure new models properly raise this!)
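
For illustration, such a guard test might look roughly like this (a hedged, unittest-style sketch with hypothetical fixture names; not the exact test added by this PR):

```python
import torch

def test_mismatched_image_tokens_raise(self):
    # Hypothetical fixtures: a small VLM, its processor, and a dummy image.
    model, processor, image = self.get_small_model_processor_and_image()
    inputs = processor(text="USER: <image>\nDescribe. ASSISTANT:", images=image, return_tensors="pt")
    # Duplicate the pixel values so the number of image features no longer
    # matches the number of image placeholder tokens in input_ids.
    inputs["pixel_values"] = torch.cat([inputs["pixel_values"]] * 2, dim=0)
    with self.assertRaises(ValueError):
        model(**inputs)
```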

Collaborator:

rebasing should fix this!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment:

Thanks for adding the test as well

```diff
@@ -523,8 +523,9 @@ def forward(
         # TODO: @raushan retain only the new behavior after v4.47
         else:
-            n_image_tokens = (input_ids == self.config.image_token_index).sum(dim=-1)[0].item()
             n_image_features = image_features.shape[1]
+            n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
```
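
To make the change concrete, a small standalone sketch (toy tensors, not the model code) of the difference between the old per-first-sample count and the new batch-wide count:

```python
import torch

image_token_index = 32000
input_ids = torch.tensor([
    [32000,     1, 2, 3],  # sample 0: one image token
    [32000, 32000, 4, 5],  # sample 1: two image tokens
])

old_count = (input_ids == image_token_index).sum(dim=-1)[0].item()  # 1: only sample 0
new_count = (input_ids == image_token_index).sum().item()           # 3: whole batch
print(old_count, new_count)
```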
Collaborator:

I think this slows down inference, no? `.item()` induces a CUDA-CPU sync!
Anyway, not the point, but thanks for the fix.
Does n_image_features take padding into account? (Are the image features not padded to the batch?)

@zucchini-nlp (Member, Author):

Yes, if the image is padded it is usually unpadded before we get to this point, e.g. in llava-next. Hm, I don't think the slowdown will be drastic, especially since we only need to run the check once per input, at the pre-fill stage.

@zucchini-nlp zucchini-nlp merged commit 913330c into huggingface:main Oct 30, 2024
17 checks passed
@zucchini-nlp (Member, Author):

Might as well go for next patch release @ArthurZucker ?

@ArthurZucker (Collaborator):

Yep we can do that!

ArthurZucker pushed a commit that referenced this pull request Nov 5, 2024
* fix

* fix tests

* add tests

* style

* style

* fix qwen after rebase

* fix video llava
@agadetsky:

@ArthurZucker hi!
It seems that this bug #34625 might be related to this PR.

@DarkLight1337:

Thanks for fixing #34379! However, I'm still unable to use LLaVA-NeXT with text-only input. See the failure in https://buildkite.com/vllm/ci-aws/builds/10881#01930542-f9c4-4f8e-9c23-c0e5f5ef0141 which occurs when input_ids are given but not pixel_values.

@pspdada commented Nov 9, 2024

This pull request might not be working as expected. cc: #30962 (comment)

2015aroras pushed a commit to 2015aroras/transformers that referenced this pull request Nov 15, 2024
@ArthurZucker (Collaborator):

@zucchini-nlp is working on fixes; #34332 was merged to address this!

@yurkoff-mv commented Dec 5, 2024

The error is reproducible with the Pixtral-12B model on Transformers v4.46.3. However, on Transformers v4.45.2 everything works.

```python
from io import BytesIO
from PIL import Image

with open('./data/dog.jpg', 'rb') as fr:
    image_dog = fr.read()

with open('./data/mountain.jpg', 'rb') as fr:
    image_mountain = fr.read()

images = [image_dog, image_mountain]
images = [Image.open(BytesIO(image)) for image in images]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "content": "What is this animal?"},
            {"type": "image"},
            {"type": "text", "content": "Can it live here?"},
            {"type": "image"},
        ],
    }
]

# self.processor, self.model and self.device are set up elsewhere in this class
text_prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = self.processor(text=[text_prompt],
                        images=images,
                        padding=True,
                        return_tensors="pt",
                        ).to(self.device)

output_ids = self.model.generate(**inputs,
                                 do_sample=False,
                                 max_new_tokens=2048,
                                 )
```

Error:

```
Image features and image tokens do not match: tokens: 248, features 494
```

@zucchini-nlp (Member, Author):

@yurkoff-mv this is more about the Pixtral model and probably not related to this PR, because this PR modifies processing code in LLaVA models only. Pixtral has its own processor which is responsible for expanding input text with the correct number of image tokens. Would you mind opening a new issue?

cc @Rocketknight1 in case you've encountered this error already since you've been working on Pixtral processing code

@yurkoff-mv:

@zucchini-nlp, thank you for the reply.
I use the LLaVA model. The code worked in Transformers v4.45.2.

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# model_path and bnb_config are defined elsewhere in this setup
processor = AutoProcessor.from_pretrained(model_path)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
model = LlavaForConditionalGeneration.from_pretrained(model_path,
                                                      quantization_config=bnb_config,
                                                      device_map='auto',
                                                      attn_implementation="flash_attention_2",
                                                      )
```

@zucchini-nlp (Member, Author) commented Dec 5, 2024

@yurkoff-mv yes, but the processor is different, and the issue here stems from the processing code. The processor is responsible for inferring the correct number of image tokens and adding them to the input ids, while the model only checks that the number of encoded image features and the number of image tokens are the same.

I'll look into that, but a new issue of its own is a nicer way to track it, since it is not related to this PR.

@zucchini-nlp (Member, Author):

Found it, related to #34204, which apparently left some edge cases. In your script you can work around it by passing the text as a str, not a list, for example:
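
A minimal sketch of that workaround, reusing the objects from the snippet above (the reporter's self.processor, text_prompt, images, and self.device):

```python
inputs = self.processor(text=text_prompt,   # plain str instead of [text_prompt]
                        images=images,
                        padding=True,
                        return_tensors="pt",
                        ).to(self.device)
```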

We'll try to fix it soon

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
@DarkLight1337:

> Thanks for fixing #34379! However, I'm still unable to use LLaVA-NeXT with text-only input. See the failure in https://buildkite.com/vllm/ci-aws/builds/10881#01930542-f9c4-4f8e-9c23-c0e5f5ef0141 which occurs when input_ids are given but not pixel_values.

I'm still getting this problem on v4.46.3.

@zucchini-nlp (Member, Author):

@DarkLight1337 yes, the text-only input is currently not supported and should be fixed by #34502 :)

BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024