VLMs: fix number of image tokens #34332

Merged · 10 commits into huggingface:main on Oct 30, 2024

Conversation

@zucchini-nlp (Member) commented Oct 23, 2024

What does this PR do?

Fixes #34284 (comment). Our tests didn't catch this because the tester always uses one image and one text input. We could try building a mixture of different images/texts, but that would open a whole new set of issues: some tests slice on batch size, and we would lose the match between the number of images and the number of image tokens.

Also fixes #34379
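
For context, a minimal sketch of the kind of mixed batch that the current tester never builds (hypothetical checkpoint, prompts, and dummy images; not part of this PR):

```python
from PIL import Image
from transformers import AutoProcessor

# Hypothetical checkpoint and dummy images, only to illustrate a mixed batch.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
img_a, img_b, img_c = (Image.new("RGB", (336, 336)) for _ in range(3))

prompts = [
    "USER: <image>\nWhat is shown here? ASSISTANT:",
    "USER: <image>\n<image>\nCompare these two images. ASSISTANT:",
]
# Three images across a batch of two prompts: counting image tokens only in the
# first sample (1 image) no longer matches the total number of image features (3 images).
inputs = processor(text=prompts, images=[img_a, img_b, img_c], padding=True, return_tensors="pt")
```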

@ArthurZucker (Collaborator) left a comment:

Thanks, makes sense, we might want a small vlm test for these (to ensure new models properly raise this!)
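
For illustration, such a guard test might look roughly like this (a hedged, unittest-style sketch with hypothetical fixture names; not the exact test added by this PR):

```python
import torch

def test_mismatched_image_tokens_raise(self):
    # Hypothetical fixtures: a small VLM, its processor, and a dummy image.
    model, processor, image = self.get_small_model_processor_and_image()
    inputs = processor(text="USER: <image>\nDescribe. ASSISTANT:", images=image, return_tensors="pt")
    # Duplicate the pixel values so the number of image features no longer
    # matches the number of image placeholder tokens in input_ids.
    inputs["pixel_values"] = torch.cat([inputs["pixel_values"]] * 2, dim=0)
    with self.assertRaises(ValueError):
        model(**inputs)
```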

Collaborator:

rebasing should fix this!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment:

Thanks for adding the test as well

```diff
@@ -523,8 +523,9 @@ def forward(
         # TODO: @raushan retain only the new behavior after v4.47
         else:
-            n_image_tokens = (input_ids == self.config.image_token_index).sum(dim=-1)[0].item()
             n_image_features = image_features.shape[1]
+            n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
```
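
To make the change concrete, a small standalone sketch (toy tensors, not the model code) of the difference between the old per-first-sample count and the new batch-wide count:

```python
import torch

image_token_index = 32000
input_ids = torch.tensor([
    [32000,     1, 2, 3],  # sample 0: one image token
    [32000, 32000, 4, 5],  # sample 1: two image tokens
])

old_count = (input_ids == image_token_index).sum(dim=-1)[0].item()  # 1: only sample 0
new_count = (input_ids == image_token_index).sum().item()           # 3: whole batch
print(old_count, new_count)
```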
Collaborator:

I think this slows down inference, no? `.item()` induces a CUDA-CPU sync!
Anyway, not the point, but thanks for the fix.
Does n_image_features take padding into account? (Are the image features not padded to the batch?)

@zucchini-nlp (Member, Author):

Yes, if the image is padded it is usually unpadded before we get to this point, e.g. in llava-next. Hm, I don't think the slowdown will be drastic, especially since we only need to run the check once per input, at the pre-fill stage.

@zucchini-nlp zucchini-nlp merged commit 913330c into huggingface:main Oct 30, 2024
17 checks passed
@zucchini-nlp (Member, Author):

Might as well go for next patch release @ArthurZucker ?

@ArthurZucker (Collaborator):

Yep we can do that!

ArthurZucker pushed a commit that referenced this pull request Nov 5, 2024
* fix

* fix tests

* add tests

* style

* style

* fix qwen after rebase

* fix video llava
@agadetsky:

@ArthurZucker hi!
It seems that this bug #34625 might be related to this PR.

@DarkLight1337:

Thanks for fixing #34379! However, I'm still unable to use LLaVA-NeXT with text-only input. See the failure in https://buildkite.com/vllm/ci-aws/builds/10881#01930542-f9c4-4f8e-9c23-c0e5f5ef0141 which occurs when input_ids are given but not pixel_values.

@pspdada commented Nov 9, 2024

This pull request might not be working as expected. cc: #30962 (comment)

2015aroras pushed a commit to 2015aroras/transformers that referenced this pull request Nov 15, 2024
@ArthurZucker (Collaborator):

@zucchini-nlp is working on fixes; #34332 was merged to address this!

@yurkoff-mv commented Dec 5, 2024

The error is reproducible with the Pixtral-12B model on Transformers v4.46.3. However, on Transformers v4.45.2 everything works.

```python
from io import BytesIO
from PIL import Image

with open('./data/dog.jpg', 'rb') as fr:
    image_dog = fr.read()

with open('./data/mountain.jpg', 'rb') as fr:
    image_mountain = fr.read()

images = [image_dog, image_mountain]
images = [Image.open(BytesIO(image)) for image in images]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "content": "What is this animal?"},
            {"type": "image"},
            {"type": "text", "content": "Can it live here?"},
            {"type": "image"},
        ],
    }
]

# self.processor, self.model and self.device are set up elsewhere in this class
text_prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = self.processor(text=[text_prompt],
                        images=images,
                        padding=True,
                        return_tensors="pt",
                        ).to(self.device)

output_ids = self.model.generate(**inputs,
                                 do_sample=False,
                                 max_new_tokens=2048,
                                 )
```

Error:

```
Image features and image tokens do not match: tokens: 248, features 494
```

@zucchini-nlp (Member, Author):

@yurkoff-mv this is more about the Pixtral model and probably not related to this PR, because this PR modifies processing code in LLaVA models only. Pixtral has its own processor which is responsible for expanding input text with the correct number of image tokens. Would you mind opening a new issue?

cc @Rocketknight1 in case you've encountered this error already since you've been working on Pixtral processing code

@yurkoff-mv:

@zucchini-nlp, thank you for the reply.
I use the LLaVA model. The code worked in Transformers v4.45.2.

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# model_path and bnb_config are defined elsewhere in this setup
processor = AutoProcessor.from_pretrained(model_path)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
model = LlavaForConditionalGeneration.from_pretrained(model_path,
                                                      quantization_config=bnb_config,
                                                      device_map='auto',
                                                      attn_implementation="flash_attention_2",
                                                      )
```

@zucchini-nlp (Member, Author) commented Dec 5, 2024

@yurkoff-mv yes, but the processor is different, and the issue here stems from the processing code. The processor is responsible for inferring the correct number of image tokens and adding them to the input ids, while the model only checks that the number of encoded image features and the number of image tokens are the same.

I'll look into that, but a new issue of its own is a nicer way to track it, since it is not related to this PR.

@zucchini-nlp (Member, Author):

Found it, related to #34204, which apparently left some edge cases. In your script you can work around it by passing the text as a str, not a list, for example:
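
A minimal sketch of that workaround, reusing the objects from the snippet above (the reporter's self.processor, text_prompt, images, and self.device):

```python
inputs = self.processor(text=text_prompt,   # plain str instead of [text_prompt]
                        images=images,
                        padding=True,
                        return_tensors="pt",
                        ).to(self.device)
```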

We'll try to fix it soon

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
@DarkLight1337:

> Thanks for fixing #34379! However, I'm still unable to use LLaVA-NeXT with text-only input. See the failure in https://buildkite.com/vllm/ci-aws/builds/10881#01930542-f9c4-4f8e-9c23-c0e5f5ef0141 which occurs when input_ids are given but not pixel_values.

I'm still getting this problem on v4.46.3.

@zucchini-nlp (Member, Author):

@DarkLight1337 yes, the text-only input is currently not supported and should be fixed by #34502 :)

BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024