VLMs: fix number of image tokens #34332
Conversation
Thanks, makes sense. We might want a small VLM test for these (to ensure new models properly raise this!)
rebasing should fix this!
Thanks for adding the test as well
```diff
@@ -523,8 +523,9 @@ def forward(
     # TODO: @raushan retain only the new behavior after v4.47
     else:
-        n_image_tokens = (input_ids == self.config.image_token_index).sum(dim=-1)[0].item()
+        n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
         n_image_features = image_features.shape[1]
```
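For illustration, a minimal sketch (the token id is invented; the real one comes from the model config) of why the old per-row reduction undercounts on batched inputs, which is what this diff fixes:

```python
import torch

IMAGE_TOKEN_INDEX = 32000  # hypothetical id standing in for config.image_token_index

# Batch of two sequences with different numbers of image placeholder tokens.
input_ids = torch.tensor([
    [1, IMAGE_TOKEN_INDEX, IMAGE_TOKEN_INDEX, 5, 6],  # 2 image tokens
    [1, IMAGE_TOKEN_INDEX, 7, 8, 9],                  # 1 image token
])

# Old reduction: counts per row, then keeps only the first row's count.
old_count = (input_ids == IMAGE_TOKEN_INDEX).sum(dim=-1)[0].item()  # -> 2
# New reduction: counts every image token across the whole batch.
new_count = (input_ids == IMAGE_TOKEN_INDEX).sum().item()           # -> 3
print(old_count, new_count)  # 2 3
```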
I think this slows down inference, no? `.item()` induces a CUDA/CPU sync! Anyway, not the point, thanks for the fix. Does `n_image_features` take padding into account? (Are image features not padded to the batch?)
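For context on the sync concern, a small sketch (shapes and token id arbitrary) of what `.item()` does on CUDA:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = torch.randint(0, 32001, (2, 2048), device=device)

# On CUDA this launches a kernel asynchronously and returns right away.
n_image_tokens = (input_ids == 32000).sum()

# .item() copies the scalar back to the host, which blocks the CPU until
# all queued GPU work has finished -- the CUDA/CPU sync mentioned above.
n = n_image_tokens.item()
print(n)
```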
Yes, if the image is padded it is usually unpadded before we reach this point, e.g. in llava-next. Hmm, I don't think the slowdown will be drastic, especially since the check runs once per input in the prefill stage.
Might as well go for the next patch release, @ArthurZucker?
Yep we can do that!
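For readers unfamiliar with the unpadding mentioned above, a hedged sketch of the idea (simplified; names and details are illustrative, not the exact llava-next helper): crop away the feature rows or columns that correspond to the padding added when the image was letterboxed to a square input.

```python
import torch

def unpad_features(features: torch.Tensor, original_size: tuple) -> torch.Tensor:
    """Crop a (dim, H, W) feature grid back to the original aspect ratio.

    Illustrative only: drops the rows or columns that correspond to the
    symmetric padding added when the image was letterboxed to a square.
    """
    orig_h, orig_w = original_size
    cur_h, cur_w = features.shape[1:]
    if orig_w / orig_h > cur_w / cur_h:           # image was padded top/bottom
        new_h = int(orig_h * (cur_w / orig_w))
        pad = (cur_h - new_h) // 2
        return features[:, pad : cur_h - pad, :]
    else:                                         # image was padded left/right
        new_w = int(orig_w * (cur_h / orig_h))
        pad = (cur_w - new_w) // 2
        return features[:, :, pad : cur_w - pad]

grid = torch.randn(64, 24, 24)                    # square feature grid
print(unpad_features(grid, (480, 640)).shape)     # -> torch.Size([64, 18, 24])
```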
* fix
* fix tests
* add tests
* style
* style
* fix qwen after rebase
* fix video llava
@ArthurZucker hi!
Thanks for fixing #34379! However, I'm still unable to use LLaVA-NeXT with text-only input. See the failure in https://buildkite.com/vllm/ci-aws/builds/10881#01930542-f9c4-4f8e-9c23-c0e5f5ef0141, which occurs when …
This pull request might not be working as expected. cc: #30962 (comment)
@zucchini-nlp is working on fixes; #34332 was merged to address this!
The error is reproducible with the Pixtral-12B model on Transformers v4.46.3. However, on Transformers v4.45.2 everything works.
Error: …
@yurkoff-mv this is more about the Pixtral model and probably not related to this PR, since this PR modifies processing code in LLaVA models only. Pixtral has its own processor, which is responsible for expanding the input text with the correct number of image tokens. Would you mind opening a new issue? cc @Rocketknight1 in case you've encountered this error already, since you've been working on the Pixtral processing code.
@zucchini-nlp, thank you for the reply.
@yurkoff-mv yes, but the processor is different, and the issue here stems from processing code. The processor is responsible for inferring the correct number of image tokens and adding them to the input ids, while the model only checks that the lengths of the encoded images and the image tokens are the same. I'll look into that, but a new issue of its own is a nice way to track it, as it is not related to this PR.
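To make that division of labor concrete, a hypothetical sketch (placeholder string and patch count invented) of the processor-side expansion:

```python
# Hypothetical sketch: the processor replaces each "<image>" placeholder in
# the prompt with as many image tokens as the vision encoder will emit
# features, so the model-side length check can pass.
def expand_image_tokens(text: str, num_patches: int, image_token: str = "<image>") -> str:
    return text.replace(image_token, image_token * num_patches)

prompt = "USER: <image>\nWhat is in the picture? ASSISTANT:"
expanded = expand_image_tokens(prompt, num_patches=576)  # e.g. a 24x24 patch grid
assert expanded.count("<image>") == 576
```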
Found it; this is related to #34204, which apparently left some edge cases. In your script you can work around it by passing text as … We'll try to fix it soon.
I'm still getting this problem on v4.46.3.
@DarkLight1337 yes, text-only input is currently not supported and should be fixed by #34502 :)
What does this PR do?
Fixes #34284 (comment). Our tests didn't catch this because the tester always uses one image + one text input. We could try a mixture of different images/texts, but that would open a whole new story of issues where some tests slice on batch size and we'd lose the match between the number of images and the number of image tokens.
Also fixes #34379
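A minimal sketch (all names illustrative, not the exact test added in this PR) of the kind of check the new tests exercise: feed a batch whose placeholder-token count disagrees with the number of encoded image features and expect an error to be raised.

```python
import torch

IMAGE_TOKEN = 9  # stand-in for config.image_token_index

def check_image_tokens(input_ids: torch.Tensor, image_features: torch.Tensor) -> None:
    n_image_tokens = (input_ids == IMAGE_TOKEN).sum().item()
    n_image_features = image_features.shape[0] * image_features.shape[1]
    if n_image_tokens != n_image_features:
        raise ValueError(
            f"Image features and image tokens do not match: "
            f"{n_image_tokens} tokens vs {n_image_features} features"
        )

# 3 placeholder tokens in the batch, but 2 images x 2 patches = 4 features.
ids = torch.tensor([[1, IMAGE_TOKEN, IMAGE_TOKEN, 2], [1, IMAGE_TOKEN, 3, 4]])
feats = torch.zeros(2, 2, 16)  # (num_images, patches_per_image, hidden_dim)
try:
    check_image_tokens(ids, feats)
except ValueError as e:
    print(e)  # mismatch detected, as the PR's tightened check intends
```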