
Fix the structure of images in PixtralProcessor #35107

Closed · wants to merge 7 commits

Conversation

Rocketknight1
Member

@Rocketknight1 Rocketknight1 commented Dec 5, 2024

The PixtralProcessor has some issues regarding correct nesting depth of inputs. If we are fully explicit about input nesting, then text should be List[str] and images should be List[List[Image]]. This is because each sample only has one text, but can have multiple images.

The start of the Pixtral processor code handles cases where users don't supply fully nested inputs. However, it uses the heuristic that if the user passes List[str] for text and List[Image] for images, this indicates one image per sample. This is very confusing for users, especially because the output changes when batch_size == 1 depending on whether that single input is passed as str or [str].

With this PR, we avoid that assumption. If the user supplies a single image or a flat list of images and batch_size == 1, we assign them all to the first (and only) sample. If batch_size > 1, we throw an error asking them to supply an explicit list of lists instead. This seems much safer!
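The normalization rule described above can be sketched as follows. This is an illustrative standalone helper, not the actual PixtralProcessor code; the function name and error message are hypothetical:

```python
# Illustrative sketch of the nesting rule: text is List[str] (one prompt per
# sample), images should be List[List[Image]] (one sub-list per sample).
# A flat list of images is only unambiguous when there is exactly one sample.
def normalize_images(images, batch_size):
    """Normalize `images` into a list of lists, one sub-list per text sample."""
    if images is None:
        return None
    if not isinstance(images, list):
        # A single bare image: wrap it into a flat list first.
        images = [images]
    if images and not isinstance(images[0], list):
        # Flat list of images: assign all of them to the single sample,
        # or raise if there is more than one sample (ambiguous case).
        if batch_size == 1:
            return [images]
        raise ValueError(
            "A flat list of images is ambiguous when batch_size > 1; "
            "please pass a list of lists, one sub-list of images per text."
        )
    return images  # already List[List[Image]]
```

With this rule, `normalize_images(img, 1)` and `normalize_images([img1, img2], 1)` both succeed (everything goes to the one sample), while a flat list with `batch_size > 1` raises instead of silently assuming one image per sample.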

@Rocketknight1 Rocketknight1 changed the title Fix the structure of images output by the processor Fix the structure of images output by PixtralProcessor Dec 5, 2024
@Rocketknight1 Rocketknight1 changed the title Fix the structure of images output by PixtralProcessor Fix the structure of images in PixtralProcessor Dec 5, 2024
@Rocketknight1
Member Author

Rocketknight1 commented Dec 5, 2024

cc @yurkoff-mv, can you try out this PR and confirm it resolves the issues you were having? You can use it with pip install git+https://github.com/huggingface/transformers.git@pixtral_processor_structure_fix

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Member Author

also cc @zucchini-nlp and @ArthurZucker / @LysandreJik for core maintainer review

@yurkoff-mv

@Rocketknight1, thank you for responding so quickly. Yes, these fixes work.

Collaborator

@ArthurZucker ArthurZucker left a comment


Cool! The new errors need to be tested, and the docs should mention this to remove confusion!

Member

@zucchini-nlp zucchini-nlp left a comment


I am wondering: is there any specific reason to support such complex processing if we end up flattening it in the modeling code anyway? I think we should support both flat lists and batched lists as inputs, similar to most other VLMs, without trying to batch-list inputs before passing them to processors.

It can help us in the long run to get standardized processors

@Rocketknight1
Member Author

Hi @zucchini-nlp, how do the inputs differ here compared to most other processors? I thought text = List[str] and images = List[List[Image]] was pretty normal!

@zucchini-nlp
Member

@Rocketknight1 Usually we support both formats, List[List[Image]] and simply List[Image], except for some cases like Idefics, where the cross-attention module needs to know how many images each prompt should attend to. Maybe we can also ask @yonigozlan; he's working on standardizing processor input/output formats for VLMs.

@Rocketknight1
Member Author

Closing because this has been included in #34801 instead.

5 participants