
Fix the structure of images in PixtralProcessor #35107

Closed · wants to merge 7 commits

Conversation

Rocketknight1
Member

@Rocketknight1 Rocketknight1 commented Dec 5, 2024

The PixtralProcessor has some issues regarding correct nesting depth of inputs. If we are fully explicit about input nesting, then text should be List[str] and images should be List[List[Image]]. This is because each sample only has one text, but can have multiple images.

The start of the Pixtral processor code handles cases where users don't supply fully nested inputs. However, it uses the heuristic that if the user passes List[str] for text and List[Image] for images, this indicates one image per sample. This is very confusing for users, especially because the output changes when batch_size == 1 depending on whether that single input is passed as str or [str].

With this PR, we avoid that assumption. If the user supplies a single image or a flat list of images and batch_size == 1, we assign them all to the first (and only) sample. If batch_size > 1, we throw an error asking them to supply an explicit list of lists instead. This seems much safer!
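The normalization rule described above can be sketched as follows. This is an illustrative standalone helper, not the actual PixtralProcessor code; the function name and error message are hypothetical:

```python
# Illustrative sketch of the nesting rule: text is List[str] (one prompt per
# sample), images should be List[List[Image]] (one sub-list per sample).
# A flat list of images is only unambiguous when there is exactly one sample.
def normalize_images(images, batch_size):
    """Normalize `images` into a list of lists, one sub-list per text sample."""
    if images is None:
        return None
    if not isinstance(images, list):
        # A single bare image: wrap it into a flat list first.
        images = [images]
    if images and not isinstance(images[0], list):
        # Flat list of images: assign all of them to the single sample,
        # or raise if there is more than one sample (ambiguous case).
        if batch_size == 1:
            return [images]
        raise ValueError(
            "A flat list of images is ambiguous when batch_size > 1; "
            "please pass a list of lists, one sub-list of images per text."
        )
    return images  # already List[List[Image]]
```

With this rule, `normalize_images(img, 1)` and `normalize_images([img1, img2], 1)` both succeed (everything goes to the one sample), while a flat list with `batch_size > 1` raises instead of silently assuming one image per sample.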

@Rocketknight1 Rocketknight1 changed the title Fix the structure of images output by the processor Fix the structure of images output by PixtralProcessor Dec 5, 2024
@Rocketknight1 Rocketknight1 changed the title Fix the structure of images output by PixtralProcessor Fix the structure of images in PixtralProcessor Dec 5, 2024
@Rocketknight1
Member Author

Rocketknight1 commented Dec 5, 2024

cc @yurkoff-mv, can you try out this PR and confirm it resolves the issues you were having? You can use it with pip install git+https://github.com/huggingface/transformers.git@pixtral_processor_structure_fix

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Member Author

also cc @zucchini-nlp and @ArthurZucker / @LysandreJik for core maintainer review

@yurkoff-mv

@Rocketknight1, thank you for responding so quickly. Yes, these fixes work.

Collaborator

@ArthurZucker ArthurZucker left a comment


Cool! The new errors need to be tested, and the docs should mention this to remove confusion!

Member

@zucchini-nlp zucchini-nlp left a comment


I am wondering: is there any specific reason to support such complex processing if we end up flattening it in the modeling code anyway? I think we should support both flat lists and batched lists as inputs, similar to most other VLMs, without trying to batch-list inputs before passing them to processors.

It can help us in the long run to get standardized processors

@Rocketknight1
Member Author

Hi @zucchini-nlp, how do the inputs differ here compared to most other processors? I thought text = List[str] and images = List[List[Image]] was pretty normal!

@zucchini-nlp
Member

@Rocketknight1 Usually we support both formats, List[List[Image]] and simply List[Image], except for some cases like Idefics, where the cross-attention module needs to know how many images each prompt should attend to. Maybe we can also ask @yonigozlan; he's working on standardizing processor input/output formats for VLMs.

@Rocketknight1
Member Author

Closing because this has been included in #34801 instead.

5 participants