
Auto model & pipeline for image-text-to-image-text models #32926

Open · 2 of 14 tasks
leloykun opened this issue Aug 22, 2024 · 6 comments
Labels
Feature request (Request for a new feature)

Comments

leloykun (Contributor) commented Aug 22, 2024

Feature request

This is a tracker issue for work on interleaved in-and-out image-text generation.

There are now at least five open-source models that can do interleaved image-text generation, and many more are expected to be released. It would therefore be practical and useful for us to (1) add native support for such models and (2) standardize the flow of data through processors and pipelines, as done in #31911 and #32472.
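To make the interleaved output format concrete, here is a minimal, hypothetical helper (plain Python, no transformers dependency) that splits a generated token-id sequence into alternating text and image segments. The begin/end-image sentinel ids are assumptions for illustration; real models define their own special tokens.

```python
def split_interleaved(token_ids, begin_image_id, end_image_id):
    """Split a flat id sequence into [("text", [...]), ("image", [...]), ...].

    The sentinel ids are hypothetical placeholders; each model's tokenizer
    defines its own image-boundary tokens.
    """
    segments, current, mode = [], [], "text"
    for tok in token_ids:
        if tok == begin_image_id:
            # Close the current text segment (if any) and switch to image mode.
            if current:
                segments.append((mode, current))
            current, mode = [], "image"
        elif tok == end_image_id:
            # Close the image segment and switch back to text mode.
            segments.append((mode, current))
            current, mode = [], "text"
        else:
            current.append(tok)
    if current:
        segments.append((mode, current))
    return segments
```

A standardized processor or pipeline would presumably perform a postprocessing step of roughly this shape before decoding text segments with the tokenizer and image segments with the VQ decoder.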

| Model | GitHub | Notes | PR |
| --- | --- | --- | --- |
| Anole | https://github.com/GAIR-NLP/anole | - | #32013 |
| Chameleon | https://github.com/facebookresearch/chameleon | - | #32013 |
| Llava-NeXT-Interleaved | https://github.com/LLaVA-VL/LLaVA-NeXT | - | - |
| Lumina-mGPT | https://github.com/Alpha-VLLM/Lumina-mGPT | - | - |
| Show-o | https://github.com/showlab/Show-o | - | - |
| Transfusion | - | Not open-source (yet, perhaps) | - |
| XGen-MM | https://github.com/salesforce/LAVIS/tree/xgen-mm | The paper and the GitHub repo don't actually demonstrate interleaved image-text generation yet, but the model was trained on such datasets and the architecture is well suited for it | - |
| Emu3 | https://github.com/baaivision/Emu3 | The official repo only has demos for text-only and image-only generation, but the model seems to have been trained on interleaved text-image datasets | - |

Initial work on Chameleon & Anole can be found in #32013 for reference.

Notes:

  • We explicitly exclude models that can only do text-only or image-only generation, as well as models that can do image-text generation but not in an interleaved manner.
  • As I've demonstrated in my repo, explicitly implementing the finite state machine (FSM) for switching between text-generation and image-generation modes, as done in Chameleon's repo, is not necessary: implicitly implementing the FSM with logits processors suffices, though more work is needed to find the most efficient implementation.
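To illustrate the second note, here is a minimal sketch of the "implicit FSM via logits processing" idea. This is not the implementation from #32013 or Chameleon's repo; the class, the token ids, and the single-sequence handling are simplifying assumptions. The processor tracks the current modality from the last emitted token and masks out tokens from the other modality.

```python
import torch


class ModalitySwitchLogitsProcessor:
    """Sketch of an implicit FSM: after a begin-image token is emitted, only
    image tokens (plus the end-image token) can be sampled; otherwise image
    tokens are blocked. All token ids here are hypothetical placeholders.
    Assumes batch size 1 for simplicity."""

    def __init__(self, image_token_ids, begin_image_id, end_image_id, vocab_size):
        # Boolean mask over the vocabulary: True for image-mode tokens.
        self.image_mask = torch.zeros(vocab_size, dtype=torch.bool)
        self.image_mask[image_token_ids] = True
        self.image_mask[end_image_id] = True
        self.begin_image_id = begin_image_id
        self.end_image_id = end_image_id
        self.in_image_mode = False

    def __call__(self, input_ids, scores):
        # Update the (implicit) FSM state from the last generated token.
        last = input_ids[0, -1].item()
        if last == self.begin_image_id:
            self.in_image_mode = True
        elif last == self.end_image_id:
            self.in_image_mode = False
        # Mask out whichever modality is currently disallowed.
        mask = self.image_mask.to(scores.device)
        if self.in_image_mode:
            return scores.masked_fill(~mask, float("-inf"))
        return scores.masked_fill(mask, float("-inf"))
```

In practice this would subclass `transformers.LogitsProcessor` and be passed to `generate()` via a `LogitsProcessorList`; the efficiency question mentioned above concerns how to batch and vectorize this state tracking.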

TODOs:

Motivation

  1. To make benchmarking and evaluating models on interleaved image-text tasks saner
  2. To continue work on Multimodal In-and-Out, Interleaved Structured Generation: https://github.com/leloykun/mmsg

Your contribution

I've already started work on Chameleon & Anole here: #32013

But I'm currently blocked by (1) not having enough time due to other responsibilities and (2) not having enough compute resources.

Any help would be appreciated!

@leloykun leloykun added the Feature request Request for a new feature label Aug 22, 2024
zucchini-nlp (Member)

FYI @NielsRogge and @merveenoyan, you've recently been discussing tags for these kinds of models on the Hub

GargDivanshu

@leloykun saw your comment on issue #33905 (Implement LlamaGen for Image Generation)

I want to work on these issues; can you tell me where to begin? I am reading #31911, as you mentioned above.

leloykun (Contributor, Author)

@GargDivanshu You might also wanna take a look at #32013

You can start by adding some of the missing tests and such to gain familiarity with the code there. And once you're ready, I can help you implement multimodal in-and-out for the other models.

GargDivanshu

Perfect, moving to #32013.

merveenoyan (Contributor)

@zucchini-nlp I think this falls under any-to-any on the Hub side, but I'm not sure transformers should have a separate pipeline: we don't have many of these models yet, and given the shift to any-to-any we would need yet another pipeline for models that add audio input or output on top of the modalities here. @NielsRogge

zucchini-nlp (Member)

Yes, I agree it should be any-to-any. I was just looping you in, since some contributors are working on adding these types of models :)

4 participants