
Auto model & pipeline for image-text-to-image-text models #32926

Open · 2 of 14 tasks
leloykun opened this issue Aug 22, 2024 · 6 comments
Labels
Feature request (Request for a new feature)

Comments

leloykun (Contributor) commented Aug 22, 2024

Feature request

This is a tracker issue for work on interleaved in-and-out image-text generation.

There are now at least five open-source models that can do interleaved image-text generation, and many more are expected to be released. It would therefore be practical and useful for us to (1) add native support for such models and (2) standardize the flow of data through processors and pipelines, as done in #31911 and #32472.
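To make the interleaved output format concrete, here is a minimal, hypothetical helper (plain Python, no transformers dependency) that splits a generated token-id sequence into alternating text and image segments. The begin/end-image sentinel ids are assumptions for illustration; real models define their own special tokens.

```python
def split_interleaved(token_ids, begin_image_id, end_image_id):
    """Split a flat id sequence into [("text", [...]), ("image", [...]), ...].

    The sentinel ids are hypothetical placeholders; each model's tokenizer
    defines its own image-boundary tokens.
    """
    segments, current, mode = [], [], "text"
    for tok in token_ids:
        if tok == begin_image_id:
            # Close the current text segment (if any) and switch to image mode.
            if current:
                segments.append((mode, current))
            current, mode = [], "image"
        elif tok == end_image_id:
            # Close the image segment and switch back to text mode.
            segments.append((mode, current))
            current, mode = [], "text"
        else:
            current.append(tok)
    if current:
        segments.append((mode, current))
    return segments
```

A standardized processor or pipeline would presumably perform a postprocessing step of roughly this shape before decoding text segments with the tokenizer and image segments with the VQ decoder.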

| Model | GitHub | Notes | PR |
| --- | --- | --- | --- |
| Anole | https://github.com/GAIR-NLP/anole | - | #32013 |
| Chameleon | https://github.com/facebookresearch/chameleon | - | #32013 |
| Llava-NeXT-Interleaved | https://github.com/LLaVA-VL/LLaVA-NeXT | - | - |
| Lumina-mGPT | https://github.com/Alpha-VLLM/Lumina-mGPT | - | - |
| Show-o | https://github.com/showlab/Show-o | - | - |
| Transfusion | - | Not open-source (yet, perhaps) | - |
| XGen-MM | https://github.com/salesforce/LAVIS/tree/xgen-mm | The paper and the GitHub repo don't actually demonstrate interleaved image-text generation yet, but the model was trained on such datasets and the architecture is well suited for it | - |
| Emu3 | https://github.com/baaivision/Emu3 | The official repo only has demos for text-only and image-only generation, but the model seems to have been trained on interleaved text-image datasets | - |

Initial work on Chameleon & Anole can be found in #32013 for reference.

Notes:

  • We explicitly exclude models that can only do text-only or image-only generation, as well as models that can do image-text generation but not in an interleaved manner.
  • As I've demonstrated in my repo, explicitly implementing the finite state machine (FSM) for switching between text-generation and image-generation modes, as done in Chameleon's repo, is not necessary: implicitly implementing the FSM with logits processors suffices, though more work is needed to find the most efficient implementation.
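To illustrate the second note, here is a minimal sketch of the "implicit FSM via logits processing" idea. This is not the implementation from #32013 or Chameleon's repo; the class, the token ids, and the single-sequence handling are simplifying assumptions. The processor tracks the current modality from the last emitted token and masks out tokens from the other modality.

```python
import torch


class ModalitySwitchLogitsProcessor:
    """Sketch of an implicit FSM: after a begin-image token is emitted, only
    image tokens (plus the end-image token) can be sampled; otherwise image
    tokens are blocked. All token ids here are hypothetical placeholders.
    Assumes batch size 1 for simplicity."""

    def __init__(self, image_token_ids, begin_image_id, end_image_id, vocab_size):
        # Boolean mask over the vocabulary: True for image-mode tokens.
        self.image_mask = torch.zeros(vocab_size, dtype=torch.bool)
        self.image_mask[image_token_ids] = True
        self.image_mask[end_image_id] = True
        self.begin_image_id = begin_image_id
        self.end_image_id = end_image_id
        self.in_image_mode = False

    def __call__(self, input_ids, scores):
        # Update the (implicit) FSM state from the last generated token.
        last = input_ids[0, -1].item()
        if last == self.begin_image_id:
            self.in_image_mode = True
        elif last == self.end_image_id:
            self.in_image_mode = False
        # Mask out whichever modality is currently disallowed.
        mask = self.image_mask.to(scores.device)
        if self.in_image_mode:
            return scores.masked_fill(~mask, float("-inf"))
        return scores.masked_fill(mask, float("-inf"))
```

In practice this would subclass `transformers.LogitsProcessor` and be passed to `generate()` via a `LogitsProcessorList`; the efficiency question mentioned above concerns how to batch and vectorize this state tracking.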

TODOs:

Motivation

  1. To make benchmarking and evaluating models on interleaved image-text tasks saner
  2. To continue work on Multimodal In-and-Out, Interleaved Structured Generation: https://github.com/leloykun/mmsg

Your contribution

I've already started work on Chameleon & Anole here: #32013

But I'm currently blocked by (1) not having enough time due to other responsibilities and (2) not having enough compute resources.

Any help would be appreciated!

@leloykun leloykun added the Feature request Request for a new feature label Aug 22, 2024
zucchini-nlp (Member)

FYI @NielsRogge and @merveenoyan, you've recently been discussing tags for these kinds of models on the Hub

GargDivanshu

@leloykun saw your comment on issue #33905 (Implement LlamaGen for Image Generation)

I want to work on these issues; can you tell me where to begin? I am reading #31911, as you mentioned above.

leloykun (Contributor, Author)

@GargDivanshu You might also wanna take a look at #32013

You can start by adding some of the missing tests and such to gain familiarity with the code there. And once you're ready, I can help you implement multimodal in-and-out for the other models.

GargDivanshu

Perfect, moving to #32013.

merveenoyan (Contributor)

@zucchini-nlp I think this falls under any-to-any on the Hub side, but I'm not sure transformers should have a separate pipeline: we don't have many of these models yet, and given the shift to any-to-any we would need yet another pipeline for models that add audio input or output on top of the modalities here. @NielsRogge

zucchini-nlp (Member)

Yes, I agree it should be any-to-any. I was just looping you in, since some contributors are working on adding these types of models :)

4 participants