
Implement LlamaGen for Image Generation #33905

Open
ighoshsubho opened this issue Oct 3, 2024 · 12 comments
Labels
Feature request, New model, Vision

Comments

@ighoshsubho

Feature request

Add support for LlamaGen, an autoregressive image generation model, to the Transformers library. LlamaGen applies the next-token prediction paradigm of large language models to visual generation.

Paper: https://arxiv.org/abs/2406.06525
Code: https://github.com/FoundationVision/LlamaGen

Key components to implement (a rough sampling sketch follows this list):

  1. Image tokenizer
  2. Autoregressive image generation model (based on Llama architecture)
  3. Class-conditional and text-conditional image generation
  4. Classifier-free guidance for sampling
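To make the request concrete, here is a minimal, self-contained sketch of the sampling scheme described above: autoregressive next-token prediction over image codes steered by classifier-free guidance. Everything in it is illustrative only; the toy model, codebook size, grid size, and guidance scale are assumptions rather than LlamaGen's actual implementation.

```python
# Minimal sketch: autoregressive sampling over image codes with classifier-free
# guidance (CFG). The model below is a stand-in nn.Module; the real LlamaGen
# decoder, codebook size, grid size, and guidance scale would differ.
import torch
import torch.nn as nn

codebook_size = 16384    # size of the image tokenizer's VQ codebook (assumed)
num_image_tokens = 256   # e.g. a 16x16 grid of latent codes (assumed)
cfg_scale = 4.0          # classifier-free guidance strength (assumed)

class ToyARModel(nn.Module):
    """Stand-in for the Llama-style decoder; returns next-token logits."""
    def __init__(self):
        super().__init__()
        # +2 extra embeddings: one conditional "class" token, one null token
        self.embed = nn.Embedding(codebook_size + 2, 64)
        self.head = nn.Linear(64, codebook_size)

    def forward(self, ids):
        # (batch, seq, 64) -> (batch, codebook_size) next-token logits
        return self.head(self.embed(ids)).mean(dim=1)

model = ToyARModel()
cond_id, uncond_id = codebook_size, codebook_size + 1
cond = torch.tensor([[cond_id]])      # conditional branch prefix
uncond = torch.tensor([[uncond_id]])  # unconditional (null) branch prefix

for _ in range(num_image_tokens):
    logits_cond = model(cond)
    logits_uncond = model(uncond)
    # Classifier-free guidance: push logits toward the conditional prediction.
    logits = logits_uncond + cfg_scale * (logits_cond - logits_uncond)
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
    cond = torch.cat([cond, next_token], dim=1)
    uncond = torch.cat([uncond, next_token], dim=1)

image_codes = cond[:, 1:]  # code indices the image tokenizer would decode
```

In the real model, the generated code indices would then be passed to the image tokenizer's decoder to reconstruct pixels, and the conditioning token would come from a class label or a text encoder.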

Motivation

LlamaGen demonstrates that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art image generation performance. Implementing it in Transformers would enable easier experimentation and integration with existing language models.

Your contribution

I can help by contributing this model, and can provide examples and detailed explanations of the model architecture and training process if needed.

ighoshsubho added the Feature request label on Oct 3, 2024
@SOGeKING-NUL

This looks like an incredible feature, Shubo! Please allow me to work with you on this as my open-source contribution for Hacktoberfest.

@LysandreJik
Member

Thanks for the request! cc @qubvel, @molbap, what do you think?

@qubvel
Member

qubvel commented Oct 4, 2024

Very interesting! As far as I know, we don't have image-generation models in transformers yet, or am I missing something? So I'm wondering which is the better place for such a model: transformers or diffusers (it's not a diffusion model, though).
cc @sayakpaul maybe

@zucchini-nlp
Member

Hey! Just saw this issue. I've been working on and reviewing some VLM models that can generate images or text from image+text inputs. TBH, the only image-generation architecture we have is ImageGPT, which is quite old and, if I understand correctly, very similar to LlamaGen. Two more PRs are open for VLMs with image generation: Chameleon's decoder VQ-VAE support, which went stale because the contributor got busy, and Emu3, which I hope to work on in the next weeks.

I like LlamaGen and think it would be a nice addition. From what I can see, the model doesn't take an image as input, so no inpainting or other tasks, only generation from text, and it shouldn't be hard to fit into the general model API. Do we need any controlled/structured generation, for example limiting the generated tokens to a specific subset and length? It would be super nice if that kind of control could be done with existing LogitsProcessors; adding new processors would add more maintenance burden for us.
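For what it's worth, fixed-length, subset-restricted decoding can already be expressed with existing `generate()` machinery, so a new processor may not be needed. A rough sketch, where the checkpoint is just a stand-in and the contiguous image-token id range is an assumption made up for illustration:

```python
# Sketch: restricting generation to a fixed-length sequence drawn from a subset
# of token ids, using only existing generate() arguments. "gpt2" is a stand-in
# checkpoint and the id range below is purely illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # a LlamaGen checkpoint would go here once supported
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image_token_ids = list(range(100, 1100))  # assumed contiguous range of VQ code ids

def restrict_to_image_tokens(batch_id, input_ids):
    # Called at every decoding step; only these ids may be sampled.
    return image_token_ids

inputs = tokenizer("a photo of a corgi", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,
    prefix_allowed_tokens_fn=restrict_to_image_tokens,
    min_new_tokens=256,  # fixed-length latent grid, e.g. 16x16 codes
    max_new_tokens=256,
)
image_codes = out[:, inputs["input_ids"].shape[1]:]
```

If I remember correctly, `generate()` also already exposes `guidance_scale` (backed by `UnbatchedClassifierFreeGuidanceLogitsProcessor`), which might cover the classifier-free guidance part without new processors either.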

@sayakpaul
Member

Very interesting discussion here.

My personal opinion is that if the image generation process is autoregressive in nature (for which transformers already has nice foundations), it makes sense to keep such models inside transformers.

diffusers houses models that involve some kind of denoising in the overall generation workflow. The only pipeline in diffusers that is not based on diffusion or rectified flow is aMUSEd (an open reproduction of MUSE), but even it has an iterative denoising schedule (in the form of masking). Broadly speaking, our generation workflow is abstracted through a DiffusionPipeline, which stitches together the different model-level components (a short illustration follows the list):

  • VAE
  • Denoiser
  • Text encoder
  • Scheduler
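As a concrete illustration of that abstraction (the checkpoint name is just a familiar example):

```python
from diffusers import DiffusionPipeline

# Loading any pipeline exposes the individual model-level components listed above.
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
print(type(pipe.vae).__name__)           # VAE
print(type(pipe.unet).__name__)          # denoiser
print(type(pipe.text_encoder).__name__)  # text encoder
print(type(pipe.scheduler).__name__)     # sampling scheduler
```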

But I will also let @yiyixuxu chime in here.

@GargDivanshu

Interesting problem here. I would like to collaborate with @ighoshsubho on this!

@qubvel
Member

qubvel commented Oct 7, 2024

Thanks, everyone, for the discussion! It seems like we've agreed that transformers will be a good place to implement this model. @zucchini-nlp, thanks for sharing the reference models, could you please also link any merged or ongoing PRs? I believe that would be super helpful for understanding patterns for implementation!

@zucchini-nlp
Member

These two PRs might help, but they have a lot of extra logic specific to interleaving images and text. I would say the closest one is ImageGPT, so LlamaGen can be implemented in a similar way :)
#32013
#33770

@GargDivanshu

@qubvel @ighoshsubho have you started the implementation of this model yet? If yes, and if you are okay with it, I would like to help you out.

@ighoshsubho
Author

@qubvel @ighoshsubho have you started the implementation of this model yet? If yes, and if you are okay with it, I would like to help you out.

Not yet, I was busy with something else; I will start implementing this soon.

@deepwilson

@ighoshsubho I would like to contribute as well. Please suggest how I can help.

@leloykun
Contributor

Hi all!

I have a tracker issue here for all the image-text in-and-out models: #32926

If I missed anything, please leave a comment!

I've also started work for Chameleon & Anole here: #32013

I've just rebased it onto main, and the remaining errors seem to be unrelated to the PR (e.g. Flax T5 failing even though I never touched it). I think the PR would be a good starting point for this and related models.

Please help me out!
