
Add LLaVa NeXT Video #31252

Merged (24 commits, Jun 26, 2024)

Conversation

@zucchini-nlp (Member) commented Jun 5, 2024

What does this PR do?

Adds a new model for video LLMs, built by tuning LLaVa-NeXT on a video dataset; it is the current SOTA for videos on the VideoMME benchmark.

This PR introduces some changes to how we usually handle interleaved multimodal data. I need @amyeroberts' opinion on whether it's a viable option for these kinds of models. I added a separate VideoProcessor class which has all the same parameters as the ImageProcessor but different logic. The reason: the image processor file was already full of extra helper functions, and adding more conditions seemed to degrade readability.

Also, I added a chat template; @NielsRogge, can you verify that it's correct? 😄 (it's in convert_model.py). I will push all templates to the hub later; currently only the 7B model has its template working.
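For illustration, here is a minimal usage sketch of what the API could look like (the hub checkpoint id, prompt format, and dummy video below are assumptions, not the final interface):

```python
import numpy as np
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed hub id
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(model_id)

# stand-in clip: 8 sampled frames, each of shape (height, width, channels)
video = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"  # assumed template
inputs = processor(text=prompt, videos=video, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```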

Fixes #31164

@zucchini-nlp zucchini-nlp marked this pull request as ready for review June 5, 2024 07:51
@zucchini-nlp changed the title from "Add LLaVa NeXT Video" to "[WIP] Add LLaVa NeXT Video" on Jun 5, 2024
@zucchini-nlp zucchini-nlp marked this pull request as draft June 5, 2024 07:52
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

The import needs to be like this, I believe.

@zucchini-nlp zucchini-nlp marked this pull request as ready for review June 7, 2024 08:04
@zucchini-nlp (Member, Author)

Ready for review!

@zucchini-nlp changed the title from "[WIP] Add LLaVa NeXT Video" to "Add LLaVa NeXT Video" on Jun 7, 2024
@amyeroberts (Collaborator) left a comment

Thanks for all the work adding this model!

Most comments are about "copied from" statements and diff file choices.

src/transformers/image_utils.py (review thread, resolved)
```python
    raise ValueError(f"Could not make batched video from {videos}")


class LlavaNextVideoVideoProcessor(BaseImageProcessor):
```
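(As an aside on the context above: the raise presumably ends a batching helper. A hedged sketch of what such a helper could look like; the accepted input types are assumptions, and the actual helper in this PR may differ:)

```python
def make_batched_videos(videos):
    """Normalize `videos` into a list of videos, each a list of frames (sketch)."""
    # already a batch: a list of videos, each given as a list of frames
    if isinstance(videos, (list, tuple)) and isinstance(videos[0], (list, tuple)):
        return videos
    # a single video given as a list of frames
    if isinstance(videos, (list, tuple)):
        return [videos]
    # a single video given as a 4D array: (num_frames, height, width, channels)
    if hasattr(videos, "ndim") and videos.ndim == 4:
        return [list(videos)]
    raise ValueError(f"Could not make batched video from {videos}")
```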
Collaborator

To be consistent with other models, this should really be LlavaNextVideoImageProcessor.

I'd say we should do this for this model, and then in a separate PR we can consider introducing a VideoProcessor class, which we might want to use for new models and as a possible alias for the image processor in the other video models.

The outstanding question would be Video-LLaVA, which takes both images and videos. Given the model, I'd say it should probably be a VideoProcessor.

@zucchini-nlp (Member, Author)

OK, but this model also accepts both as input. Right now it uses the plain LlavaNext image processor and adds a new class, LlavaNextVideoImageProcessor, for video processing. So, for VideoLlava we can separate it into two classes in the same way, one for image and another for video processing.

The only major change is that the arg name for images will be pixel_values and not pixel_values_images if we do so.

I will call this one LlavaNextVideoImageProcessor then and try to unify video-image models in a separate PR.
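To illustrate the split being discussed, a rough sketch of how a wrapping processor could dispatch the two modalities (class and key names follow this thread and are not the final API):

```python
class LlavaNextVideoProcessor:
    """Hypothetical wrapper combining an image processor, a video processor and a tokenizer."""

    def __init__(self, image_processor, video_processor, tokenizer):
        self.image_processor = image_processor  # e.g. LlavaNextImageProcessor
        self.video_processor = video_processor  # e.g. LlavaNextVideoImageProcessor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, videos=None, **kwargs):
        inputs = {}
        if images is not None:
            # images keep the plain `pixel_values` key
            inputs.update(self.image_processor(images, **kwargs))
        if videos is not None:
            # videos get their own `pixel_values_videos` key
            video_features = self.video_processor(videos, **kwargs)
            inputs["pixel_values_videos"] = video_features["pixel_values"]
        if text is not None:
            inputs.update(self.tokenizer(text, **kwargs))
        return inputs
```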

@NielsRogge (Contributor) left a comment

Some initial comments; in particular, the diff should probably be entirely defined in diff_llava_next_video.py, which then populates the config and modeling files.

docs/source/en/model_doc/llava-next-video.md (5 review threads, resolved)
@zucchini-nlp (Member, Author) commented Jun 11, 2024

@amyeroberts ready for re-review

  • The diff was not working and wasn't copying content from the parent class (LLava_NeXT). I made some changes in the diff-converter file, so it would be nice to get @ArthurZucker's review on that. However, the diff still doesn't work when we have to overwrite the docstring for the config, so I simply made my own new config file. I also tried diff-converting example2; it still doesn't copy the docstring and I couldn't solve it yet.
  • The VideoProcessor is gone, so now we just have two ImageProcessor classes, one per modality, simply to isolate unrelated code in two files.

@ArthurZucker (Collaborator) left a comment

🔥 kudos for using the diff file so well

docs/source/en/model_doc/llava-next-video.md (review thread)
utils/diff_model_converter.py (2 review threads)
Comment on lines 335 to 337 of utils/diff_model_converter.py:

```python
# Iterate directly from node.body as there can be property/setters with the same names, which are overwritten when we use a dict
for func in original_node.body.body:
    name = func.name.value if hasattr(func, "name") else func
```
Collaborator

So you mean original_methods.items() is overwriting some things?

Otherwise, the goal is indeed to overwrite the methods using func.with_changes!

@zucchini-nlp (Member, Author)

This happens only when the original class has two methods with identical names, in my case a property and its setter. If we use a dict, only one is retained, since a dict can't hold two keys with the same name.

Not sure if there are cases where directly iterating might cause errors 🤔
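A minimal plain-Python illustration of the collision (no libcst, just the dict behaviour described above):

```python
class Example:
    @property
    def value(self):  # a FunctionDef named "value"
        return self._value

    @value.setter
    def value(self, v):  # a second FunctionDef, also named "value"
        self._value = v

# The class body holds two definitions with the same name; collapsing
# them into a dict keyed by name silently drops the first one:
methods = {}
for name, label in [("value", "getter"), ("value", "setter")]:
    methods[name] = label

print(methods)  # {'value': 'setter'}: the getter entry was overwritten
```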

Collaborator

Hahaha, I see!
Then it means we are not taking the decorator into account. Indeed, that is a good catch; it was probably affecting @younesbelkada in his Cohere port.

@amyeroberts (Collaborator) left a comment

Wow - looks great! 🔥

Some comments relating to the diff logic and the generated model file, but overall looks great

Collaborator

@ArthurZucker I can foresee hitting potential issues when e.g. we have more than one modeling file, for example the models under data2vec.

Seeing the config file here made me realise this. Rather than try to handle which imports belong to which file (e.g. configuration_llava_next vs. modeling_llava_next), we could make this simpler (and less prone to error) by having a diff file for each, e.g. diff_modeling_llava_next_video.py and diff_configuration_llava_next_video.py.

Collaborator

I agree, but there should be an easy way to solve the imports: basically, paste all imports everywhere and remove what is not used.

Collaborator

Keeping the config and tokenizer and whatever else in a single diff file will be simpler for now, IMO!



```python
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->LlavaNextVideo
```
Collaborator

I don't think we want to copy the "copied from" statements across from the original file 👀

@zucchini-nlp (Member, Author) commented Jun 13, 2024

Hehe, this is diff-generated, but I'll see if that can be fixed + ping @ArthurZucker for visibility.

UPD: Found how to remove those and can push a commit if needed. Btw, make fix-copies currently doesn't check diff files, so IMO the "copied from" statements are needed to catch and propagate code changes.

Collaborator

Yep, we could / should add a check to make fix-copies that tests the hash of the file against the hash of the file generated by the diff converter.
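A rough sketch of what that check could look like (the regeneration entry point is hypothetical and passed in as a callable):

```python
import hashlib
from pathlib import Path

def sha256_of(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_file_is_in_sync(regenerate, diff_path: str, generated_path: str) -> bool:
    """`regenerate` stands in for the diff converter's entry point (hypothetical);
    it should return the generated source for `diff_path` as a string."""
    regenerated = regenerate(diff_path)
    on_disk = Path(generated_path).read_text()
    return sha256_of(regenerated) == sha256_of(on_disk)
```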

@zucchini-nlp (Member, Author) commented Jun 20, 2024

I'll leave this one for the next PR (for whoever wants to add the checks) as it's not related to LLaVa-NeXT :)

Edit: I will open an issue so we don't forget.

```python
    Model tester for `LlavaNextVideoForConditionalGeneration`.
    """

    all_model_classes = (LlavaNextVideoForConditionalGeneration,) if is_torch_available() else ()
```
Collaborator

Do we need to define all_generative_models here?

@zucchini-nlp (Member, Author)

Same as in another PR: VLMs cannot use GenerationTesterMixin out of the box due to pixel_values, custom generation code, or generation-related code partially living in the modeling file (e.g. expand_attn_mask). It's on my todo list, but it requires more unification of VLMs first.

utils/diff_model_converter.py (review thread, resolved)
@zucchini-nlp (Member, Author)

@amyeroberts this one is also ready for a final review

@amyeroberts (Collaborator) left a comment

Great work - thanks for adding the model and for testing the diff converter!

@amyeroberts (Collaborator)

@zucchini-nlp The only thing before merge is to run the slow tests for the model with a [run_slow] llava-next-video commit message

@zucchini-nlp (Member, Author)

Hmm, I am not sure the slow tests are working; they were skipped here and in BlipVideo. I think the commit message format is correct.

@amyeroberts (Collaborator)

> Hmm, I am not sure the slow tests are working; they were skipped here and in BlipVideo. I think the commit message format is correct.

Ah, we didn't have the run-slow label on the PR, which might have caused it to be skipped. @zucchini-nlp Could you try again?

@amyeroberts mentioned this pull request Jun 24, 2024
@ydshieh force-pushed the llava_next_video branch from b0433ed to bc51d0c on June 26, 2024 07:19
@zucchini-nlp zucchini-nlp merged commit e71f286 into huggingface:main Jun 26, 2024
26 checks passed
Successfully merging this pull request may close these issues: LLaVA-NeXT-Video support (#31164).