[Llava] Add Llava to transformers #27662
Conversation
prompts: Union[List[TextInput], List[List[TextInput]]],
images=None,
cc @ydshieh: as can be seen, every multimodal processor has a different signature here, which will make it very difficult to support them in the image-to-text pipeline, for instance. Hence we need additional tests that check whether they all comply with the same API. In this case I'd advocate for the `images` keyword argument followed by `text`, as that's what most processors use.
I would suggest using `text` as the argument name for `LlavaProcessor`.
Yes, but also in reversed order, right?
Hmm, when I worked on Kosmos-2 I kinda copied from Blip2/InstructBlip, where the order is `images, text`. I see the Fuyu and CLIP processors are `text, images`.
I don't know if there is a clear criterion to determine this. In this case we can probably say `images` is the main input, so it should come before `text`.
What's your opinion on this ordering?
Ok, so that's another reason why we need tests for this 😓. Perhaps we could extend `test_forward_signature` to cover both models and processors (currently it's only implemented for text-only models). Personally I would go for `images` and then `text` (and ideally the latter should have been called `texts`, so it's also plural).
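A rough sketch of what such a shared check could look like, purely illustrative (the helper name and the expected order are assumptions, not an existing transformers test):

```python
import inspect


def check_processor_call_order(processor_class, expected_first_args):
    # Sketch of a shared test: verify that the leading arguments of a
    # multimodal processor's __call__ follow the order the library settles on.
    params = list(inspect.signature(processor_class.__call__).parameters)
    # params[0] is `self`; compare the following parameter names to the expected order.
    actual = tuple(params[1 : 1 + len(expected_first_args)])
    assert actual == tuple(expected_first_args), f"unexpected order: {actual}"
```

With the signature proposed further down in this thread, `check_processor_call_order(LlavaProcessor, ("text", "images"))` would pass; flipping the expected tuple to `("images", "text")` is what enforcing images-first would look like.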
The current `text` is likely there because it is the argument name in `PreTrainedTokenizerBase` (and therefore in every tokenizer class):
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
batch_indices, non_image_indices = torch.where(input_ids != self.config.image_token_index)

# 2. Compute the positions where text should be written
new_token_positions = torch.cumsum((image_token_mask * (nb_text_tokens_per_images - 1) + 1), -1) - 1
That's very smart! Maybe add a comment there, like
# Calculate new positions for text tokens in merged image-text sequence.
# `image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
# `torch.cumsum` computes how each image token shifts subsequent text token positions.
# - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
or make it part of a step-by-step explanation in the docstrings
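To make the trick concrete, here is a standalone toy illustration of that position computation (values are made up; `nb_text_tokens_per_images` stands for the number of embeddings each image placeholder expands into):

```python
import torch

# 1 marks an image placeholder token, 0 a regular text token.
image_token_mask = torch.tensor([0, 1, 0, 0, 1, 0])
nb_text_tokens_per_images = 4  # each image placeholder expands into 4 embeddings

# Each image token adds (nb_text_tokens_per_images - 1) extra positions; cumsum
# accumulates those shifts, and the trailing -1 restores zero-based indexing.
new_token_positions = torch.cumsum(image_token_mask * (nb_text_tokens_per_images - 1) + 1, -1) - 1
print(new_token_positions)  # tensor([ 0,  4,  5,  6, 10, 11])
```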
super().__init__(image_processor, tokenizer)

def __call__(
Nit: maybe add type hints there, and maybe pass explicit args to the tokenizer?

```python
def __call__(
    self,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    images: ImageInput = None,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length: Optional[int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
) -> BatchFeature:
```
Also, I would either pass `**kwargs` to the tokenizer (or pass the explicitly needed arguments); I'm thinking about `pad_to_multiple_of` and `verbose`, for instance.
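For what it's worth, a minimal sketch of how such a `__call__` could forward those arguments (illustrative only; the class name is hypothetical and this is not the implementation in the PR):

```python
from transformers.feature_extraction_utils import BatchFeature


class LlavaProcessorSketch:
    # Hypothetical stand-in for a ProcessorMixin subclass, kept minimal on purpose.
    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, padding=False, truncation=None,
                 max_length=None, return_tensors=None, **kwargs):
        # Forward the text arguments, plus any remaining kwargs such as
        # `pad_to_multiple_of` or `verbose`, straight to the tokenizer.
        text_inputs = self.tokenizer(
            text,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            return_tensors=return_tensors,
            **kwargs,
        )
        # Images go through the image processor; skip it for text-only calls.
        image_inputs = (
            self.image_processor(images, return_tensors=return_tensors)
            if images is not None
            else {}
        )
        return BatchFeature(data={**text_inputs, **image_inputs})
```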
"""The LLAVA model which consists of a vision backbone and a language model.""", | ||
LLAVA_START_DOCSTRING, | ||
) | ||
class LlavaForConditionalGeneration(LlavaPreTrainedModel): |
Just a question here: are we ok with not having a `LlavaModel`? I remember for BLIP-2 people were asking for it afterwards. As we're using `AutoModelForCausalLM` in the head class, we cannot use `AutoModel` in a base `LlavaModel` class, as it would not be compatible with the weights. Hence if we were to add a `LlavaModel`, we would need to use `AutoModelForCausalLM` and remove the head, I assume.
I'd rather we leave it as is, without an extra layer of complexity, and just add a new model output class, maybe with the hidden states before the head.
(also don't want to have issues with base model prefixes)
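Something along these lines could be that output class; a rough sketch only, with illustrative field names rather than whatever the PR ends up shipping:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch

from transformers.utils import ModelOutput


@dataclass
class LlavaCausalLMOutputSketch(ModelOutput):
    # Usual CausalLM fields plus the pre-head hidden states and the projected
    # image features, so users can get "base model" style outputs without a
    # separate LlavaModel class.
    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    image_hidden_states: Optional[torch.FloatTensor] = None
```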
# Retrieve the first layer to inspect the logits and mask out the hidden states
# that are set to 0
first_layer_past_key_value = past_key_values[0][0][:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
very nice trick
cool trick
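For readers puzzling over the trick: as I understand it, with LLaMA-style blocks (RMSNorm, no attention bias) a position whose merged input embedding is all zeros ends up with an exactly-zero key cache entry in the first layer, so slicing one head and one feature is enough to locate the non-attended positions. A toy illustration with made-up shapes:

```python
import torch

batch, num_heads, seq_len, head_dim = 2, 4, 5, 8
# Fake first-layer key cache, shaped like past_key_values[0][0]:
# (batch, num_heads, seq_len, head_dim).
first_layer_keys = torch.randn(batch, num_heads, seq_len, head_dim)
# Pretend the first two positions of sample 0 were left-padding whose merged
# embeddings were zero: their key cache rows are exactly zero as well.
first_layer_keys[0, :, :2, :] = 0.0

# Same slicing as in the modeling code: head 0, feature 0 is enough to find them.
first_layer_past_key_value = first_layer_keys[:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
print(batch_index, non_attended_tokens)  # tensor([0, 0]) tensor([0, 1])
```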
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
# this is not memory efficient at all: output_hidden_states=True will save all the hidden states.
This made me realize the current way of getting features out of our `AutoBackbone` classes is very inefficient as well (they also use `output_hidden_states=True`, as done here for instance).
Will update this for all backbones.
Will create a GitHub issue for this. Maybe we can leverage that here as well?
Yep that would be nice I think
See #27873
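In the meantime, one generic way to grab a single intermediate layer without materializing every hidden state is a forward hook on the encoder layer of interest. This is only an illustration of the idea being discussed (the checkpoint and the layer index are placeholders), not what the PR or #27873 implements:

```python
import torch
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision_feature_layer = -2  # placeholder: the encoder layer whose output we want

captured = {}

def capture_hidden_state(module, inputs, outputs):
    # CLIPEncoderLayer returns a tuple whose first element is the hidden state.
    captured["features"] = outputs[0]

layer = vision_tower.vision_model.encoder.layers[vision_feature_layer]
handle = layer.register_forward_hook(capture_hidden_state)

pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    vision_tower(pixel_values)  # no output_hidden_states=True, so no full tuple is kept
handle.remove()

image_features = captured["features"]  # (batch, num_patches + 1, hidden_size)
```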
* add model like * logits match * minor fixes * fixes * up * up * add todo * llava processor * keep the processor simple * add conversion script * fixup * fix copies * up * add to index * fix config + logits * fix * refactor * more refactor * more refactor * fix copies * add authors * v1 tests * add `LlavaProcessor` in init * remove unneeded import * up * up * docs * up * fix CI * fix CI * add attention mask in test * make fixup * remove the vision model * that' s the dirty way to do it * nits * nits * updates * add more tests * add input tests * fixup * more styling * nits * updates amd cleanup * fixup the generation expected results * fix the testing script * some cleanup and simplification which does not work yet but almost there! * make correct dispatch operations * vectorize works for batch of images and text * last todos * nits * update test and modeling code * remove useless function for now * fix few issues * fix generation * some nits * add bakllava * nits * remove duplicated code * finis merge * cleanup * missed this line * fill the todos * add left padding offset * add left and rignt padding logic * bool to properly index * make sure * more cleanups * batch is fixed 😉 * add correct device for tensor creation * fix some dtype missmatch * ruff * update conversion script * Update src/transformers/__init__.py * fa 2 support + fix conversion script * more * correct reshaping * fix test dict * fix copies by ignoring * fix nit * skip clip vision model * fixup * fixup * LlavaForVisionText2Text -> LlavaForCausalLM * update * fix * raise correct errors * fix * docs * nuke for now * nits here and there * fixup * fix remaining tests * update LlavaForConditionalGeneration instead of CausalLM * fixups * pipeline support * slow and piepline tests * supports batch * nits * cleanup * fix first integration tests * add pad token where needed * correct etsts * fixups * update pipeline testr * fix quality * nits * revert unneeded change * nit * use BatchFeature * from ...feature_extraction_utils import BatchFeature * nits * nits * properly update * more f*** nits * fix copies * comment * keep slow test slow * Update src/transformers/models/llava/processing_llava.py Co-authored-by: Arthur <[email protected]> * add piepline example * add pixel values in docstrign * update pr doctest * fix * fix slow tests * remove hack * fixup * small note * forward contrib credits from PR25789 * forward contrib credits from original implementation and work * add arthur * Update src/transformers/models/llava/processing_llava.py Co-authored-by: Lysandre Debut <[email protected]> * update docstring * nit * move to not doctested because of timeout issues * fixup * add description * more * fix-copies * fix docs * add beam search * add more comments * add typehints on processor * add speedup plot * update slow tests and docs * push test * push batched test * fix batched generation with different number of images * remove benchmark due to a bug * fix test * fix copies * add gcolab demo --------- Co-authored-by: Arthur Zucker <[email protected]> Co-authored-by: Arthur <[email protected]> Co-authored-by: shauray8 <[email protected]> Co-authored-by: haotian-liu <[email protected]> Co-authored-by: Lysandre Debut <[email protected]>
Hello, can you tell me how to use
What does this PR do?
Adds Llava, a multimodal model, to the transformers library.
Llava is a multi-modal model that claims performance competitive with GPT-4 on multi-modal tasks. There are currently 3 main variants of this architecture:
This implementation leverages `AutoModelForCausalLM`, similarly to `Blip2`, to load the correct language model. The goal of this PR is to make it agnostic across all language model architectures.

Closes #25789
Closes #27221
Original llava author: https://github.com/haotian-liu/LLaVA @haotian-liu
Original PR author: @shauray8
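For anyone landing here and wondering how to actually run the model, a minimal usage sketch; the checkpoint name and prompt template are assumptions taken from the llava-hf model cards, so double-check them there:

```python
import requests
import torch
from PIL import Image

from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name, check the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prompt format with an <image> placeholder, following the Llava 1.5 convention.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```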