[Llava] Add Llava to transformers #27662
Conversation
prompts: Union[List[TextInput], List[List[TextInput]]],
images=None,
cc @ydshieh: as can be seen, every multimodal processor has a different signature here, which will make it very difficult to support them in the image-to-text pipeline, for instance. Hence we need additional tests that check whether they all comply with the same API. In this case I'd advocate for the `images` keyword argument followed by `text`, as that's what most processors use.
I would suggest using `text` as the argument name for `LlavaProcessor`.
Yes, but also in reversed order, right?
Hmm, when I worked on Kosmos-2 I kinda copied from Blip2/InstructBlip, where the order is `images, text`. I see the Fuyu and CLIP processors are `text, images`.
I don't know if there is a clear criterion to determine this. In this case we can probably say `images` is the main input, so it should come before `text`.
What's your opinion on this ordering?
Ok, so that's another reason why we need tests for this 😓. Perhaps we could extend `test_forward_signature` to cover both models and processors (currently it's only implemented for text-only models). Personally I would go for `images` and then `text` (and ideally the latter should have been called `texts`, so it's also plural).
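A rough sketch of what such a shared check could look like, purely illustrative (the helper name and the expected order are assumptions, not an existing transformers test):

```python
import inspect


def check_processor_call_order(processor_class, expected_first_args):
    # Sketch of a shared test: verify that the leading arguments of a
    # multimodal processor's __call__ follow the order the library settles on.
    params = list(inspect.signature(processor_class.__call__).parameters)
    # params[0] is `self`; compare the following parameter names to the expected order.
    actual = tuple(params[1 : 1 + len(expected_first_args)])
    assert actual == tuple(expected_first_args), f"unexpected order: {actual}"
```

With the signature proposed further down in this thread, `check_processor_call_order(LlavaProcessor, ("text", "images"))` would pass; flipping the expected tuple to `("images", "text")` is what enforcing images-first would look like.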
The current `text` is likely there because it is the argument name in `PreTrainedTokenizerBase` (and therefore in every tokenizer class):
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
batch_indices, non_image_indices = torch.where(input_ids != self.config.image_token_index)

# 2. Compute the positions where text should be written
new_token_positions = torch.cumsum((image_token_mask * (nb_text_tokens_per_images - 1) + 1), -1) - 1
That's very smart! Maybe add a comment there, like
# Calculate new positions for text tokens in merged image-text sequence.
# `image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
# `torch.cumsum` computes how each image token shifts subsequent text token positions.
# - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
or make it part of a step-by-step explanation in the docstrings
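To make the trick concrete, here is a standalone toy illustration of that position computation (values are made up; `nb_text_tokens_per_images` stands for the number of embeddings each image placeholder expands into):

```python
import torch

# 1 marks an image placeholder token, 0 a regular text token.
image_token_mask = torch.tensor([0, 1, 0, 0, 1, 0])
nb_text_tokens_per_images = 4  # each image placeholder expands into 4 embeddings

# Each image token adds (nb_text_tokens_per_images - 1) extra positions; cumsum
# accumulates those shifts, and the trailing -1 restores zero-based indexing.
new_token_positions = torch.cumsum(image_token_mask * (nb_text_tokens_per_images - 1) + 1, -1) - 1
print(new_token_positions)  # tensor([ 0,  4,  5,  6, 10, 11])
```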
super().__init__(image_processor, tokenizer)

def __call__(
Nit: maybe add type hints there, and maybe pass explicit args to the tokenizer?

```python
def __call__(
    self,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    images: ImageInput = None,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length: Optional[int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
) -> BatchFeature:
```
Also, I would either pass `**kwargs` to the tokenizer (or pass the explicitly needed arguments); I'm thinking about `pad_to_multiple_of` and `verbose`, for instance.
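For what it's worth, a minimal sketch of how such a `__call__` could forward those arguments (illustrative only; the class name is hypothetical and this is not the implementation in the PR):

```python
from transformers.feature_extraction_utils import BatchFeature


class LlavaProcessorSketch:
    # Hypothetical stand-in for a ProcessorMixin subclass, kept minimal on purpose.
    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, padding=False, truncation=None,
                 max_length=None, return_tensors=None, **kwargs):
        # Forward the text arguments, plus any remaining kwargs such as
        # `pad_to_multiple_of` or `verbose`, straight to the tokenizer.
        text_inputs = self.tokenizer(
            text,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            return_tensors=return_tensors,
            **kwargs,
        )
        # Images go through the image processor; skip it for text-only calls.
        image_inputs = (
            self.image_processor(images, return_tensors=return_tensors)
            if images is not None
            else {}
        )
        return BatchFeature(data={**text_inputs, **image_inputs})
```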
"""The LLAVA model which consists of a vision backbone and a language model.""", | ||
LLAVA_START_DOCSTRING, | ||
) | ||
class LlavaForConditionalGeneration(LlavaPreTrainedModel): |
Just a question here: are we ok with not having a `LlavaModel`? I remember for BLIP-2 people were asking for it afterwards. As we're using `AutoModelForCausalLM` in the head class, we cannot use `AutoModel` in a base `LlavaModel` class, as it would not be compatible with the weights. Hence if we were to add a `LlavaModel`, we would need to use `AutoModelForCausalLM` and remove the head, I assume.
I'd rather we leave it as is, without an extra layer of complexity, and just add a new model output class, maybe with the hidden states before the head.
(also don't want to have issues with base model prefixes)
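Something along these lines could be that output class; a rough sketch only, with illustrative field names rather than whatever the PR ends up shipping:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch

from transformers.utils import ModelOutput


@dataclass
class LlavaCausalLMOutputSketch(ModelOutput):
    # Usual CausalLM fields plus the pre-head hidden states and the projected
    # image features, so users can get "base model" style outputs without a
    # separate LlavaModel class.
    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    image_hidden_states: Optional[torch.FloatTensor] = None
```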
# Retrieve the first layer to inspect the logits and mask out the hidden states
# that are set to 0
first_layer_past_key_value = past_key_values[0][0][:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
very nice trick
cool trick
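For readers puzzling over the trick: as I understand it, with LLaMA-style blocks (RMSNorm, no attention bias) a position whose merged input embedding is all zeros ends up with an exactly-zero key cache entry in the first layer, so slicing one head and one feature is enough to locate the non-attended positions. A toy illustration with made-up shapes:

```python
import torch

batch, num_heads, seq_len, head_dim = 2, 4, 5, 8
# Fake first-layer key cache, shaped like past_key_values[0][0]:
# (batch, num_heads, seq_len, head_dim).
first_layer_keys = torch.randn(batch, num_heads, seq_len, head_dim)
# Pretend the first two positions of sample 0 were left-padding whose merged
# embeddings were zero: their key cache rows are exactly zero as well.
first_layer_keys[0, :, :2, :] = 0.0

# Same slicing as in the modeling code: head 0, feature 0 is enough to find them.
first_layer_past_key_value = first_layer_keys[:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
print(batch_index, non_attended_tokens)  # tensor([0, 0]) tensor([0, 1])
```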
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
# this is not memory efficient at all: output_hidden_states=True will save all the hidden states.
This made me realize the current way of getting features out of our `AutoBackbone` classes is very inefficient as well (they also use `output_hidden_states=True`, as done here for instance).
Will update this for all backbones.
Will create a GitHub issue for this. Maybe we can leverage that here as well?
Yep that would be nice I think
See #27873
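In the meantime, one generic way to grab a single intermediate layer without materializing every hidden state is a forward hook on the encoder layer of interest. This is only an illustration of the idea being discussed (the checkpoint and the layer index are placeholders), not what the PR or #27873 implements:

```python
import torch
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision_feature_layer = -2  # placeholder: the encoder layer whose output we want

captured = {}

def capture_hidden_state(module, inputs, outputs):
    # CLIPEncoderLayer returns a tuple whose first element is the hidden state.
    captured["features"] = outputs[0]

layer = vision_tower.vision_model.encoder.layers[vision_feature_layer]
handle = layer.register_forward_hook(capture_hidden_state)

pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    vision_tower(pixel_values)  # no output_hidden_states=True, so no full tuple is kept
handle.remove()

image_features = captured["features"]  # (batch, num_patches + 1, hidden_size)
```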
* add model like * logits match * minor fixes * fixes * up * up * add todo * llava processor * keep the processor simple * add conversion script * fixup * fix copies * up * add to index * fix config + logits * fix * refactor * more refactor * more refactor * fix copies * add authors * v1 tests * add `LlavaProcessor` in init * remove unneeded import * up * up * docs * up * fix CI * fix CI * add attention mask in test * make fixup * remove the vision model * that' s the dirty way to do it * nits * nits * updates * add more tests * add input tests * fixup * more styling * nits * updates amd cleanup * fixup the generation expected results * fix the testing script * some cleanup and simplification which does not work yet but almost there! * make correct dispatch operations * vectorize works for batch of images and text * last todos * nits * update test and modeling code * remove useless function for now * fix few issues * fix generation * some nits * add bakllava * nits * remove duplicated code * finis merge * cleanup * missed this line * fill the todos * add left padding offset * add left and rignt padding logic * bool to properly index * make sure * more cleanups * batch is fixed 😉 * add correct device for tensor creation * fix some dtype missmatch * ruff * update conversion script * Update src/transformers/__init__.py * fa 2 support + fix conversion script * more * correct reshaping * fix test dict * fix copies by ignoring * fix nit * skip clip vision model * fixup * fixup * LlavaForVisionText2Text -> LlavaForCausalLM * update * fix * raise correct errors * fix * docs * nuke for now * nits here and there * fixup * fix remaining tests * update LlavaForConditionalGeneration instead of CausalLM * fixups * pipeline support * slow and piepline tests * supports batch * nits * cleanup * fix first integration tests * add pad token where needed * correct etsts * fixups * update pipeline testr * fix quality * nits * revert unneeded change * nit * use BatchFeature * from ...feature_extraction_utils import BatchFeature * nits * nits * properly update * more f*** nits * fix copies * comment * keep slow test slow * Update src/transformers/models/llava/processing_llava.py Co-authored-by: Arthur <[email protected]> * add piepline example * add pixel values in docstrign * update pr doctest * fix * fix slow tests * remove hack * fixup * small note * forward contrib credits from PR25789 * forward contrib credits from original implementation and work * add arthur * Update src/transformers/models/llava/processing_llava.py Co-authored-by: Lysandre Debut <[email protected]> * update docstring * nit * move to not doctested because of timeout issues * fixup * add description * more * fix-copies * fix docs * add beam search * add more comments * add typehints on processor * add speedup plot * update slow tests and docs * push test * push batched test * fix batched generation with different number of images * remove benchmark due to a bug * fix test * fix copies * add gcolab demo --------- Co-authored-by: Arthur Zucker <[email protected]> Co-authored-by: Arthur <[email protected]> Co-authored-by: shauray8 <[email protected]> Co-authored-by: haotian-liu <[email protected]> Co-authored-by: Lysandre Debut <[email protected]>
Hello, can you tell me how to use
What does this PR do?
Adds Llava, a multimodal model, to the transformers library.
Llava is a multi-modal model that claims performance competitive with GPT-4 on multi-modal tasks. There are currently 3 main variants of this architecture:
This implementation leverages `AutoModelForCausalLM`, similarly to `Blip2`, to load the correct language model. The goal of this PR is to make it agnostic across all language model architectures.

Closes #25789
Closes #27221
Original llava author: https://github.com/haotian-liu/LLaVA @haotian-liu
Original PR author: @shauray8
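For anyone landing here and wondering how to actually run the model, a minimal usage sketch; the checkpoint name and prompt template are assumptions taken from the llava-hf model cards, so double-check them there:

```python
import requests
import torch
from PIL import Image

from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name, check the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prompt format with an <image> placeholder, following the Llava 1.5 convention.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```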