[Llava] Add Llava to transformers #27662

Merged · 166 commits merged into huggingface:main on Dec 7, 2023

Conversation

@younesbelkada (Contributor) commented on Nov 22, 2023

What does this PR do?

Adds Llava, a multimodal model, to the transformers library.
Llava is a multimodal model that claims competitive performance with GPT-4 on multimodal tasks. There are currently 3 main variants of this architecture.

This implementation leverages AutoModelForCausalLM, similarly to Blip2, to load the correct language model. The goal of this PR is to make it agnostic across all language model architectures.
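For illustration, the composition pattern described here looks roughly like the sketch below. This is a simplification, not the merged modeling code; in particular the single linear projector stands in for the model's actual projector module.

# Rough sketch of the language-model-agnostic composition described above.
# Not the merged modeling code: the projector is simplified to a single Linear layer.
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM, LlavaConfig


class SketchLlavaComposition(nn.Module):
    def __init__(self, config: LlavaConfig):
        super().__init__()
        # Both sub-modules are built from sub-configs, so any architecture that
        # AutoModelForCausalLM knows about can serve as the language backbone.
        self.vision_tower = AutoModel.from_config(config.vision_config)
        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
        # Projects vision hidden states into the language model's embedding space.
        self.multi_modal_projector = nn.Linear(
            config.vision_config.hidden_size, config.text_config.hidden_size
        )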

Closes #25789
Closes #27221

Original llava author: https://github.com/haotian-liu/LLaVA @haotian-liu
Original PR author: @shauray8

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)

prompt = "<image>\n"
prompt += "USER: What are the things I should be cautious about when I visit this place?\nASSISTANT:"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True).to(0)

raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0][2:], skip_special_tokens=True))
>>> USER: What are the things I should be cautious about when I visit this place?
ASSISTANT: When visiting this place, which appears to be a dock or pier extending into a body of water, you should be cautious about several factors. First, be aware of the water depth and any potential hazards, such as submerged rocks or debris, that could pose a risk to your safety. Second, be mindful of the weather conditions, as sudden changes in weather can make the dock or pier unsafe to use. Third, be cautious of the surrounding environment, as there may be wildlife or other natural elements that could pose a threat. Finally, be aware of any local regulations or guidelines for using the dock or pier, as these may include rules about fishing, swimming, or other activities. By being cautious and following any applicable guidelines, you can ensure a safe and enjoyable experience at this location.


Comment on lines 85 to 86
prompts: Union[List[TextInput], List[List[TextInput]]],
images=None,
Contributor:

cc @ydshieh — as can be seen, every multimodal processor has a different signature here, which will make it very difficult to support them in the image-to-text pipeline, for instance. Hence we need additional tests that check whether they all comply with the same API. In this case I'd advocate for the images keyword argument followed by text, as that's what most processors use.

Collaborator:

I would suggest using text as the argument name for LlavaProcessor here.

Contributor:

Yes, but also in reversed order, right?

Collaborator:

Hmm, when I worked on Kosmos-2, I kind of copied from Blip2/InstructBlip, where the order is images, text.
I see the Fuyu and CLIP processors are text, images.

I don't know if there is a clear criterion to determine this. In this case, we can probably say images is the main input, so it should come before text.

What's your opinion on this ordering?

Contributor:

OK, so that's another reason why we need tests for this 😓. Perhaps we could extend test_forward_signature to cover both models and processors (currently it's still only implemented for text-only models). Personally I would go for images and then text (and ideally the latter should have been called texts, so that it is also plural).
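As a rough illustration of what such a shared-signature check could look like (the helper name and the expected ordering below are assumptions, not an existing test in the repo):

import inspect


def check_processor_call_signature(processor_class, expected_leading_args=("images", "text")):
    """Assert that a processor's __call__ starts with the agreed argument order."""
    params = list(inspect.signature(processor_class.__call__).parameters)
    # params[0] is `self`; the following entries should match the agreed ordering.
    actual = tuple(params[1 : 1 + len(expected_leading_args)])
    assert actual == expected_leading_args, (
        f"{processor_class.__name__}.__call__ starts with {actual}, "
        f"expected {expected_leading_args}"
    )

# Usage (hypothetical): check_processor_call_signature(LlavaProcessor)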

Collaborator:

The current text is likely because it is the argument name in PreTrainedTokenizerBase (so every tokenizer class):

text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

batch_indices, non_image_indices = torch.where(input_ids != self.config.image_token_index)

# 2. Compute the positions where text should be written
new_token_positions = torch.cumsum((image_token_mask * (nb_text_tokens_per_images - 1) + 1), -1) - 1
Contributor:

That's very smart! Maybe add a comment there, like:

# Calculate new positions for text tokens in merged image-text sequence.
# `image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
# `torch.cumsum` computes how each image token shifts subsequent text token positions.
# - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.

or make it part of a step-by-step explanation in the docstrings
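As a concrete illustration of the trick (the numbers below are made up, not taken from the PR):

import torch

nb_text_tokens_per_images = 4                      # pretend each <image> token expands into 4 slots
image_token_mask = torch.tensor([[0, 1, 0, 0]])    # sequence: [text, <image>, text, text]

# Every token advances the write position by 1; an image token advances it by 4.
new_token_positions = torch.cumsum(image_token_mask * (nb_text_tokens_per_images - 1) + 1, -1) - 1
print(new_token_positions)  # tensor([[0, 4, 5, 6]]) -> later text tokens leave room for the image features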


super().__init__(image_processor, tokenizer)

def __call__(
Contributor:

Nit: maybe add type hints there, and maybe pass explicit args to the tokenizer?

def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        images: ImageInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchFeature:
    
Also, I would either pass **kwargs to the tokenizer (or pass the explicitly needed arguments); I'm thinking about `pad_to_multiple_of` and `verbose`, for instance.
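A simplified sketch of the forwarding being suggested (a toy stand-in for the processor, not the merged implementation):

from transformers import BatchFeature


class SketchProcessor:
    """Toy stand-in for LlavaProcessor, only to illustrate forwarding tokenizer kwargs."""

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, padding=False, truncation=None,
                 max_length=None, return_tensors=None, **kwargs) -> BatchFeature:
        data = {}
        if images is not None:
            data["pixel_values"] = self.image_processor(images, return_tensors=return_tensors)["pixel_values"]
        text_inputs = self.tokenizer(
            text, padding=padding, truncation=truncation, max_length=max_length,
            return_tensors=return_tensors,
            **kwargs,  # e.g. pad_to_multiple_of or verbose go straight to the tokenizer
        )
        data.update(text_inputs)
        return BatchFeature(data=data)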

"""The LLAVA model which consists of a vision backbone and a language model.""",
LLAVA_START_DOCSTRING,
)
class LlavaForConditionalGeneration(LlavaPreTrainedModel):
@NielsRogge (Contributor) commented on Dec 6, 2023:

Just a question here, are we ok with not having a LlavaModel? I remember for BLIP-2 people were asking for it afterwards, and as we're using AutoModelForCausalLM in the head class, we cannot use AutoModel in the base LlavaModel class, as it would not be compatible with the weights. Hence if we were to add a LlavaModel, we would need to use AutoModelForCausalLM and remove the head, I assume.

@ArthurZucker (Collaborator) commented on Dec 6, 2023:

I'd rather we leave it as is, without an extra layer of complexity! And just add a new model output class, maybe with the hidden states before the head.

Collaborator:

(also, we don't want to have issues with base model prefixes)

# Retrieve the first layer to inspect the logits and mask out the hidden states
# that are set to 0
first_layer_past_key_value = past_key_values[0][0][:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
Contributor:

very nice trick

cool trick
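To illustrate the trick with toy tensors (shapes and values are made up): positions that were never attended to keep zero-valued entries in the first layer's key cache, so slicing one head and one feature dimension is enough to recover their indices.

import torch

batch, num_heads, seq_len, head_dim = 2, 4, 5, 8
key_cache = torch.randn(batch, num_heads, seq_len, head_dim)
key_cache[0, :, 1, :] = 0.0   # pretend position 1 of sample 0 was padding
key_cache[1, :, 0, :] = 0.0   # pretend position 0 of sample 1 was padding
past_key_values = [(key_cache, torch.randn_like(key_cache))]  # single layer: (key, value)

# Same indexing as the snippet above: all batches, head 0, all positions, feature 0.
first_layer_past_key_value = past_key_values[0][0][:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
print(batch_index, non_attended_tokens)  # tensor([0, 1]) tensor([1, 0])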

Comment on lines +391 to +392
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
# this is not memory efficient at all (output_hidden_states=True) will save all the hidden stated.
@NielsRogge (Contributor) commented on Dec 6, 2023:

This made me realize that the current way of getting features out of our AutoBackbone classes is very inefficient as well (they also use output_hidden_states=True, as done here, for instance).

I will update this for all backbones.
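For reference, the pattern under discussion looks roughly like this (the checkpoint name and the layer index -2 are illustrative; LLaVA selects a late vision-tower layer rather than the pooled output):

import torch
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
pixel_values = torch.randn(1, 3, 336, 336)  # dummy image batch

with torch.no_grad():
    image_outputs = vision_tower(pixel_values, output_hidden_states=True)

# All intermediate hidden states are kept in memory just to index a single one of them.
selected_image_feature = image_outputs.hidden_states[-2]  # (batch, num_patches + 1, hidden_size)
print(selected_image_feature.shape)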

Contributor:

I will create a GitHub issue for this. Maybe we can leverage that here as well?

Collaborator:

Yep that would be nice I think

Contributor:

See #27873

@younesbelkada merged commit 44b5506 into huggingface:main on Dec 7, 2023
23 checks passed
@younesbelkada deleted the add-llava-final branch on December 7, 2023 at 08:30
nevikw39 pushed a commit to NTHU-ML-2023-team19/transformers that referenced this pull request Dec 7, 2023
* add model like

* logits match

* minor fixes

* fixes

* up

* up

* add todo

* llava processor

* keep the processor simple

* add conversion script

* fixup

* fix copies

* up

* add to index

* fix config + logits

* fix

* refactor

* more refactor

* more refactor

* fix copies

* add authors

* v1 tests

* add `LlavaProcessor` in init

* remove unneeded import

* up

* up

* docs

* up

* fix CI

* fix CI

* add attention  mask in test

* make fixup

* remove the vision model

* that' s the dirty way to do it

* nits

* nits

* updates

* add more tests

* add input tests

* fixup

* more styling

* nits

* updates amd cleanup

* fixup the generation expected results

* fix the testing script

* some cleanup and simplification which does not work yet but almost there!

* make correct dispatch operations

* vectorize works for batch of images and text

* last todos

* nits

* update test and modeling code

* remove useless function for now

* fix few issues

* fix generation

* some nits

* add bakllava

* nits

* remove duplicated code

* finis merge

* cleanup

* missed this line

* fill the todos

* add left padding offset

* add left and rignt padding logic

* bool to properly index

* make sure

* more cleanups

* batch is fixed 😉

* add correct device for tensor creation

* fix some dtype missmatch

* ruff

* update conversion script

* Update src/transformers/__init__.py

* fa 2 support + fix conversion script

* more

* correct reshaping

* fix test dict

* fix copies by ignoring

* fix nit

* skip clip vision model

* fixup

* fixup

* LlavaForVisionText2Text -> LlavaForCausalLM

* update

* fix

* raise correct errors

* fix

* docs

* nuke for now

* nits here and there

* fixup

* fix remaining tests

* update LlavaForConditionalGeneration instead of CausalLM

* fixups

* pipeline support

* slow and piepline tests

* supports batch

* nits

* cleanup

* fix first integration tests

* add pad token where needed

* correct etsts

* fixups

* update pipeline testr

* fix quality

* nits

* revert unneeded change

* nit

* use BatchFeature

* from ...feature_extraction_utils import BatchFeature

* nits

* nits

* properly update

* more f*** nits

* fix copies

* comment

* keep slow test slow

* Update src/transformers/models/llava/processing_llava.py

Co-authored-by: Arthur <[email protected]>

* add piepline example

* add pixel values in docstrign

* update pr doctest

* fix

* fix slow tests

* remove hack

* fixup

* small note

* forward contrib credits from PR25789

* forward contrib credits from original implementation and work

* add arthur

* Update src/transformers/models/llava/processing_llava.py

Co-authored-by: Lysandre Debut <[email protected]>

* update docstring

* nit

* move to not doctested because of timeout issues

* fixup

* add description

* more

* fix-copies

* fix docs

* add beam search

* add more comments

* add typehints on processor

* add speedup plot

* update slow tests and docs

* push test

* push batched test

* fix batched generation with different number of images

* remove benchmark due to a bug

* fix test

* fix copies

* add gcolab demo

---------

Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: shauray8 <[email protected]>
Co-authored-by: haotian-liu <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
@Hambaobao commented:
Hello, can you tell me how to use LlavaForConditionalGeneration in transformers to train from scratch? I want to use the weights of Vicuna and the CLIP Vision Transformer for training, just like the original author did.
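For illustration, one way such a composition could look is sketched below. The checkpoint names, the direct attribute assignment, and the tokenizer caveat are assumptions, not an official recipe from this PR.

# Illustrative sketch only: initialize a fresh LlavaForConditionalGeneration from a CLIP
# vision checkpoint and a Vicuna language checkpoint. Checkpoint names are assumptions.
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    CLIPVisionConfig,
    CLIPVisionModel,
    LlavaConfig,
    LlavaForConditionalGeneration,
)

vision_id = "openai/clip-vit-large-patch14-336"
text_id = "lmsys/vicuna-7b-v1.5"

config = LlavaConfig(
    vision_config=CLIPVisionConfig.from_pretrained(vision_id),
    text_config=AutoConfig.from_pretrained(text_id),
)
model = LlavaForConditionalGeneration(config)  # multimodal projector starts randomly initialized

# Swap in the pretrained sub-modules.
model.vision_tower = CLIPVisionModel.from_pretrained(vision_id)
model.language_model = AutoModelForCausalLM.from_pretrained(text_id)

# Note: in practice the tokenizer also needs the extra <image> token, and the language
# model's embeddings must be resized accordingly before training.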
