
Strange behavior with attn_implementation="eager" #35270

Open
pspdada opened this issue Dec 14, 2024 · 17 comments

Comments

@pspdada

pspdada commented Dec 14, 2024

System Info

  • transformers version: 4.47.0
  • Platform: Linux-5.15.0-120-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA A100-PCIE-40GB

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am trying to analyze the attention pattern of the LLaVA v1.5 7B model, so I used attn_implementation="eager" when initializing the model to obtain the attention weights. However, this has led to several issues: firstly, the output IDs are incorrect, and secondly, errors may occur. I've noticed that this problem only appears with specific images and user prompts, while it does not occur in other cases, which is quite peculiar. Below is my code:

import numpy as np
import torch
from dotenv import load_dotenv
from PIL import Image
from transformers import (
    LlavaForConditionalGeneration,
    LlavaProcessor,
)
from transformers.generation.utils import GenerateDecoderOnlyOutput

np.set_printoptions(threshold=np.inf)
model_name = "llava-hf/llava-1.5-7b-hf"

model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    cache_dir="/root/llm/utils/models/hub",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda:0",
    attn_implementation="eager",
)
processor: LlavaProcessor = LlavaProcessor.from_pretrained(
    model_name,
    cache_dir="/root/llm/utils/models/hub",
    padding_side="left",
    patch_size=model.config.vision_config.patch_size,
    vision_feature_select_strategy=model.config.vision_feature_select_strategy,
)

images = [
    Image.open("/root/llm/utils/eval/Object_HalBench/images/339761.jpg"),
    Image.open("/root/llm/utils/eval/Object_HalBench/images/431256.jpg"),
]
users = [
    "Provide a thorough description of the given image.",
    "What is this photo about? Please answer in great detail.",
]

prompts: list[str] = []
for u in users:
    conversation: list[dict[str]] = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": u},
            ],
        },
    ]
    prompt: str = processor.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True,
    )
    prompts.append(prompt)

with torch.inference_mode():
    encoded_inputs: dict[str, torch.Tensor] = processor(
        images=images,
        text=prompts,
        return_tensors="pt",
        return_token_type_ids=False,
        padding=True,
    ).to("cuda:0", torch.float16)

    output: GenerateDecoderOnlyOutput = model.generate(
        **encoded_inputs,
        max_new_tokens=50,
        num_beams=1,
        do_sample=False,
        temperature=0.7,
        output_attentions=True,
        use_cache=True,
        return_legacy_cache=True,
        return_dict_in_generate=True,
    )
generated_ids: torch.LongTensor = output.sequences  # tensor of shape (batch_size, sequence_length)
print(generated_ids.cpu().numpy())
generated_ids = [o[len(i) :] for i, o in zip(encoded_inputs.input_ids, generated_ids)]
print()
decoded_outputs: list[str] = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)
print(decoded_outputs)
decoded_outputs = [d.rstrip("\n").strip(" ") for d in decoded_outputs]
print(decoded_outputs)
print(len(output.attentions))
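
For context, the returned attentions can be inspected per step and per layer (a sketch assuming the usual GenerateDecoderOnlyOutput layout, where output.attentions holds one tuple per generated token, each containing one attention tensor per decoder layer):

# Hypothetical inspection snippet, reusing `output` from above.
# output.attentions[step][layer] has shape (batch, num_heads, query_len, key_len);
# query_len is the full prompt length on the first step and 1 on later steps when the cache is used.
first_step_last_layer = output.attentions[0][-1]
print(first_step_last_layer.shape)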

Notice: the image I used is from Object_HalBench benchmark

The output is:
Some other warning:

/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.

The generated_ids (I removed a large number of <image> tokens for readability):

[[32001     1  3148  1001 29901 29871 32000 32000 32000 32000 32000 32000
  32000 32000 32000 32000 32000 32000 29871    13  1184 29894   680   263
  17826  6139   310   278  2183  1967 29889   319  1799  9047 13566 29901
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]
 [    1  3148  1001 29901 29871 32000 32000 32000 32000 32000 32000 32000
  32000 32000 32000 32000 32000 29871    13  5618   338   445 15373  1048
  29973  3529  1234   297  2107  9493 29889   319  1799  9047 13566 29901
    450  1967  4332  1973   263 15007  3377   261   297  3158 29892 15859
    263  8938   373   263 15007 29899 11911   287   364  1160 29889   450
  15007  3377   261   338   297   278  7256   310   278  9088 29892   411
   1009 15007  3377  7962 19540   963 29889 29871    13    13  8439   526
   3196   916]]

Note that <image> is token id 32000 and <pad> is token id 32001.

The output after batch_decode is:

['', 'The image captures a snowboarder in action, performing a trick on a snow-covered ramp. The snowboarder is in the middle of the scene, with their snowboard visible beneath them. \n\nThere are several other']

It's strange that token id 0 is generated.

Setting only output_attentions=False and return_dict_in_generate=False, without removing attn_implementation="eager", does not change anything.

Note that removing attn_implementation="eager" and not returning a dict overcomes this issue; the output then becomes correct:

[[32001     1  3148  1001 29901 29871 32000 32000 32000 32000 32000 32000
  32000 32000 32000 32000 32000 32000 29871    13  1184 29894   680   263
  17826  6139   310   278  2183  1967 29889   319  1799  9047 13566 29901
    450  1967  5680   263 27683  8345   411   263  2919  7933  8024 15678
    701   278 10090 29892  4969   263   301  1878   322   325  4626   424
  25005 29889   450  8024   338 24046  2978   278  1510   261  4038 29892
   4417   263  6023   310  5469   304   278  2913 29889 29871    13    13
    797   278]
 [    1  3148  1001 29901 29871 32000 32000 32000 32000 32000 32000 32000
  32000 32000 32000 32000 32000 29871    13  5618   338   445 15373  1048
  29973  3529  1234   297  2107  9493 29889   319  1799  9047 13566 29901
    450  1967  4332  1973   263 15007  3377   261   297  3158 29892 15859
    263  8938   373   263 15007 29899 11911   287   364  1160 29889   450
  15007  3377   261   338   297   278  7256   310   278  9088 29892   411
   1009 15007  3377  7962 19540   963 29889 29871    13    13  8439   526
   3196   916]]

['The image features a bathroom with a large green plant growing up the wall, creating a lush and vibrant atmosphere. The plant is situated near the shower area, adding a touch of nature to the space. \n\nIn the',
'The image captures a snowboarder in action, performing a trick on a snow-covered ramp. The snowboarder is in the middle of the scene, with their snowboard visible beneath them. \n\nThere are several other']

Besides this, an error may occur with attn_implementation="eager" in some other cases (different image inputs):

Loading checkpoint shards: 100%|████| 3/3 [00:03<00:00,  1.12s/it]
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/root/llm/LVLM/test2.py", line 38, in <module>
    print(generator.gen(images, users,do_sample=True,
  File "/root/llm/LVLM/model/generator/llava.py", line 184, in gen
    out = gen_hf(
  File "/root/llm/LVLM/model/generator/utils.py", line 279, in gen_hf
    output, encoded_inputs = _gen_hf(
  File "/root/llm/LVLM/model/generator/utils.py", line 229, in _gen_hf
    output: GenerateDecoderOnlyOutput = model.generate(
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
    result = self._sample(
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 3297, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Expected behavior

Generation with attn_implementation="eager" should produce correct output (the same text as the default attention implementation) instead of token id 0 or a CUDA assertion.

@zucchini-nlp
Member

I can reproduce the error with my own images, and even after removing all the legacy code the behavior persists. Interestingly, generating from text only works well with eager attention, so it seems the weird behavior comes from concatenating the image embeddings.

So I tried loading the vision model with eager attention while keeping the text backbone on sdpa, using the following code. The generated text matched very well regardless of whether the text model uses sdpa or eager.

model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda:0",
    attn_implementation={"text_config": "sdpa", "vision_config": "eager"},
)

Also, switching the dtype to bfloat16 brings back the quality, so I think it is related to the way torch handles the matmul operation in eager and sdpa modes, accumulating small numerical precision errors. The vision backbone most probably returns a slightly different embedding on "sdpa", which is the way LLaVA was trained.
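
For example, a minimal sketch of the bfloat16 workaround (same loading call as before, with only the dtype changed):

model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # wider dynamic range than float16, so less prone to overflow
    low_cpu_mem_usage=True,
    device_map="cuda:0",
    attn_implementation="eager",
)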

@pspdada
Author

pspdada commented Dec 18, 2024


Thank you for your reply. Could you please advise on how I should proceed to resolve this issue? Can I simply change the precision to bfloat16 to continue my operations? However, I noticed that LLaVA 1.5 was trained using float16.

@zucchini-nlp
Member

@pspdada Depending on what you are trying to achieve, you are free to use either of the two options I outlined above, and then simply report the weird behavior you observed if this is for some kind of research. I don't see any strong preference for either workaround, but forcing the "vision" model to eager will be more in line with the release in the original LLaVA-VL repo.

SDPA in CLIP-like models was added a long time after the llava models were released, so I think the authors had been running on eager attention in CLIP all along.

@pspdada
Author

pspdada commented Dec 18, 2024


I found that the behavior I observed is somewhat different from what you described. I tried loading the model using the method you suggested:

model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    cache_dir="/root/llm/utils/models/hub",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda:0",
    attn_implementation={"text_config": "sdpa", "vision_config": "eager"},
)

This results in the following warning:

LlamaModel is using LlamaSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.

Additionally, this setup leads to errors (generating ids=0).

When I tried using:

{"text_config": "eager", "vision_config": "eager"}

the warning does not appear, but the generation still results in errors (generating ids=0). Only by changing the dtype to torch.bfloat16 does the generation work correctly.

@pspdada
Author

pspdada commented Dec 18, 2024

@zucchini-nlp You mentioned that the "eager" mode is the one used during the training of LLaVA, but using this mode along with torch.float16 did not yield the correct results for me.

@zucchini-nlp
Member

Oh sorry, I forgot that I used a different branch to run text-only generation and verify the generation quality. You're right, the above doesn't fix it, and it seems to be the same error as in #34824. The precision change works, though; however, I'd recommend waiting until the PR is merged, because otherwise it will not match the original implementation.

this setup leads to errors (generating ids=0).

btw, what type of error do you mean?

@ArthurZucker can we merge the VLMs clean up PR soon to fix the recurring errors? (#34502)

@pspdada
Author

pspdada commented Dec 18, 2024


I'm very sorry for not being clear enough and causing confusion. By "error," I refer to incorrect generation results. In some scenarios (image-text pairs), the generation process completes but returns a generated_id = 0 (an abnormal ID that does not correspond to any token). In other cases (other image-text pairs), the generation process is interrupted with an error:

../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.

From my observations, regardless of whether I set {"text_config": "eager", "vision_config": "eager"} or {"text_config": "sdpa", "vision_config": "eager"}, the phenomena I described above still occur.

The issue is only resolved by changing the data type to torch.bfloat16. (I have not tested whether this resolves the issue for all QA-pair inputs.)

@zucchini-nlp
Member

Let's wait for the linked PR to be merged then, as currently llava does some past-kv manipulations due to incorrect indentation. The CUDA error you see is probably from sampling when the model for some reason outputs NaN in the logits.
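
A quick way to check that hypothesis is to run a single forward pass and inspect the logits before any sampling happens. A sketch, reusing model and encoded_inputs from the reproduction script above:

import torch

with torch.inference_mode():
    out = model(**encoded_inputs)
    last_logits = out.logits[:, -1, :]  # logits that would feed the first sampled token
    print("NaN in logits:", torch.isnan(last_logits).any().item())
    print("Inf in logits:", torch.isinf(last_logits).any().item())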

@hsilva664

I did some testing of this just now (with the same images the OP used) and found that if the two images are run separately, as opposed to in a single batch, the output seems to be correct.
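
A sketch of that per-image workaround, reusing model, processor, images, and prompts from the reproduction script above (each pair is run alone, so no padding token is introduced by batching):

with torch.inference_mode():
    for image, prompt in zip(images, prompts):
        single = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0", torch.float16)
        out = model.generate(**single, max_new_tokens=50, do_sample=False)
        print(processor.batch_decode(out, skip_special_tokens=True)[0])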

@hsilva664

hsilva664 commented Dec 18, 2024

Running the code from pr-34502 did not solve it when testing on my local machine. I got the same output as @pspdada (it skips describing the first image and only describes the second). Attached is the code to reproduce it if desired; it is very similar to his, except for some adaptations so that I could run it locally more easily without any extra setup. The dataset was obtained from this repository and the images correspond to his (the bathroom with a plant is id 339761 and the snowboarder is id 431256), although I think this problem is not specific to these images.

@pspdada
Author

pspdada commented Dec 20, 2024

@zucchini-nlp I found that even when loading the model with torch_dtype=torch.bfloat16, the results of attn_implementation="eager" are incorrect in some cases (this may happen more often when the input prompts are longer). I hope this issue can be fixed.
The example I use:

a = [
    {
        "image_id": "339761.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/339761.jpg",
        "question": "Provide a thorough description of the given image.",
    },
    {
        "image_id": "431256.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/431256.jpg",
        "question": "What is this photo about? Please answer in great detail.",
    },
    {
        "image_id": "501400.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/501400.jpg",
        "question": "Provide a thorough description of the given picture.",
    },
    {
        "image_id": "264619.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/264619.jpg",
        "question": "Explain the narrative or story that the image seems to convey, detailing each part that contributes to it.",
    },
    {
        "image_id": "551791.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/551791.jpg",
        "question": "Please provide a detailed description of the image. Describe the visual elements, colors, shapes, textures, and any objects or people present along with the overall mood or atmosphere portrayed in the image.",
    },
]
images = [Image.open(d["image_path"]) for d in a]
users = [d["question"] for d in a]

The output is:

['The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The',
'The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The',
'The',
'The image shows a group of people enjoying themselves at the beach. They are flying kites, with three kites soaring high in the sky above the water. Some of the people are surfing, while others are simply enjoying the beach atmosphere. The surfers are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are are',
'The image features a cozy living room with a Christmas theme. There are two teddy bears, one on the left side and the other on the right side of the room. A suitcase is placed in the center of the room, adding a sense of travel or adventure to the scene.\n\nA television is mounted on the wall, and a bookshelf filled with various books can be seen in the background. A chair is also present in the room, providing a comfortable se']

@pspdada
Author

pspdada commented Dec 20, 2024

@zucchini-nlp It's strange that even without any additional configuration, directly using the suggested code from https://huggingface.co/docs/transformers/main/en/model_doc/llava#batched-inference also results in issues.
The code is:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    cache_dir="/root/llm/utils/models/hub",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    cache_dir="/root/llm/utils/models/hub",
    padding_side="left",
)

a = [
    {
        "image_id": "339761.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/339761.jpg",
        "question": "Provide a thorough description of the given image.",
    },
    {
        "image_id": "431256.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/431256.jpg",
        "question": "What is this photo about? Please answer in great detail.",
    },
    {
        "image_id": "501400.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/501400.jpg",
        "question": "Provide a thorough description of the given picture.",
    },
    {
        "image_id": "264619.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/264619.jpg",
        "question": "Explain the narrative or story that the image seems to convey, detailing each part that contributes to it.",
    },
    {
        "image_id": "551791.jpg",
        "image_path": "/root/llm/utils/eval/Object_HalBench/images/551791.jpg",
        "question": "Please provide a detailed description of the image. Describe the visual elements, colors, shapes, textures, and any objects or people present along with the overall mood or atmosphere portrayed in the image.",
    },
]
images = [Image.open(d["image_path"]) for d in a]
users = [d["question"] for d in a]

prompts: list[str] = []
for u in users:
    conversation: list[dict[str]] = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": u},
            ],
        },
    ]
    prompt: str = processor.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True,
    )
    prompts.append(prompt)

inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
a = processor.batch_decode(generate_ids, skip_special_tokens=True)
print(a)

The output is:

['USER:  \nProvide a thorough description of the given image. ASSISTANT: The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The', 'USER:  \nWhat is this photo about? Please answer in great detail. ASSISTANT: The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The', 'USER:  \nProvide a thorough description of the given picture. ASSISTANT: The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The', 'USER:  \nExplain the narrative or story that the image seems to convey, detailing each part that contributes to it. ASSISTANT: The image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image image', 'USER:  \nPlease provide a detailed description of the image. Describe the visual elements, colors, shapes, textures, and any objects or people present along with the overall mood or atmosphere portrayed in the image. ASSISTANT: The image features a cozy living room with a Christmas theme. There are two teddy bears, one on the left side and the other on the right side of the room. A suitcase is placed in the center of the room, adding']

The bug persists in versions 4.46.3 and 4.47.0 of the transformers library. However, changing the model to another one, such as Qwen2-VL, yields correct output.

@ArthurZucker
Collaborator

BTW, do you know if this is a regression (was this broken recently or not)?
Otherwise, there are expected differences between eager and sdpa; it could also be attention-mask related!

@hsilva664

I have been looking at this for a while and intend to look more in the coming days. My tests were all with the images from the original message (339761 and 431256 from Object_HalBench) and using the code from pr-34502, linked by @zucchini-nlp. Here is what I gathered so far:

  • It seems the bug is related to the padding token (id 32001). In this example, one sequence is 599 tokens and the other is 600 tokens. The script uses left padding to make both length 600 when they are batched together. When the 599-token sequence is run by itself, it works properly. When it is run together with the 600-token one, the 599-token sequence returns an empty output while the 600-token one works. If the 599-token sequence is run alone but we artificially add the padding token to it, it also returns an empty output instead of the description of the image.
  • The empty output happens because a -inf appears in the penultimate decoder layer of the language model. In particular, it happens in the LlamaMLP operation of the penultimate LlamaDecoderLayer (i.e. the MLP that follows the self-attention). It happens because a multiplication of two numbers produces a result lower than the lowest number representable in float16. Because of that, in the 32nd (i.e. last) self-attention layer, all Q, K, V relative to the padding token become NaN. Since all self-attention outputs from that layer result from inner products and softmax weighting with these NaN K, V, they also become NaN for the other tokens in the sequence. When producing the output, these NaN values make the output logits NaN as well. The output ids are taken with an argmax over an array of NaN, hence the output id being zero, which the LlavaProcessor then interprets as an empty sentence. The NaN K and V are cached to be used in subsequent generation iterations, which means all the output tokens will be contaminated for that sequence.
  • The LlamaSdpaAttention path works. The main difference between it and LlamaAttention (i.e. the eager argument) is that the sdpa version calls torch.nn.functional.scaled_dot_product_attention, whereas eager implements self-attention "by hand". Both implementations seem to be mostly the same, but the torch call in sdpa does not allow returning the softmax attention weights, which I believe is the reason the OP wanted to use the eager version instead, so they could analyze them. Upon closer inspection, the function LlamaModel._update_causal_mask, which creates the causal mask to be passed to the decoder layers, has a segment with the code:
if (
	self.config._attn_implementation == "sdpa"
	and attention_mask is not None
	and attention_mask.device.type == "cuda"
	and not output_attentions
):
	# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
	# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
	# Details: https://github.com/pytorch/pytorch/issues/110213
	min_dtype = torch.finfo(dtype).min
	causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)

This segment takes input token positions where all entries are masked (making them essentially useless, as their results are discarded) and removes the masking, "activating" them again. One such fully masked position is exactly the padding token. If we run sdpa with this part commented out, it fails just like eager. If, for testing, we include eager in this if statement, the incorrect NaN behaviour stops happening.
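
As a minimal, self-contained illustration of why such fully masked rows are problematic in the eager path (this is not the transformers code, just the numerical effect): a softmax over a row that is entirely -inf produces NaN, and anything computed from that row is then NaN as well.

import torch

scores = torch.full((1, 4), float("-inf"))  # a fully masked attention row; same effect in float16
weights = torch.softmax(scores, dim=-1)
print(weights)  # tensor([[nan, nan, nan, nan]])

values = torch.randn(4, 8)
print(weights @ values)  # all NaN: the masked row contaminates the weighted sum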

I think a possible solution would be to simply multiply the V relative to the padding token by zero to avoid these entries contaminating the operations, but just wanted to hear other people's thoughts before coding anything.

@hsilva664

As for the latest @pspdada sample code, where it breaks with the 5 examples: I just ran it (fully on a GPU setup that fits the whole batch), and although it breaks on the main branch, it works correctly in the PR by @zucchini-nlp. The output I get there is:

['The image features a large indoor plant growing up a wall in a bathroom. The plant is situated in the corner of the room, and it appears to be a vine or a large leafy plant. The bathroom is equipped with',
'The image captures a snowboarder in action, performing a trick on a snowy ramp. The snowboarder is in the middle of the scene, with their snowboard visible beneath them. The ramp is located in the center of',
'The image features a dining table with a plate of food, including a sandwich and a side of eggs. The sandwich is cut into triangles, and the eggs are arranged in a triangle shape as well. The plate is placed on the',
'The image captures a lively scene at the beach, where a group of people is enjoying various activities. Some of the individuals are flying kites, with three kites visible in the sky. The kites come in different colors and sizes',
'The image features a cozy living room with a Christmas theme. There are two teddy bears, one on the left side and the other on the right side of the room. A suitcase is placed in the center of the room, adding']

This seems to be correct; in fact, the last two descriptions match the ones posted above. A summary of my tests, using both the two-image example given above and the five-image example, is given below. Where it says main, I used the main branch from @zucchini-nlp's fork, but since the PR was not merged yet and my results were consistent with @pspdada's, I think this should be fine.

Branch     # images   bfloat   Works   Comments
main       2          no       no      original error of this issue
main       2          yes      yes     proposed solution in the comment section of changing to bfloat16
main       5          no       no      issue with 5 examples mentioned by @pspdada
main       5          yes      no      issue with 5 examples mentioned by @pspdada
pr-34502   2          no       no      original error of this issue, not solved by the PR, but perhaps solvable with my suggestion
pr-34502   2          yes      yes     proposed solution in the comment section of changing to bfloat16
pr-34502   5          no       yes     PR solved this issue with 5 examples
pr-34502   5          yes      yes     PR solved this issue with 5 examples

@pspdada
Author

pspdada commented Dec 26, 2024

@hsilva664 Thank you for conducting such a detailed and in-depth study on the question I raised. I apologize for not having enough time and energy to delve deeply into this issue myself.

I would like to know whether your proposed solution and PR-34502 can be used together to address both the two-image and five-image examples mentioned above. Additionally, have more examples been tested to verify the correctness and stability of the model during batch inference in eager mode?

@hsilva664

@pspdada They can be used together. I have created a PR as a candidate to be merged into the PR by @zucchini-nlp; feel free to test the behaviour on the modified code (it is linked above). I have not run extensive testing beyond the cases discussed here.
