I'm evaluating llava-1.5-7b-hf on MM-Vet using batch generation with `use_cache=True`. However, it occasionally raises `RuntimeError: CUDA error: device-side assert triggered` when computing `extended_attention_mask`. The error occurs at random points during the evaluation: sometimes in the third batch, sometimes in the last batch, and so on.
I printed some shapes in the `model.forward()` method, and I think `extended_attention_mask` is computed incorrectly. Here is the relevant part of `forward()` with my debug prints:
```python
def forward(
    self,
    input_ids: torch.LongTensor = None,
    pixel_values: torch.FloatTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    vision_feature_layer: Optional[int] = None,
    vision_feature_select_strategy: Optional[str] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    vision_feature_layer = (
        vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
    )
    vision_feature_select_strategy = (
        vision_feature_select_strategy
        if vision_feature_select_strategy is not None
        else self.config.vision_feature_select_strategy
    )

    if inputs_embeds is None:
        # 1. Extra the input embeddings
        inputs_embeds = self.get_input_embeddings()(input_ids)

        # 2. Merge text and images
        if pixel_values is not None and input_ids.shape[1] != 1:
            image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
            # this is not memory efficient at all (output_hidden_states=True) will save all the hidden stated.
            selected_image_feature = image_outputs.hidden_states[vision_feature_layer]

            if vision_feature_select_strategy == "default":
                selected_image_feature = selected_image_feature[:, 1:]
            elif vision_feature_select_strategy == "full":
                selected_image_feature = selected_image_feature
            else:
                raise ValueError(
                    f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
                )

            image_features = self.multi_modal_projector(selected_image_feature)
            inputs_embeds, attention_mask, position_ids = self._merge_input_ids_with_image_features(
                image_features, inputs_embeds, input_ids, attention_mask, position_ids
            )
            if labels is None:
                labels = torch.full_like(attention_mask, self.config.ignore_index).to(torch.long)
        else:
            # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
            # generation with cache
            if past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
                # Retrieve the first layer to inspect the logits and mask out the hidden states
                # that are set to 0
                first_layer_past_key_value = past_key_values[0][0][:, 0, :, 0]
                batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)

                # Get the target length
                target_seqlen = first_layer_past_key_value.shape[-1] + 1

                extended_attention_mask = torch.ones(
                    (attention_mask.shape[0], target_seqlen - attention_mask.shape[1]),
                    dtype=attention_mask.dtype,
                    device=attention_mask.device,
                )

                # Zero-out the places where we don't need to attend
                print(extended_attention_mask.shape)  # torch.Size([16, 575])
                print(len(past_key_values))           # 32
                print(len(past_key_values[0]))        # 2
                print(past_key_values[0][0].shape)    # torch.Size([16, 32, 688, 128])
                print(attention_mask.shape)           # torch.Size([16, 114])
                print(batch_index)                    # tensor([2], device='cuda:0')
                print(non_attended_tokens)            # tensor([687], device='cuda:0')
                try:
                    extended_attention_mask[batch_index, non_attended_tokens] = 0
                except:
                    pdb.set_trace()

                attention_mask = torch.cat((attention_mask, extended_attention_mask), dim=1)
                position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1

                #### Following code is ignored
```
Apparently, `extended_attention_mask` has a constant sequence length of 575 (`target_seqlen - attention_mask.shape[1]`), which I think is roughly the number of image tokens, while the indices in `non_attended_tokens` refer to positions in the full cache (length 688 in the printout above) and can therefore exceed 575. In the failing batch above, index 687 is written into a mask that is only 575 wide, which is an out-of-bounds access and surfaces as the device-side assert. Maybe `extended_attention_mask` should just have length `target_seqlen` and not need to be concatenated with `attention_mask`? Honestly, I don't fully understand this code; it seems really strange to me.
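To make the failure mode concrete, here is a minimal, self-contained sketch (my own illustration, not code from transformers) that reuses the shapes from the debug output above. The filtering at the end is just one possible workaround to avoid the out-of-bounds write, not necessarily the right masking semantics or the way the library should fix it:

```python
import torch

# Shapes taken from the debug output above: batch size 16, cached sequence
# length 688, current attention_mask length 114, so target_seqlen = 688 + 1.
batch_size, cache_len, mask_len = 16, 688, 114
target_seqlen = cache_len + 1

attention_mask = torch.ones(batch_size, mask_len, dtype=torch.long)
extended_attention_mask = torch.ones(batch_size, target_seqlen - mask_len, dtype=torch.long)
print(extended_attention_mask.shape)  # torch.Size([16, 575])

# torch.where() in forward() reported a non-attended cache position 687 for
# sample 2, exactly the tensors printed above.
batch_index = torch.tensor([2])
non_attended_tokens = torch.tensor([687])

# This is what forward() does next: 687 >= 575, so the write is out of bounds.
# On CPU this raises IndexError; on CUDA it surfaces asynchronously as
# "device-side assert triggered".
try:
    extended_attention_mask[batch_index, non_attended_tokens] = 0
except IndexError as err:
    print("out-of-bounds write:", err)

# One possible workaround (a sketch only): drop the indices that do not fall
# inside the extended mask before zeroing, then concatenate as before.
valid = non_attended_tokens < extended_attention_mask.shape[-1]
extended_attention_mask[batch_index[valid], non_attended_tokens[valid]] = 0
attention_mask = torch.cat((attention_mask, extended_attention_mask), dim=1)
print(attention_mask.shape)  # torch.Size([16, 689])
```

This crashes in exactly the situation observed: whenever a padded (non-attended) cache position has an index larger than the width of `extended_attention_mask`.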
Expected behavior
Batched generation should always work correctly when using the cache.
System Info

transformers version: 4.36.2

Who can help?

@younesbelkad
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I'm evaluating llava-1.5-7b-hf on MM-Vet using batch generation with `use_cache=True`; see above for the error, my debug prints, and the relevant `forward()` excerpt.
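The exact evaluation script is not reproduced above; batched generation with llava-1.5-7b-hf and `use_cache=True` is set up roughly like this (the image paths, prompts, padding choice, and generation parameters below are illustrative placeholders, not the exact ones used):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
# Left padding so that prompts of different lengths can be batched for generation.
processor.tokenizer.padding_side = "left"
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda:0")

# Placeholder batch; in the real evaluation the prompts and images come from MM-Vet.
images = [Image.open("sample1.jpg"), Image.open("sample2.jpg")]
prompts = [
    "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "USER: <image>\nDescribe the scene in detail. ASSISTANT:",
]

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True).to("cuda:0", torch.float16)

# Batched generation with the KV cache enabled, as in the report.
output_ids = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(processor.batch_decode(output_ids, skip_special_tokens=True))
```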