
_prepare_4d_attention_mask_for_sdpa is not for causal attention but claims... #30095

Closed
minostauros opened this issue Apr 7, 2024 · 5 comments · Fixed by #30138

Comments


minostauros commented Apr 7, 2024

... SDPA causal mask generation may be wrong for the mask generation.

if torch.all(mask == 1):
    if is_tracing:
        pass
    elif tgt_len == 1:
        # For query_length == 1, causal attention and bi-directional attention are the same.
        return None
    elif key_value_length == tgt_len:
        return None
    else:
        # Unfortunately, for query_length > 1 and key_value_length != query_length, we can not generally ignore
        # the attention mask, as SDPA causal mask generation may be wrong. We will set is_causal=False in SDPA
        # and rely on Transformers attention_mask instead, hence not setting it to None here.
        # Reference: https://github.com/pytorch/pytorch/issues/108108
        return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)

Will it be safe to just return None for the else: case?

For causal attention, we can just use _prepare_4d_causal_attention_mask_for_sdpa instead.
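
As a minimal sanity check (illustrative only, not library code), for bi-directional attention an all-ones padding mask expands to an all-zeros additive mask, so SDPA with attn_mask=None gives the same result, even when query_length != key_value_length:

import torch
import torch.nn.functional as F

batch, heads, tgt_len, kv_len, dim = 2, 4, 3, 7, 16  # query_length != key_value_length
q = torch.randn(batch, heads, tgt_len, dim)
k = torch.randn(batch, heads, kv_len, dim)
v = torch.randn(batch, heads, kv_len, dim)

# An all-ones 2D padding mask expands to an all-zeros additive 4D mask, i.e. nothing is masked.
additive_mask = torch.zeros(batch, 1, tgt_len, kv_len)

out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=additive_mask, is_causal=False)
out_unmasked = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)
torch.testing.assert_close(out_masked, out_unmasked)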

Related issues:
pytorch/pytorch#108108
Dao-AILab/flash-attention@9e5e8bc
#28802

amyeroberts (Collaborator)

cc @fxmarty

fxmarty (Contributor) commented Apr 9, 2024

Hi @minostauros, thank you for the report.

... SDPA causal mask generation may be wrong for the mask generation.

_prepare_4d_attention_mask_for_sdpa does not handle causal masks. However,

Will it be safe to just return None for the else: case?

Yes, good catch, I'll fix that! This is a somewhat unlikely case though, where one would use past key values with what are typically encoder-type models. How did you run into this case?
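
For reference, here is a minimal sketch (assumed signature and simplified tracing check; not the actual patch merged in #30138) of how the helper could look once the mask is dropped, given that this function only serves bi-directional attention:

import torch

def _prepare_4d_attention_mask_for_sdpa_sketch(mask, dtype, tgt_len=None):
    # mask: [batch_size, key_value_length] padding mask with 1 = attend, 0 = masked.
    tgt_len = tgt_len if tgt_len is not None else mask.shape[-1]

    # Hypothetical simplification: outside of tracing, an all-ones mask masks nothing,
    # and attention here is bi-directional, so SDPA can simply run with attn_mask=None.
    if not torch.jit.is_tracing() and torch.all(mask == 1):
        return None

    # Otherwise expand the 2D padding mask to the additive 4D form SDPA expects.
    expanded = mask[:, None, None, :].expand(mask.shape[0], 1, tgt_len, mask.shape[-1]).to(dtype)
    return (1.0 - expanded) * torch.finfo(dtype).min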

minostauros (Author)

This is a somewhat unlikely case though, where one would use past key values with what are typically encoder-type models. How did you run into this case?

I didn't run into that specific case; I was reviewing #28802 and trying to add flash-attention-2 to BERT (the BLIP-2 variant of BERT, to be exact).
Thanks for the confirmation!


github-actions bot commented May 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

minostauros (Author)

This issue will be closed by #30138

huggingface deleted a comment from github-actions bot on Jun 3, 2024