Replies: 3 comments 3 replies
-
First of all, the text embeddings don't interfere with the image ones the way they would in cross-attention. They are processed separately, as introduced in the MMDiT architecture of SD3, so the text embeddings are passed through self-attention layers. The text embeddings come as a concatenation of sub-embeddings from three different text encoders, so I would assume implementing a masking scheme would not be that straightforward. The reference implementation doesn't mask, either.
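To illustrate why it's fiddly, here's a minimal sketch (plain PyTorch, not diffusers code; the lengths, variable names, and concatenation order are assumptions for illustration): the padding positions would have to be tracked per encoder, concatenated in the same order as the fused text embedding, and then extended over the image tokens, which are never padded.

```python
import torch

# Hypothetical sketch of assembling a joint key mask for MMDiT-style
# attention. Assumed lengths: a 77-token CLIP sequence (CLIP-L/G are
# fused channel-wise, so they share padding positions) and a 256-token
# T5 sequence.
batch, clip_len, t5_len, image_len = 2, 77, 256, 1024

# Per-tokenizer padding masks (True = real token, False = padding).
clip_mask = torch.ones(batch, clip_len, dtype=torch.bool)
t5_mask = torch.ones(batch, t5_len, dtype=torch.bool)
t5_mask[0, 100:] = False  # pretend the first prompt is shorter

# The text mask must follow the same concatenation order as the
# text embeddings themselves.
text_mask = torch.cat([clip_mask, t5_mask], dim=1)

# Image latents are never padded, so they are always valid keys.
image_mask = torch.ones(batch, image_len, dtype=torch.bool)

# Joint mask over the concatenated sequence, in whichever order the
# processor concatenates text and image tokens (text-first shown here).
joint_mask = torch.cat([text_mask, image_mask], dim=1)

# Broadcastable to (batch, heads, query_len, key_len) for SDPA.
attn_mask = joint_mask[:, None, None, :]
```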
-
Thanks @sayakpaul. According to the implementation, the text and image tokens attend to each other in attention, as they are concatenated along the sequence dimension before being fed to the attention computation. In this sense, the padding tokens (if there are any) of the text sequences will take part in the attention calculation (resulting in non-zero attention scores). This is odd, as different padding lengths would then produce different outputs:

diffusers/src/diffusers/models/attention_processor.py
Lines 1020 to 1030 in e7b9a07
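To make the concern concrete, here is a toy demonstration (plain PyTorch, not the diffusers processor) that unmasked padding keys change the attention output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 4, 8)  # (batch, heads, queries, head_dim)
k = torch.randn(1, 1, 6, 8)  # 6 keys; pretend the last 2 are padding
v = torch.randn(1, 1, 6, 8)

# Without a mask, the softmax assigns non-zero weight to every key,
# including the padding positions.
unmasked = F.scaled_dot_product_attention(q, k, v)

# Boolean attn_mask: True = attend, False = ignore.
mask = torch.tensor([True] * 4 + [False] * 2).view(1, 1, 1, 6)
masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(torch.allclose(unmasked, masked))  # False: padding changed the result
```

Whether those last two keys are padding or real tokens changes every query's output, which is why a padded batch would not reproduce the unpadded single-prompt result.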
-
Thanks for confirming this. It would be good to create an issue to get help from the community and track progress (sorry, I may not have the bandwidth to implement the fix myself).
-
In the SD3 attention implementation below, why are attention masks not used to mask out text padding tokens in the self-attention computation? I'm asking because text prompts in a batch may have different lengths; we have to pad the shorter sequences to the maximum length, but we would want to mask out the padding tokens in the self-attention computation so that they don't affect the model's outputs.
diffusers/src/diffusers/models/attention_processor.py
Line 1135 in 298ce67
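For reference, this is the behavior the question assumes (a sketch with a hypothetical `pad_and_mask` helper, not the current diffusers code path): shorter prompts are padded to a common length, and the padded positions are then excluded from the key set.

```python
import torch
import torch.nn.functional as F

def pad_and_mask(seqs):
    """Hypothetical helper: pad variable-length (len_i, dim) token
    embeddings to a common length and build a key padding mask."""
    max_len = max(s.shape[0] for s in seqs)
    padded = torch.stack(
        [F.pad(s, (0, 0, 0, max_len - s.shape[0])) for s in seqs]
    )
    mask = torch.stack(
        [torch.arange(max_len) < s.shape[0] for s in seqs]
    )
    return padded, mask  # (batch, max_len, dim), (batch, max_len)

# Two prompts of different lengths in one batch.
seqs = [torch.randn(5, 8), torch.randn(3, 8)]
tokens, key_mask = pad_and_mask(seqs)

# True = attend, False = ignore; broadcast over heads and queries so
# no query attends to another prompt's padding positions.
attn_mask = key_mask[:, None, None, :]
q = k = v = tokens[:, None]  # insert a singleton head dimension
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```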