Replies: 3 comments 3 replies
-
First of all, the text embeddings don't interfere with the image ones the way they would in cross-attention. They are processed separately, as introduced in the MMDiT architecture of SD3, so the text embeddings are passed through self-attention layers. The text embeddings come as a concatenation of sub-embeddings from three different text encoders, so I would assume implementing a masking scheme would not be that straightforward. The reference implementation doesn't mask, either.
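To illustrate why it's fiddly, here's a minimal sketch (plain PyTorch, not diffusers code; the lengths, variable names, and concatenation order are assumptions for illustration): the padding positions would have to be tracked per encoder, concatenated in the same order as the fused text embedding, and then extended over the image tokens, which are never padded.

```python
import torch

# Hypothetical sketch of assembling a joint key mask for MMDiT-style
# attention. Assumed lengths: a 77-token CLIP sequence (CLIP-L/G are
# fused channel-wise, so they share padding positions) and a 256-token
# T5 sequence.
batch, clip_len, t5_len, image_len = 2, 77, 256, 1024

# Per-tokenizer padding masks (True = real token, False = padding).
clip_mask = torch.ones(batch, clip_len, dtype=torch.bool)
t5_mask = torch.ones(batch, t5_len, dtype=torch.bool)
t5_mask[0, 100:] = False  # pretend the first prompt is shorter

# The text mask must follow the same concatenation order as the
# text embeddings themselves.
text_mask = torch.cat([clip_mask, t5_mask], dim=1)

# Image latents are never padded, so they are always valid keys.
image_mask = torch.ones(batch, image_len, dtype=torch.bool)

# Joint mask over the concatenated sequence, in whichever order the
# processor concatenates text and image tokens (text-first shown here).
joint_mask = torch.cat([text_mask, image_mask], dim=1)

# Broadcastable to (batch, heads, query_len, key_len) for SDPA.
attn_mask = joint_mask[:, None, None, :]
```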
-
Thanks @sayakpaul. According to the implementation, the text and image tokens attend to each other in attention, as they are concatenated along the sequence dimension before being fed to the attention computation. In this sense, the padding tokens (if there are any) of the text sequences will take part in the attention calculation (resulting in non-zero attention scores). This is odd, as different padding lengths would then produce different outputs:

diffusers/src/diffusers/models/attention_processor.py
Lines 1020 to 1030 in e7b9a07
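To make the concern concrete, here is a toy demonstration (plain PyTorch, not the diffusers processor) that unmasked padding keys change the attention output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 4, 8)  # (batch, heads, queries, head_dim)
k = torch.randn(1, 1, 6, 8)  # 6 keys; pretend the last 2 are padding
v = torch.randn(1, 1, 6, 8)

# Without a mask, the softmax assigns non-zero weight to every key,
# including the padding positions.
unmasked = F.scaled_dot_product_attention(q, k, v)

# Boolean attn_mask: True = attend, False = ignore.
mask = torch.tensor([True] * 4 + [False] * 2).view(1, 1, 1, 6)
masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(torch.allclose(unmasked, masked))  # False: padding changed the result
```

Whether those last two keys are padding or real tokens changes every query's output, which is why a padded batch would not reproduce the unpadded single-prompt result.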
-
Thanks for confirming this. It would be good to create an issue to get help from the community and track progress (sorry, I may not have the bandwidth to implement the fix myself).
-
In the SD3 attention implementation below, why are attention masks not used to mask out text padding tokens in the self-attention computation? I'm asking because text prompts in a batch may have different lengths; we have to pad the shorter sequences to the maximum length, but we would want to mask out the padding tokens in the self-attention computation so that they don't affect the model's outputs.
diffusers/src/diffusers/models/attention_processor.py
Line 1135 in 298ce67
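For reference, this is the behavior the question assumes (a sketch with a hypothetical `pad_and_mask` helper, not the current diffusers code path): shorter prompts are padded to a common length, and the padded positions are then excluded from the key set.

```python
import torch
import torch.nn.functional as F

def pad_and_mask(seqs):
    """Hypothetical helper: pad variable-length (len_i, dim) token
    embeddings to a common length and build a key padding mask."""
    max_len = max(s.shape[0] for s in seqs)
    padded = torch.stack(
        [F.pad(s, (0, 0, 0, max_len - s.shape[0])) for s in seqs]
    )
    mask = torch.stack(
        [torch.arange(max_len) < s.shape[0] for s in seqs]
    )
    return padded, mask  # (batch, max_len, dim), (batch, max_len)

# Two prompts of different lengths in one batch.
seqs = [torch.randn(5, 8), torch.randn(3, 8)]
tokens, key_mask = pad_and_mask(seqs)

# True = attend, False = ignore; broadcast over heads and queries so
# no query attends to another prompt's padding positions.
attn_mask = key_mask[:, None, None, :]
q = k = v = tokens[:, None]  # insert a singleton head dimension
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```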