Feature request

In the recent GroundingDino PR (#26087), one thing that was missing was cross-attention masking when running multi-head attention between text and image features. The assumption was that, for inference, the text would typically be a fixed set of labels, so cross-attention masking would not be needed. However, for a more robust and training-friendly implementation, we should support a text cross-attention mask.
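For illustration, here is a minimal sketch of the idea, not GroundingDino's actual internals: the function name, tensor shapes, and mask convention below are assumptions for this example. Padded text positions get a large negative bias before the softmax, so image queries cannot attend to them.

```python
# Sketch only: hypothetical shapes and names, not the real GroundingDino code.
import torch
import torch.nn.functional as F

def cross_attention_with_text_mask(image_queries, text_keys, text_values, text_attention_mask):
    # image_queries:       (batch, num_image_tokens, dim)
    # text_keys / values:  (batch, num_text_tokens, dim)
    # text_attention_mask: (batch, num_text_tokens), 1 for real tokens, 0 for padding
    dim = image_queries.shape[-1]
    scores = torch.matmul(image_queries, text_keys.transpose(-1, -2)) / dim**0.5
    # Broadcast the mask over the query dimension and push padded positions to -inf,
    # so they receive zero attention weight after the softmax.
    keep = text_attention_mask[:, None, :].to(torch.bool)
    scores = scores.masked_fill(~keep, torch.finfo(scores.dtype).min)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, text_values)
```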
Motivation
This was an actual request in the model's PR from people using the transformers implementation in their research; see #26087 (comment).
Your contribution
A PR should be opened soon 😄