Support varied input sequence lengths with a fixed block mask #31
Comments
In general no, the current blessed solution is to call create_block_mask again for each new sequence length.
Thanks for the prompt response! So, do I understand correctly that if we need to perform evaluations on common LM benchmarks, which often contain questions of varying lengths, we need to create the block mask on the fly for each input (ideally with _compile=True to speed up this process)?
Yup, that's the best approach. With _compile=True the cost should be relatively low compared to the actual compute, and this cost gets amortized over all attention calls throughout the model.
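For concreteness, here is a minimal sketch of that per-input approach, assuming a simple causal mask_mod; the helper names (causal_mask, attend) and tensor shapes are illustrative, not taken from this thread:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal_mask(b, h, q_idx, kv_idx):
    # Stand-in mask_mod; swap in your own pattern (e.g., an A-shape mask).
    return q_idx >= kv_idx

def attend(q, k, v):
    # q, k, v: [B, H, seq_len, head_dim]. Build the block mask for this input's
    # actual length instead of reusing one built for a longer, fixed length.
    seq_len = q.shape[-2]
    block_mask = create_block_mask(
        causal_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len,
        device=q.device, _compile=True,  # compiled construction keeps the per-input cost low
    )
    return flex_attention(q, k, v, block_mask=block_mask)
```

Since the same block mask can be reused by every attention layer for that input, the construction cost is paid once per forward pass rather than once per attention call.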
Got it! One last question: I find that sometimes setting _compile=True …
Hmm, this is likely a dynamic shapes thing, @Chillee.
Thanks for the great repo!
When using a custom-defined attention mask pattern (e.g., the A-shape mask in this work), I noticed that when the input length (e.g., 512) is shorter than the length of the predefined block mask (e.g., 1024) built from mask_mod, the generation results may not be correct, even though the attention pattern of the former is a truncated version of the latter. Therefore, I wonder whether FlexAttention generally supports varying input sequence lengths under a fixed block mask, and how it handles this situation.
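For clarity, here is a minimal sketch of the setup being asked about. The a_shape_mask below is a placeholder (a causal mask with 64 sink tokens and a 512-token window, both made up for illustration); only the 1024/512 lengths come from the example above, and the behavior noted in the comments is the one reported in this thread:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def a_shape_mask(b, h, q_idx, kv_idx):
    # Placeholder for an A-shape pattern: causal, always attend to the first
    # 64 "sink" tokens, otherwise restrict to a sliding window of 512.
    return (q_idx >= kv_idx) & ((kv_idx < 64) | (q_idx - kv_idx < 512))

# Block mask prebuilt once for a 1024-token sequence...
fixed_mask = create_block_mask(a_shape_mask, B=None, H=None, Q_LEN=1024, KV_LEN=1024)

# ...then applied to a shorter, 512-token input. Per the discussion above, this
# can produce incorrect results; the recommended approach is to rebuild the
# block mask for each input length (with _compile=True to keep it cheap).
q = k = v = torch.randn(1, 8, 512, 64, device="cuda")
out = flex_attention(q, k, v, block_mask=fixed_mask)
```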