
What is the expected gpu memory performance drop wrt flash attention with block masks? #54

Open
arilato opened this issue Oct 19, 2024 · 2 comments

Comments

arilato commented Oct 19, 2024

I'm testing out flex attention to use some custom attention masks. The masks I'm working with are causal, except there is usually a relatively small rectangular region of zeros in the mask, which I added support for with a block mask.

Previously, using xformers (which maps to FlashAttention 2), I managed to train mistral-large at 28k sequence length with just tensor parallelism on 8xH100. However, with flex attention, even 16k sequence length runs OOM. Is this expected?

I am compiling flex_attention after importing it, and also compiling the block mask when instantiating it.
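For reference, a minimal sketch of this kind of setup (causal mask with a single rectangular zero region); the zero-region boundaries, shapes, and dtype below are placeholders, not the actual values:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Compile the flex_attention kernel itself.
flex_attention = torch.compile(flex_attention)

S = 16 * 1024  # placeholder sequence length

# Placeholder coordinates for the rectangular region of zeros in the mask.
ZERO_Q_START, ZERO_Q_END = 4096, 5120
ZERO_KV_START, ZERO_KV_END = 1024, 2048

def causal_with_hole(b, h, q_idx, kv_idx):
    # Standard causal mask ...
    causal = q_idx >= kv_idx
    # ... minus a single rectangular region that is masked out.
    in_hole = (
        (q_idx >= ZERO_Q_START) & (q_idx < ZERO_Q_END)
        & (kv_idx >= ZERO_KV_START) & (kv_idx < ZERO_KV_END)
    )
    return causal & ~in_hole

block_mask = create_block_mask(
    causal_with_hole, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda"
)

q = k = v = torch.randn(1, 8, S, 128, device="cuda", dtype=torch.bfloat16)
out = flex_attention(q, k, v, block_mask=block_mask)
```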

Chillee (Contributor) commented Oct 21, 2024

@arilato I would not expect that. The extra memory overhead from FlexAttention should be on the order of S^2 / BLOCK_SIZE^2 entries. With the default BLOCK_SIZE of 128, at 28k sequence length the extra memory overhead should be around 100 KB or so.
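Rough arithmetic behind that estimate (the exact bytes per block entry depend on the block-mask metadata layout, so treat this as an order-of-magnitude check):

```python
S = 28 * 1024           # ~28k sequence length
BLOCK_SIZE = 128        # FlexAttention default
num_block_entries = (S // BLOCK_SIZE) ** 2   # S^2 / BLOCK_SIZE^2 = 50_176
approx_kb = num_block_entries * 4 / 1024     # a few bytes of metadata per block
print(approx_kb)  # ~196 KB at 4 bytes/entry, ~50 KB at 1 byte/entry -- the ~100 KB ballpark
```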

drisspg (Contributor) commented Oct 21, 2024

@arilato you also need to ensure that you are compiling the create_block_mask function, since if it is called without compile it will realize the full 28k x 28k mask tensor.
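A minimal sketch of that suggestion (here `mask_mod` and `S` stand in for your mask function and sequence length):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

# Compiling create_block_mask keeps the mask evaluation blockwise, so the dense
# S x S boolean mask (28k x 28k ~ 800M entries) is never materialized.
create_block_mask = torch.compile(create_block_mask)

block_mask = create_block_mask(mask_mod, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
```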
