Enable Attention Mask for Training #1516
Comments
Hi @Sanger2000, that's a good point. Just so you know, we are upstreaming SDPA support directly into Transformers, where it is used by default, and you can already use it for a few models (see https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention), with good performance during training (see the benchmark at huggingface/transformers#28005). I won't be putting much effort into BetterTransformer (when it is only about using SDPA, not e.g. nested tensors), but rather into extending SDPA support to more models in Transformers.
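For context, a minimal sketch of the native SDPA path in Transformers mentioned above, assuming a transformers release that exposes the `attn_implementation` argument (roughly 4.36+ with PyTorch 2.1.1+); the checkpoint name is only a placeholder:

```python
# Minimal sketch: load a model with the native SDPA attention implementation in
# Transformers (assumes transformers >= 4.36 and a recent PyTorch; checkpoint is a placeholder).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # placeholder; any SDPA-supported model works
    torch_dtype=torch.float16,     # fp16/bf16 is what the FlashAttention kernel requires
    attn_implementation="sdpa",    # route attention through torch scaled_dot_product_attention
)
```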
The attention mask is only supported for memory-efficient attention, not FlashAttention. But even without the mask, FlashAttention won't be used on CUDA anyway because it requires fp16 or bf16 dtypes. You can test which kernel is being used by enabling only one kernel at a time with the relevant PyTorch functions.

@fxmarty Your point makes sense, but seeing as not all models are currently supported, and since this library is the recommended solution for models that aren't yet supported in the main Transformers library, it would help to allow the masks, at least when running on PyTorch v2.1+. Otherwise, would you be willing to accept a pull request that addresses this?
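(On the kernel-selection point above: a minimal sketch of forcing one SDPA backend at a time, using the PyTorch 2.0/2.1-era `torch.backends.cuda.sdp_kernel` context manager; newer PyTorch releases expose `torch.nn.attention.sdpa_kernel` instead, and FlashAttention availability also depends on the GPU.)

```python
# Sketch: test kernel availability by enabling a single SDPA backend at a time.
import torch
import torch.nn.functional as F

def try_flash_only(dtype):
    q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=dtype)
    # Only the FlashAttention backend is enabled, so the call fails if it cannot be used.
    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
        try:
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
            print(dtype, "-> FlashAttention kernel usable")
        except RuntimeError as err:
            print(dtype, "-> FlashAttention not usable:", err)

try_flash_only(torch.float16)  # expected to work on supported GPUs
try_flash_only(torch.float32)  # expected to fail: FlashAttention needs fp16/bf16 on CUDA
```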
Yes!
Right, happy to review PRs. In the future, I think PRs such as huggingface/transformers#28802 are the way to go (although they take ages to be merged).
Feature request
It appears that originally, attention masks were ignored for training because passing them triggered the slow path in PyTorch's scaled dot product attention.
I am not fully confident, but I believe that custom attention masks are now supported with memory-efficient attention, as per pytorch/pytorch#104310.
It would be good to enable custom attention masks in BetterTransformer training.
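A rough way to check that claim on PyTorch 2.1+ with CUDA: restrict SDPA to the memory-efficient backend, pass an additive mask, and compare against the reference math backend (again using the 2.0/2.1-era `sdp_kernel` context manager).

```python
# Sketch: verify that the memory-efficient SDPA backend accepts a custom attn_mask
# (PyTorch 2.1+, CUDA) by comparing its output against the reference math backend.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q, k, v = (torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Additive mask: 0.0 where attention is allowed, -inf where it is blocked.
mask = torch.zeros(1, 1, 64, 64, device="cuda", dtype=torch.float16)
mask[..., 32:] = float("-inf")  # block the second half of the keys for every query

with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    out_mem = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_ref = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(torch.allclose(out_mem, out_ref, atol=1e-3, rtol=1e-3))
```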
Motivation
I want to pass in a custom attention mask (for example, to fit multiple examples into a single sequence while only letting tokens attend to others within the same example).
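Concretely, the mask involved here is block-diagonal; a hedged sketch of building one (the helper name is illustrative, not an existing API):

```python
# Illustrative sketch: build a block-diagonal boolean mask so that tokens from packed
# examples only attend within their own example (True = attention allowed).
import torch

def packing_attention_mask(example_lengths, device="cuda"):
    """Return a (seq_len, seq_len) boolean mask for a packed sequence."""
    ids = torch.repeat_interleave(
        torch.arange(len(example_lengths), device=device),
        torch.tensor(example_lengths, device=device),
    )
    # Position i may attend to position j only if both belong to the same packed example.
    return ids.unsqueeze(0) == ids.unsqueeze(1)

mask = packing_attention_mask([5, 3, 4])    # three examples packed into 12 tokens
mask = mask.unsqueeze(0).unsqueeze(0)       # broadcast to (batch, heads, seq, seq)
```

For causal language modeling one would additionally AND this with a lower-triangular causal mask before passing it as `attn_mask` to `scaled_dot_product_attention`.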
Your contribution
It could be as straightforward as just removing the lines that cause the attention mask to be ignored during training, in all implementations. I would be happy to do this. Perhaps it is also worth warning the user that memory-efficient attention will be used instead of FlashAttention.
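As a purely hypothetical sketch (not Optimum's actual code), the change plus the proposed warning could look roughly like this:

```python
# Hypothetical sketch only -- not Optimum's actual implementation. Keep the mask during
# training instead of discarding it, and warn about the kernel that will be selected.
import logging
import torch.nn.functional as F

logger = logging.getLogger(__name__)

def sdpa_forward(query, key, value, attention_mask=None, training=False):
    if training and attention_mask is not None:
        logger.warning(
            "An attention mask was passed during training: PyTorch SDPA will use "
            "memory-efficient attention instead of FlashAttention for this call."
        )
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
```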