Weird Attention module #229
pfeatherstone started this conversation in General
Replies: 2 comments · 1 reply
-
I'm not sure, but I don't think BatchNorm is equivalent to LayerNorm when using a channel-first layout (a normal attention module would be channel-last). A sketch of the difference follows below.
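To make the point concrete, here is a minimal, illustrative sketch (not code from this repository) of what the two normalizations actually compute: BatchNorm2d on a channel-first feature map normalizes each channel over the batch and spatial dimensions, while the LayerNorm used in a typical channel-last attention block normalizes over the embedding dimension per token. The shapes below are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

# Illustrative only: compares what BatchNorm2d and LayerNorm normalize over.
# Shapes are assumptions for the sake of the example, not values from the YOLOv10 code.
N, C, H, W = 2, 32, 8, 8
x = torch.randn(N, C, H, W)              # channel-first feature map

bn = nn.BatchNorm2d(C)                   # normalizes each channel over (N, H, W)
ln = nn.LayerNorm(C)                     # normalizes over the channel dim, per token

y_bn = bn(x)                             # statistics shared across batch and spatial dims

# A channel-last attention block would flatten to tokens first:
tokens = x.flatten(2).transpose(1, 2)    # (N, H*W, C)
y_ln = ln(tokens)                        # statistics computed per token, over C only

# These are different operations, so swapping one for the other changes both
# training dynamics and inference behaviour (BatchNorm keeps running statistics,
# LayerNorm does not).
print(y_bn.shape, y_ln.shape)            # torch.Size([2, 32, 8, 8]) torch.Size([2, 64, 32])
```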
-
Also, that module could use flash attention (see the sketch below).
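A minimal sketch of what I mean, assuming query/key/value tensors already shaped as (batch, heads, tokens, head_dim): PyTorch 2.x ships torch.nn.functional.scaled_dot_product_attention, which dispatches to a flash-attention kernel when the dtype and hardware allow it. This is not the repository's implementation, just the general pattern.

```python
import torch
import torch.nn.functional as F

# Minimal sketch, not the repository's code: the hand-written
# softmax(q @ k^T / sqrt(d)) @ v could be replaced by
# F.scaled_dot_product_attention (PyTorch >= 2.0), which uses a
# flash-attention kernel when dtype/hardware permit.
B, num_heads, seq_len, head_dim = 2, 4, 64, 32   # assumed shapes for illustration

q = torch.randn(B, num_heads, seq_len, head_dim)
k = torch.randn(B, num_heads, seq_len, head_dim)
v = torch.randn(B, num_heads, seq_len, head_dim)

# Fused attention; scaling by 1/sqrt(head_dim) is applied internally.
out = F.scaled_dot_product_attention(q, k, v)

# Equivalent explicit computation, for reference:
attn = (q @ k.transpose(-2, -1)) / head_dim**0.5
ref = attn.softmax(dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))       # expected: True, up to numerical tolerance
```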
-
Your PSA class uses an Attention block with the following layers:
yolov10/ultralytics/nn/modules/block.py, lines 781 to 783 at ea93d4f
These conv blocks all use batch normalization, which is unusual. I've never seen that inside an attention module. You would normally see LayerNorm at the end of the attention module, not batch norm after every projection.
Can you explain?
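For concreteness, here is a rough sketch of the pattern being asked about: each projection is a convolution bundled with BatchNorm2d on channel-first feature maps, next to the LayerNorm/Linear layout you would see in a textbook attention block. The ConvBN class, the channel widths, and the qkv/proj/pe names below are illustrative assumptions based on the description above, not a copy of block.py.

```python
import torch.nn as nn

# Illustrative sketch of the pattern in question; layer names and widths are
# assumptions, not a copy of ultralytics/nn/modules/block.py.
class ConvBN(nn.Module):
    """Convolution followed by BatchNorm2d, as in a typical Conv block."""
    def __init__(self, c1, c2, k=1, g=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, padding=k // 2, groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)

    def forward(self, x):
        return self.bn(self.conv(x))

# Every projection carries its own BatchNorm and operates on channel-first
# (N, C, H, W) feature maps.
dim = 256
qkv  = ConvBN(dim, dim * 2, k=1)       # query/key/value projection (width is an assumption)
proj = ConvBN(dim, dim, k=1)           # output projection
pe   = ConvBN(dim, dim, k=3, g=dim)    # depthwise positional-encoding conv

# A "textbook" channel-last attention block would instead look roughly like:
#   ln   = nn.LayerNorm(dim)           # applied once per block
#   qkv  = nn.Linear(dim, dim * 3)
#   proj = nn.Linear(dim, dim)
# i.e. LayerNorm once per block, not BatchNorm after every projection.
```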