Why are transformer UNets slower than CNN UNets despite being smaller? #9029
Replies: 2 comments
-
Hello @RochMollero, AFAIK, the Pareto principle says that roughly 80% of consequences come from 20% of causes. Similarly, the Lottery Ticket Hypothesis argues that in some situations we may need as few as 20% of the parameters (or even fewer) to get similar results. For quantization, see this blog post: Memory-efficient Diffusion Transformers with Quanto and Diffusers
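For reference, a minimal sketch of the approach that blog post describes, quantizing only the transformer denoiser with optimum-quanto; the model id below is illustrative, not something from this thread:

```python
import torch
from diffusers import DiffusionPipeline
from optimum.quanto import freeze, qfloat8, quantize

# Any pipeline whose denoiser is a diffusion transformer; the model id is just an example
pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)

# Quantize the transformer's weights to 8-bit float, then freeze them
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

pipe.to("cuda")
```

This trades a small amount of quality and some dequantization overhead for a much smaller weight footprint on the GPU.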
-
Hello
So I'm working on the AudioDiffusion pipeline and trying both the classical UNet and the conditional one based on transformers. Here are the numbers I measured:
CNN UNet:
- 3217 MiB GPU memory
- PARAM NUMBER: 113668609 params (113.668609M)
- 2 or 3 seconds on a RunPod H100 PCIe for 50 diffusion steps

(Conditional) transformer UNet:
- 4733 MiB GPU memory
- PARAM NUMBER: 69734529 params (69.734529M)
- 10 or 11 seconds on a RunPod H100 PCIe for 50 diffusion steps
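(Not the exact script used above, but a minimal sketch of how such numbers can be collected in PyTorch; `model` and `inputs` are placeholders for either UNet and its sample/timestep/conditioning tensors.)

```python
import torch

@torch.no_grad()
def benchmark(model, *inputs, steps=50):
    """Rough param-count / peak-memory / latency check (sketch only)."""
    n_params = sum(p.numel() for p in model.parameters())
    print(f"PARAM NUMBER: {n_params} params ({n_params / 1e6}M)")

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(steps):      # mimics 50 denoising steps; real sampling would
        model(*inputs)          # also run the scheduler between forward passes
    end.record()
    torch.cuda.synchronize()

    print(f"{start.elapsed_time(end) / 1000:.1f} s for {steps} forward passes")
    print(f"{torch.cuda.max_memory_allocated() / 2**20:.0f} MiB peak")
```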
So the transformer one is only about 61% of the CNN's size, but it takes 4 to 5 times longer to execute. And it takes more space on the GPU!
Why? I guess it's due to the type of operations, but is there more to it? In particular, why do people like transformers so much if they're this slow? Do we only need 20% of the parameters to get the same qualitative results? Should I lower the parameter count of my transformer UNet?
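(One way to see why parameter count and runtime can diverge: a GPU's runtime tracks FLOPs and memory traffic, not weights. A conv layer does work roughly proportional to its weights times the spatial size, while self-attention does extra work that grows with the square of the number of tokens without adding any parameters for it. The sketch below makes that concrete; the layer sizes are arbitrary and not taken from the AudioDiffusion models.)

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# The same feature map seen two ways: as an image for the conv, as 4096 tokens for attention
x_img = torch.randn(1, 256, 64, 64, device=device)
x_seq = x_img.flatten(2).transpose(1, 2)  # shape (1, 4096, 256)

conv = nn.Conv2d(256, 256, kernel_size=3, padding=1).to(device)              # ~0.59M params
attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True).to(device)  # ~0.26M params

@torch.no_grad()
def time_it(fn, iters=50):
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

print("conv params:", sum(p.numel() for p in conv.parameters()))
print("attn params:", sum(p.numel() for p in attn.parameters()))
print("conv ms/call:", time_it(lambda: conv(x_img)))
# The QK^T and attention-weighted-V matmuls scale with (num_tokens)^2,
# so this layer does far more work per parameter than the conv above.
print("attn ms/call:", time_it(lambda: attn(x_seq, x_seq, x_seq)))
```

The attention layer here has fewer than half the parameters of the conv, yet it is the slower of the two, which mirrors the gap between the two UNets above.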