Why are transformer UNets slower than CNN UNets despite being smaller? #9029
Replies: 2 comments
-
Hello @RochMollero, AFAIK, the Pareto principle says that roughly 80% of consequences come from 20% of causes. Similarly, the Lottery Ticket Hypothesis argues that in some situations we may need as few as 20% of the parameters (or even fewer) to get similar results. For quantization, see this blog post: Memory-efficient Diffusion Transformers with Quanto and Diffusers
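For reference, a minimal sketch of the approach that blog post describes, quantizing only the transformer denoiser with optimum-quanto; the model id below is illustrative, not something from this thread:

```python
import torch
from diffusers import DiffusionPipeline
from optimum.quanto import freeze, qfloat8, quantize

# Any pipeline whose denoiser is a diffusion transformer; the model id is just an example
pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)

# Quantize the transformer's weights to 8-bit float, then freeze them
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

pipe.to("cuda")
```

This trades a small amount of quality and some dequantization overhead for a much smaller weight footprint on the GPU.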
-
Hello
So I'm working on the AudioDiffusion pipeline and trying both the classical UNet and the conditional one based on transformers. Here are the numbers I measured:
CNN UNet:
- 3217 MiB GPU memory
- PARAM NUMBER: 113668609 params (113.668609M)
- 2 or 3 seconds on a RunPod H100 PCIe for 50 diffusion steps

(Conditional) transformer UNet:
- 4733 MiB GPU memory
- PARAM NUMBER: 69734529 params (69.734529M)
- 10 or 11 seconds on a RunPod H100 PCIe for 50 diffusion steps
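(Not the exact script used above, but a minimal sketch of how such numbers can be collected in PyTorch; `model` and `inputs` are placeholders for either UNet and its sample/timestep/conditioning tensors.)

```python
import torch

@torch.no_grad()
def benchmark(model, *inputs, steps=50):
    """Rough param-count / peak-memory / latency check (sketch only)."""
    n_params = sum(p.numel() for p in model.parameters())
    print(f"PARAM NUMBER: {n_params} params ({n_params / 1e6}M)")

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(steps):      # mimics 50 denoising steps; real sampling would
        model(*inputs)          # also run the scheduler between forward passes
    end.record()
    torch.cuda.synchronize()

    print(f"{start.elapsed_time(end) / 1000:.1f} s for {steps} forward passes")
    print(f"{torch.cuda.max_memory_allocated() / 2**20:.0f} MiB peak")
```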
So the transformer one is only about 61% of the CNN's size, but it takes 4 to 5 times longer to execute. And it takes more space on the GPU!
Why? I guess it's due to the type of operations, but is there more to it? In particular, why do people like transformers so much if they're this slow? Do we only need 20% of the parameters to get the same qualitative results? Should I lower the parameter count of my transformer UNet?
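(One way to see why parameter count and runtime can diverge: a GPU's runtime tracks FLOPs and memory traffic, not weights. A conv layer does work roughly proportional to its weights times the spatial size, while self-attention does extra work that grows with the square of the number of tokens without adding any parameters for it. The sketch below makes that concrete; the layer sizes are arbitrary and not taken from the AudioDiffusion models.)

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# The same feature map seen two ways: as an image for the conv, as 4096 tokens for attention
x_img = torch.randn(1, 256, 64, 64, device=device)
x_seq = x_img.flatten(2).transpose(1, 2)  # shape (1, 4096, 256)

conv = nn.Conv2d(256, 256, kernel_size=3, padding=1).to(device)              # ~0.59M params
attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True).to(device)  # ~0.26M params

@torch.no_grad()
def time_it(fn, iters=50):
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

print("conv params:", sum(p.numel() for p in conv.parameters()))
print("attn params:", sum(p.numel() for p in attn.parameters()))
print("conv ms/call:", time_it(lambda: conv(x_img)))
# The QK^T and attention-weighted-V matmuls scale with (num_tokens)^2,
# so this layer does far more work per parameter than the conv above.
print("attn ms/call:", time_it(lambda: attn(x_seq, x_seq, x_seq)))
```

The attention layer here has fewer than half the parameters of the conv, yet it is the slower of the two, which mirrors the gap between the two UNets above.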