Triton group normalization is slower than torch ops in some cases. #134

chenly15 · 2024-03-12T09:58:38Z

Among the techniques, I found that GroupNorm was effective for speeding things up. But when I compared the speed of the Triton and torch operations, I found that Triton was slower than torch. The script is the same as https://github.com/chengzeyi/stable-fast/blob/main/src/sfast/triton/ops/group_norm.py , and the result is shown below.
For the GroupNorm operation, is Triton actually faster than torch? Can I replace torch.nn.GroupNorm with TritonGroupNorm directly to accelerate a Stable Diffusion model?

My env is
A100 GPU, torch 2.1, triton 2.1, no xformers, diffusers 0.21.2
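
For reference, here is a minimal sketch of how such a comparison could be timed. It is not the script I actually ran. The helper names `time_op` and `compare_group_norm`, the input shape `(2, 320, 64, 64)`, and the `candidate(x, num_groups, weight, bias, eps)` calling convention are assumptions made for illustration and are not taken from stable-fast. The key point is that GPU kernels launch asynchronously, so the timer has to synchronize before it starts and before it stops.

```python
import time
from typing import Callable, Optional


def time_op(fn: Callable[[], object],
            warmup: int = 10,
            iters: int = 100,
            sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock time of fn() in milliseconds."""
    # Warm-up runs absorb one-time costs (e.g. Triton JIT compilation, autotuning).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure all warm-up work has finished before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) * 1e3 / iters


def compare_group_norm(candidate=None, shape=(2, 320, 64, 64), num_groups=32):
    """Time torch's group_norm and, optionally, a candidate with the same signature.

    `candidate` is a hypothetical callable (x, num_groups, weight, bias, eps);
    adapt the call to whatever the implementation under test really expects.
    """
    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    x = torch.randn(shape, device=device, dtype=dtype)
    weight = torch.ones(shape[1], device=device, dtype=dtype)
    bias = torch.zeros(shape[1], device=device, dtype=dtype)
    sync = torch.cuda.synchronize if device == "cuda" else None

    results = {}
    with torch.no_grad():
        results["torch"] = time_op(
            lambda: F.group_norm(x, num_groups, weight, bias, 1e-5), sync=sync)
        if candidate is not None:
            results["candidate"] = time_op(
                lambda: candidate(x, num_groups, weight, bias, 1e-5), sync=sync)
    return results
```

Calling `compare_group_norm()` alone gives a torch baseline in milliseconds per call. The numbers will likely vary with shape, dtype, and memory format, so it seems worth timing the shapes that actually occur in the model instead of relying on a single configuration.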

chengzeyi · 2024-05-09T14:55:12Z

@chenly15 Our implementation may not be very efficient at the moment. I currently use other methods to speed up GroupNorm computation, but they have not been open-sourced so far.
