Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hoisting the invariant out of multi-level nested loops #61420

Closed
kunalspathak opened this issue Nov 10, 2021 · 13 comments
Closed

Hoisting the invariant out of multi-level nested loops #61420

kunalspathak opened this issue Nov 10, 2021 · 13 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Nov 10, 2021

If there are multi-level nested loops and there is an invariant calculation inside the inner most loop, we only hoist it to one level outer loop instead of hoisting it to multiple level outer loop (as appropriate). Below is an example of real world code:

https://github.com/SixLabors/ImageSharp/blob/255226b03c8350f88e641bdb58c9450b68729ef7/src/ImageSharp/Processing/Processors/Quantization/WuQuantizer%7BTPixel%7D.cs#L418-L444

Here is a simple repro where all the calculation of ind1 is done inside the a-loop instead of spreading it to b-loop, g-loop and r-loop.

I tried to hand-code this in SixLabors/ImageSharp#1818 and see some good improvements. Ideally this should be done by JIT.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Nov 10, 2021
@ghost
Copy link

ghost commented Nov 10, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

If there are multi-level nested loops and there is an invariant calculation inside the inner most loop, we only hoist it to one level outer loop instead of hoisting it to multiple level outer loop (as appropriate). Below is an example of real world code:

https://github.com/SixLabors/ImageSharp/blob/255226b03c8350f88e641bdb58c9450b68729ef7/src/ImageSharp/Processing/Processors/Quantization/WuQuantizer%7BTPixel%7D.cs#L418-L444

Here is a simple repro where all the calculation of ind1 is done inside the a-loop.

I tried to hand-code this in SixLabors/ImageSharp#1818 and see some good improvements. Ideally this should be done by JIT.

Author: kunalspathak
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib , @BruceForstall

@EgorBo
Copy link
Member

EgorBo commented Nov 10, 2021

This is exactly why I see a small regression in #61408, e.g.:

void foo(int i)
{
    for (int j = 0; j < 100; j++)
    {
        for (int k = 0; k < 100; k++)
        {
            Console.WriteLine(Vector128.Create(42, 42));
        }
    }
}

Current codegen in Main:

; Method Program:foo(int):this
G_M10044_IG01:
       push     rdi
       push     rsi
       sub      rsp, 56
       vzeroupper 
       vmovaps  qword ptr [rsp+20H], xmm6
						;; bbWeight=1    PerfScore 5.25

G_M10044_IG02:
       xor      esi, esi
						;; bbWeight=1    PerfScore 0.25

G_M10044_IG03:
       xor      edi, edi
       vmovupd  xmm6, xmmword ptr [reloc @RWD00]
						;; bbWeight=4    PerfScore 13.00

G_M10044_IG04:
       mov      rcx, 0xD1FFAB1E      ; System.Runtime.Intrinsics.Vector128`1[Int64]
       call     CORINFO_HELP_NEWSFAST
       vmovupd  xmmword ptr [rax+8], xmm6
       mov      rcx, rax
       call     System.Console:WriteLine(System.Object)
       inc      edi
       cmp      edi, 100
       jl       SHORT G_M10044_IG04
						;; bbWeight=16    PerfScore 96.00

G_M10044_IG05:
       inc      esi
       cmp      esi, 100
       jl       SHORT G_M10044_IG03
						;; bbWeight=4    PerfScore 6.00

G_M10044_IG06:
       vmovaps  xmm6, qword ptr [rsp+20H]
       add      rsp, 56
       pop      rsi
       pop      rdi
       ret      
						;; bbWeight=1    PerfScore 6.25
RWD00  	dq	000000000000002Ah, 000000000000002Ah
; Total bytes of code: 82

As you can see, Vector128.Create(42, 42) was only hoisted from the inner loop.

@AndyAyersMS
Copy link
Member

We've got multiple issues open on this already -- linking them up for context

@BruceForstall
Copy link
Member

BruceForstall commented Nov 10, 2021

@kunalspathak Wrote a simplified example from the imagesharp case:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public class C {
    
    private const int IndexBits = 5;
    private const int IndexAlphaBits = 5;
    private const int IndexCount = (1 << IndexBits) + 1;
    private const int IndexAlphaCount = (1 << IndexAlphaBits) + 1;

    
    public int M() {
        int sum = 0;
        for (int r = 0; r < IndexCount; r++)
        {
            for (int g = 0; g < IndexCount; g++)
            {
                for (int b = 0; b < IndexCount; b++)
                {
                    for (int a = 0; a < IndexAlphaCount; a++)
                    {
                        int ind1 = (r << ((IndexBits * 2) + IndexAlphaBits))
                                    + (r << (IndexBits + IndexAlphaBits + 1))
                                    + (g << (IndexBits + IndexAlphaBits))
                                    + (r << (IndexBits * 2))
                                    + (r << (IndexBits + 1))
                                    + (g << IndexBits)
                                    + ((r + g + b) << IndexAlphaBits)
                                    + r + g + b + a;
                        sum += ind1;
                    }
                }
            }
        }
        return sum;
    }
}

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Nov 11, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Nov 11, 2021
@kunalspathak
Copy link
Member Author

kunalspathak commented Feb 4, 2022

Another impact of this is when we refer a static variable inside a loop. This is bad in for Arm64 because we need 3 instructions to construct the address of static variable.

private static int COUNT = 5;

[MethodImpl(MethodImplOptions.NoInlining)]
public static double issue3(int x, int y, int z)
{
    double result = 0;
    for (int i = 0; i < x; i++)
    {
        for (int j = 0; j < y; i++)
        {
            result += COUNT;
        }
    }
    return result;
}
G_M22420_IG03:              ;; offset=0020H
        7100003F          cmp     w1, #0
        5400018D          ble     G_M22420_IG05

 ; Below construction of 0x7ffb568dda90 can be hoisted outside the outer most loop IG03.
        D29B5214          movz    x20, #0xda90
        F2AAD1B4          movk    x20, #0x568d LSL #16
        F2CFFF74          movk    x20, #0x7ffb LSL #32 
        AA1403E0          mov     x0, x20
        528001A1          mov     w1, #13
        94000000          bl      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
        B9405E80          ldr     w0, [x20,#92]
        1E620010          scvtf   d16, w0
                                                ;; bbWeight=4    PerfScore 48.00
G_M22420_IG04:              ;; offset=0048H
        1E702908          fadd    d8, d8, d16
        11000673          add     w19, w19, #1
        17FFFFFE          b       G_M22420_IG04
                                                ;; bbWeight=16    PerfScore 72.00
G_M22420_IG05:              ;; offset=0054H
        11000673          add     w19, w19, #1
        6B00027F          cmp     w19, w0
        54FFFE2B          blt     G_M22420_IG03

@kunalspathak
Copy link
Member Author

kunalspathak commented Feb 7, 2022

@BruceForstall - do you think you will have time to look into this in .NET 7?

@BruceForstall BruceForstall self-assigned this Feb 8, 2022
@BruceForstall
Copy link
Member

@BruceForstall - do you think you will have time to look into this in .NET 7?

It feels unlikely.

@kunalspathak
Copy link
Member Author

I was trying to understand why hoisting of static variable address doesn't get hoisted at 2-levels with the following example:

private static int SOME_TEMP= 4;

int Test() {
  int result = 0;
  for (int i = 0; i < n; i++)  {
      for (int j = 0; j < m; j++)  {
          result += SOME_TEMP;
      }
  }
}

The problem is because we hoist the code from outer to inner loop and not vice-versa as pointed out here.

// We really should consider hoisting from conditionally executed blocks, if they are frequently executed
// and it is safe to evaluate the tree early.
//
// In particular if we have a loop nest, when scanning the outer loop we should consider hoisting from blocks
// in enclosed loops. However, this is likely to scale poorly, and we really should instead start
// hoisting inner to outer.
//

I tried switching the ordering with the following change:

@@ -6049,7 +6063,6 @@ void Compiler::optHoistLoopNest(unsigned lnum, LoopHoistContext* hoistCtxt)
     m_loopsConsidered++;
 #endif // LOOP_HOIST_STATS

-    optHoistThisLoop(lnum, hoistCtxt);

     VNSet* hoistedInCurLoop = hoistCtxt->ExtractHoistedInCurLoop();

@@ -6086,6 +6099,8 @@ void Compiler::optHoistLoopNest(unsigned lnum, LoopHoistContext* hoistCtxt)
             }
         }
     }
+
+     optHoistThisLoop(lnum, hoistCtxt);

However, when we visit the blocks to hoist, we only visit blocks that dominate exit blocks of the loop. Here is the original block ordering:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..006)-> BB05 ( cond )                     i 
BB02 [0001]  2       BB01,BB04             4     0 [006..00A)-> BB04 ( cond )                     i Loop Loop0 bwd bwd-target 
BB03 [0002]  2       BB02,BB03            16     1 [00A..01A)-> BB03 ( cond )                     i Loop Loop0 hascall bwd bwd-target 
BB04 [0004]  2       BB02,BB03             4     0 [01A..022)-> BB02 ( cond )                     i bwd 
BB05 [0006]  2       BB01,BB04             1       [022..024)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

There are two loops - L00, the outer loop and L01, the inner loop. With my change, we hoist the block BB03 and add a preheader block for the hoisted code. The new blocks look like this:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..006)-> BB05 ( cond )                     i 
BB02 [0001]  2       BB01,BB04             4     0 [006..00A)-> BB04 ( cond )                     i Loop Loop0 bwd bwd-target 
BB06 [0009]  1       BB02                  4     0 [00A..???)                                     internal LoopPH 
BB03 [0002]  2       BB03,BB06            16     1 [00A..01A)-> BB03 ( cond )                     i Loop Loop0 hascall bwd bwd-target 
BB04 [0004]  2       BB02,BB03             4     0 [01A..022)-> BB02 ( cond )                     i bwd 
BB05 [0006]  2       BB01,BB04             1       [022..024)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

But when we try to hoist code for loop L00, we just hoist it for BB02 and BB04 because those are the ones that dominate the exit block BB04. Hence, we discard the block BB06 which has code that can be hoisted out - because we don't consider conditionally executed blocks today.

@kunalspathak
Copy link
Member Author

Hence, we discard the block BB06 which has code that can be hoisted out - because we don't consider conditionally executed blocks today.

I believe we should have a way to look at the edge weights and consider all the hot blocks that participate in the loop for hoisting. @AndyAyersMS ?

@BruceForstall
Copy link
Member

@kunalspathak Since you're working on this, I'm assigning it to you.

@kunalspathak
Copy link
Member Author

@kunalspathak Wrote a simplified example from the imagesharp case:

Here is the diff from #68061 . Full diffs: https://www.diffchecker.com/DKGAZKq0
image

@kunalspathak
Copy link
Member Author

The example mentioned in #61420 (comment) is addressed by #68061

image

@ghost ghost locked as resolved and limited conversation to collaborators Jul 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

5 participants