You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This issue discusses in detail how the compiler treats a loop containing a compiled function which is sensitive to the ctype of its argument.
Consider the program https://github.com/alexandergall/snabbswitch/blob/funcf-loop-analysis/src/loops.lua (note that this is on a branch that disables JIT in program.snsh to make the results deterministic). The objects d1 and d2 will have different ctype IDs (1417 vs 1419).
We are interested in the compilation of the last loop with the condition that foo() has already been compiled when it is reached. Satisfying this condition is an interesting exercise by itself, so let's start with that.
/* Type of hot counter. Must match the code in the assembler VM. */
/* 16 bits are sufficient. Only 0.0015% overhead with maximum slot penalty. */
typedef uint16_t HotCount;
/* Number of hot counter hash table entries (must be a power of two). */
#define HOTCOUNT_SIZE 64
#define HOTCOUNT_PCMASK ((HOTCOUNT_SIZE-1)*sizeof(HotCount))
/* Hotcount decrements. */
#define HOTCOUNT_LOOP 2
#define HOTCOUNT_CALL 1
JIT_P_hotloop is the value of the LuaJIT runtime parameter hotloop (http://luajit.org/running.html) which is 56 by default. Therefore, all 64 hot counters are initialized to 56*2-1 = 111.
|// Decrement hashed hotcount and trigger trace recorder if zero.
|.macro hotloop, reg
| mov reg, PC
| shr reg, 1
| and reg, HOTCOUNT_PCMASK
| sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_LOOP
| jb ->vm_hotloop
|.endmacro
|
|.macro hotcall, reg
| mov reg, PC
| shr reg, 1
| and reg, HOTCOUNT_PCMASK
| sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_CALL
| jb ->vm_hotcall
|.endmacro
case BC_FORL:
|.if JIT
| hotloop RB
|.endif
case BC_FUNCF:
|.if JIT
| hotcall RB
|.endif
We can see here that the PC of the bytecode instruction is used as an index into the hotcount table by right-shifting once and masking by 0x3F (i.e. byte codes don't have individual hot counters). We also see that the counters are decremented by 2 for a loop but only 1 for a call, which means that it actually takes 112 passes through FUNCF before the call becomes hot.
How can we achieve that the function foo() in our example program becomes hot? Naively, we might try
for i = 1, 112 do
foo(d1, i)
end
Let's see what happens. For the first 55 iterations, the interpreter is executing all instructions and the hot counters are 1 and 56 for the loop and the function, respectively. With the next iteration, the counter for foo() is decremented once more (the loop body is executed before the FORL instruction) but the loop becomes hot and a root trace is starting to get recorded. Detection of hot loops and calls is disabled during recording and, of course, also during the execution of a compiled trace, which means that the hot counter for foo() remains at 55 not matter how many more iterations there are.
So, to achieve our goal, we have to add another loop with at least 56 iterations. Note that we have to chose at least 58 iterations to avoid the ultimate iteration abort syndrome#1393.
Note that the bytecode for the function call is FUNCF and the ctype ID of the function argument is 1417.
The second loop on line 16 is interpreted for the first 56 iterations. At that point, just before the loop itself gets hot, the function`s hot counter goes to -1 and the function is compiled in trace #4
Note how the bytecode of the function call has changed to JFUNFC due to the fact the the function is now compiled. However, the trace recorder ignores this fact and inlines the function call just like in the first loop.
Finally we get to the case when a new loop is encountered that calls the compiled function with a different argument. Already in the first iteration, the interpreter hits the JFUNCF bytecode and executes the compiled function. However, the argument d2 has a different ctype ID (1419) than what was used when the function was compiled. Therefore, trace #4 is immediately left via exit #0 and execution is resumed by the interpreter. This happens 10 times according to the runtime option hotexit before this side exit becomes hot and a new trace is started to be recorded.
Being a side trace, it is not allowed to contain a loop and the compiler is starting to unroll the loop up to the maximum given by the option loopunroll which is 15 by default. After that, the recorder aborts with
This happens again 3 more times according to the option tryside. At this point, the loop has been executed 70 times (10 times to reach hotexit and 4*15 times for loopunroll). The compiler "balcklists" exit #0 of trace #4 by a fixed fallback to the interpreter
From this point on, trace #4 will exit to the interpreter immediately, just like before the exit became hot.
What happened to the hot counter of the loop? It is decremented in the usual fashion whenever an iteration is interpreted, i.e. during the first 10 iterations and after the blacklisting but not during the failed recordings of the side trace. So we need 60 additional iterations to see the loop finally getting compiled. That would be 116 iterations, but then we would hit #1393 again, so we add two more. The loop is compiled in trace #7
Note again that the JFUNCF bytecode is ignored and the function is inlined. Due to that, the compiler is now specializing the code for ctype ID 1419, as it should.
At this point the loop is finally compiled and optimized. In this simple example, compilation of the function has simply delayed compilation of the last loop and nothing bad has happened. However, the systems has spent quite a bit more time in the interpreter than it would have if foo() hadn't been compiled for a different ctype ID. I suspect that this behavior can make a difference in a more complex application, where additional time spent in the interpreter can lead to suboptimal traces in other parts of the system.
The text was updated successfully, but these errors were encountered:
Just fwiw, the simplified way that I read examples like this is to say:
The JIT will compile each inner loop separately.
Called functions are effectively copy-pasted (inlined) into the loop and then freshly compiled there.
So in this case
There are three for ... end loops.
These all qualify as inner loops because they don't call any function that contains another loop.
Each loop will be specialized separately: the first two for d1 and the third for d2.
So I'd expect to have three looping traces each specialized for one FFI ctype and then perhaps some "background noise" traces that compile individual functions or connections between loops.
I like this simplified view as a way to quickly sketch how the most important parts of a program - its inner loops - are likely to be compiled.
This issue discusses in detail how the compiler treats a loop containing a compiled function which is sensitive to the
ctype
of its argument.Consider the program https://github.com/alexandergall/snabbswitch/blob/funcf-loop-analysis/src/loops.lua (note that this is on a branch that disables JIT in
program.snsh
to make the results deterministic). The objectsd1
andd2
will have differentctype
IDs (1417 vs 1419).We are interested in the compilation of the last loop with the condition that
foo()
has already been compiled when it is reached. Satisfying this condition is an interesting exercise by itself, so let's start with that.At this point, it is necessary to understand how the compiler manages the hot counters for loops and functions. The system uses a single array of counters, which is initialized at https://github.com/alexandergall/snabbswitch/blob/771b55c829f42a1a788002c2924c6d7047cd1568/lib/luajit/src/lj_dispatch.c#L83
The parameters are defined in https://github.com/alexandergall/snabbswitch/blob/771b55c829f42a1a788002c2924c6d7047cd1568/lib/luajit/src/lj_dispatch.h#L69
JIT_P_hotloop
is the value of the LuaJIT runtime parameterhotloop
(http://luajit.org/running.html) which is 56 by default. Therefore, all 64 hot counters are initialized to56*2-1 = 111
.The code that decrements the counters can be found in the implementation of the bytecode instructions
ITERN
,FORL
,ITERL
,LOOP
andFUNCF
. In this example, we are only interested inFORL
andFUNCF
and we consider the 32-bit VM from https://github.com/alexandergall/snabbswitch/blob/master/lib/luajit/src/vm_x86.dasc, i.e.We can see here that the PC of the bytecode instruction is used as an index into the hotcount table by right-shifting once and masking by
0x3F
(i.e. byte codes don't have individual hot counters). We also see that the counters are decremented by 2 for a loop but only 1 for a call, which means that it actually takes 112 passes throughFUNCF
before the call becomes hot.How can we achieve that the function
foo()
in our example program becomes hot? Naively, we might tryLet's see what happens. For the first 55 iterations, the interpreter is executing all instructions and the hot counters are 1 and 56 for the loop and the function, respectively. With the next iteration, the counter for
foo()
is decremented once more (the loop body is executed before theFORL
instruction) but the loop becomes hot and a root trace is starting to get recorded. Detection of hot loops and calls is disabled during recording and, of course, also during the execution of a compiled trace, which means that the hot counter forfoo()
remains at 55 not matter how many more iterations there are.So, to achieve our goal, we have to add another loop with at least 56 iterations. Note that we have to chose at least 58 iterations to avoid the
ultimate iteration abort syndrome
#1393.Let's run our program with
and look at the dump.txt
The first trace that got recorded starts at the loop on line 13
Note that the bytecode for the function call is
FUNCF
and thectype
ID of the function argument is 1417.The second loop on line 16 is interpreted for the first 56 iterations. At that point, just before the loop itself gets hot, the function`s hot counter goes to -1 and the function is compiled in trace #4
Again, note the
ctype
id 1417. Right after that, the loop itself is also compiledNote how the bytecode of the function call has changed to
JFUNFC
due to the fact the the function is now compiled. However, the trace recorder ignores this fact and inlines the function call just like in the first loop.Finally we get to the case when a new loop is encountered that calls the compiled function with a different argument. Already in the first iteration, the interpreter hits the
JFUNCF
bytecode and executes the compiled function. However, the argumentd2
has a differentctype
ID (1419) than what was used when the function was compiled. Therefore, trace #4 is immediately left via exit #0 and execution is resumed by the interpreter. This happens 10 times according to the runtime optionhotexit
before this side exit becomes hot and a new trace is started to be recorded.Being a side trace, it is not allowed to contain a loop and the compiler is starting to unroll the loop up to the maximum given by the option
loopunroll
which is 15 by default. After that, the recorder aborts withThis happens again 3 more times according to the option
tryside
. At this point, the loop has been executed 70 times (10 times to reachhotexit
and 4*15 times forloopunroll
). The compiler "balcklists" exit #0 of trace #4 by a fixed fallback to the interpreterFrom this point on, trace #4 will exit to the interpreter immediately, just like before the exit became hot.
What happened to the hot counter of the loop? It is decremented in the usual fashion whenever an iteration is interpreted, i.e. during the first 10 iterations and after the blacklisting but not during the failed recordings of the side trace. So we need 60 additional iterations to see the loop finally getting compiled. That would be 116 iterations, but then we would hit #1393 again, so we add two more. The loop is compiled in trace #7
Note again that the
JFUNCF
bytecode is ignored and the function is inlined. Due to that, the compiler is now specializing the code forctype
ID 1419, as it should.At this point the loop is finally compiled and optimized. In this simple example, compilation of the function has simply delayed compilation of the last loop and nothing bad has happened. However, the systems has spent quite a bit more time in the interpreter than it would have if
foo()
hadn't been compiled for a differentctype
ID. I suspect that this behavior can make a difference in a more complex application, where additional time spent in the interpreter can lead to suboptimal traces in other parts of the system.The text was updated successfully, but these errors were encountered: