loongarch: sljit_emit_atomic_load/store implementation #172
base: master
Conversation
Force-pushed from 2393db3 to 073f7de
sljit_src/sljitConfigInternal.h (outdated)
@@ -786,6 +786,7 @@ SLJIT_API_FUNC_ATTRIBUTE sljit_sw sljit_exec_offset(void* ptr);
 #define SLJIT_NUMBER_OF_SAVED_FLOAT_REGISTERS 12
 #define SLJIT_MASKED_SHIFT 1
 #define SLJIT_MASKED_SHIFT32 1
+#define SLJIT_ATOMIC_MIN_WIDTH 32
There should be no such constraint. When a 16 bit value is loaded and you have a 64 bit atomic rmw, you need to do the following steps (sketched in code below):
- Align the address to 64 bit
- Load the 64 bit value
- Insert the 16 bit data at the correct bit position (xor can help)
- RMW the 64 bit value

You can do this with any data type.
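A minimal C sketch of those steps, using a C11 compare-exchange loop in place of raw LL/SC and assuming a little-endian 64 bit target (the function name is illustrative, not part of sljit):

#include <stdint.h>
#include <stdatomic.h>

/* Emulated 16 bit atomic store through a 64 bit CAS on the aligned
   containing word; an illustrative stand-in for an LL/SC sequence. */
static void atomic_store_u16_emulated(uint16_t *addr, uint16_t value)
{
	/* Align the address down to the containing 64 bit word. */
	_Atomic uint64_t *word = (_Atomic uint64_t *)((uintptr_t)addr & ~(uintptr_t)7);
	unsigned shift = ((uintptr_t)addr & 7) * 8;
	uint64_t mask = (uint64_t)0xffff << shift;
	uint64_t old = atomic_load(word);
	uint64_t repl;

	do {
		/* Insert the 16 bit data at its bit position; the xor form
		   old ^ ((old ^ inserted) & mask) preserves the other bits. */
		repl = old ^ ((old ^ ((uint64_t)value << shift)) & mask);
	} while (!atomic_compare_exchange_weak(word, &old, repl));
}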
You are correct if we were implementing an atomic rmw operation directly, but we are not.
The load has to return a valid value for the type used, so all non-relevant bits from the wider register (loaded from memory with the correct alignment and containing the smaller type) have to be discarded.
By the time the store comes, you can't write the value back to memory or you risk wiping several unrelated bytes, so a second load must be done to insert the smaller type in the right place before the atomic store can be completed.
All that is very expensive, especially when you have a very simple RISC CPU that doesn't have a cmpxchg-like CAS instruction either.
I suspect the "problem" is also present at least on MIPS and the "solution" was to put in the specification what I put in code here.
You can LL/SC any type you want as long as it is aligned to the minimum width supported by the CPU; I am not sure whether the data pulled from memory gets masked during transfer and the offset fixed, like ARM64 seems to be doing (especially considering the granularity of the reservation).
If our API implemented a cmpxchg-like interface instead, then it could be implemented just like you mention.
Note that loongarch does support CAS-like RMW operations in hardware with its LAM extension, so it would be even easier to implement atomic add, xor and the rest of the usual RMW operations with that than doing everything based on the LL/SC primitives.
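For comparison, a plain C11 RMW that a LoongArch compiler with LAM support can lower to a single amadd.w (an illustration only; actual codegen depends on the compiler and target flags):

#include <stdatomic.h>
#include <stdint.h>

/* With LAM available, this can become one amadd.w instead of an
   LL/SC retry loop. */
uint32_t fetch_add_u32(_Atomic uint32_t *p, uint32_t v)
{
	return atomic_fetch_add(p, v);
}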
You can add a flag to cpuinfo to tell that 8/16 bit updates are not efficient. Still, the emulation can use instructions which are not available otherwise, so it will still be better than what the user can do.
There is also the counterpart of that, which is to (like I am proposing here) make the limitation of the CPU visible at build time and optimize for the fast use case by default.
Of course, I will add a way to enable the emulation (I had some WIP code that does this using an SLJIT_ATOMIC_EMULATION define, which by now has been mostly implemented for SLJIT_MOV_U16 only) and I could share it for feedback once it actually works. The end result will probably look like the initial proposal from Mingtao, only uglier.
I frankly suspect that in real world scenarios all atomics will be aligned correctly and have enough padding regardless, so the added code from the emulation is just dead weight. I was thinking of splitting the change so SLJIT_ATOMIC_WIDTH could be defined independently, and then implementing SLJIT_ATOMIC_EMULATION also for s390x, which unlike loongarch is likely to be enabled by default, so it can be tested independently.
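A rough sketch of how that build-time selection might look (the macro names come from this thread; the structure around them is hypothetical):

#if (defined SLJIT_ATOMIC_EMULATION && SLJIT_ATOMIC_EMULATION)
	/* Emit the expensive align/mask/insert sequence for operand
	   types narrower than SLJIT_ATOMIC_WIDTH. */
#elif (defined SLJIT_ATOMIC_WIDTH && SLJIT_ATOMIC_WIDTH > 8)
	/* Fast path only: operands narrower than SLJIT_ATOMIC_WIDTH
	   are undefined and may SIGBUS on under-aligned addresses. */
#endif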
We have an infrastructure for checking cpu features, and only two options exist here: 8/16 bit is available or emulated. From a user perspective, it is hard to use a "bit width" define, because it can have any value.
While implementing the atomic store for SLJIT_MOV_U8, I found another scenario where the test will infloop. LL/SC is IMHO too fragile, and with most CPUs (even the traditionally RISC ones) providing CAS, it seems worth looking into whether an API more along those lines is worth exploring.
> There should be no such constraint. When a 16 bit value is loaded and you have a 64 bit atomic rmw, you need to do the following steps:
> - Align the address to 64 bit
> - Load the 64 bit value
> - Insert the 16 bit data at the correct bit position (xor can help)
> - RMW the 64 bit value
> You can do this with any data type.
LoongArch64 requires 16 instructions (maybe optimized to 12) when implementing RMW operations compatible with SLJIT_MOV_U8, and LL-SC should use fewer instructions.
64 bit byte reverse on systems without a byte reverse instruction takes more. The thing is, if code (such as WebAssembly) wants to do an 8 bit atomic update, it does not really matter how many instructions are needed; the job needs to be done somehow. And cpu specific instruction sequences are always shorter than using sljit instructions for emulation.
> 64 bit byte reverse on systems without a byte reverse instruction takes more

That is not a fair comparison, because there is no restriction on the number of instructions you can run on any CPU for that operation; but for LL/SC (which is conceptually one single instruction that can be used to implement "RMW") LoongArch apparently has such a restriction, and from my reading it would seem RISC-V does as well.
The end result is that in those cases the reservation always gets invalidated, and then you are left with a process that infloops.
For all practical purposes the compiler doesn't let you emit a load independently of a store; it deduces the type of operation used and opts for the CAS-like equivalent. For example, the following C11 code:

#include <stdatomic.h>

unsigned char s = 32;
volatile unsigned char m = 127;
atomic_uchar c;

void update(void) { c = (c > m) ? 0 : c + s; }

will most likely be compiled into something close to what you get now for loongarch with emulation disabled, ensuring that c is aligned to 32 bit; or, even better, the compiler will detect that LAM is available and can then use ammax.wu for the comparison and amadd.w for the increment.
An option could be to create a cmpxchg for all architectures, and create an optional LL/SC based variant, which has limited support.
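A hypothetical shape for such an entry point, mirroring the style of the existing emitters (the name and operands are illustrative, not part of the current sljit API):

/* Compare-and-swap: loads *mem_reg into dst_reg, and stores src_reg
   to *mem_reg only if the loaded value equals cmp_reg. */
SLJIT_API_FUNC_ATTRIBUTE sljit_s32 sljit_emit_atomic_cmpxchg(
	struct sljit_compiler *compiler, sljit_s32 op,
	sljit_s32 dst_reg,  /* receives the value observed in memory */
	sljit_s32 mem_reg,  /* register holding the address */
	sljit_s32 cmp_reg,  /* expected value */
	sljit_s32 src_reg); /* new value stored on success */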
Force-pushed from dea217c to 8797280
Assumes all atomics are at least 32 bit aligned by setting SLJIT_ATOMIC_WIDTH. In that mode, only the bits that correspond to the operand size are used and the rest are presumed to be 0. Operations with lower alignment than that are undefined and will result in a SIGBUS. If SLJIT_ATOMIC_EMULATION is defined, then additional code is enabled that allows manipulating those narrower types at a significant performance cost. Original code by Mingtao, all bugs mine.
WIP because atomic load/store not supported yet: zherczeg/sljit#172
assumes all atomics are at least 32 bit aligned through setting a per CPU type SLJIT_ATOMIC_WIDTH, and makes emulation optional and disabled by default with SLJIT_ATOMIC_EMULATION.
only the bits that correspond to the operand size are taken into consideration, with the rest presumed to be 0, so update tests to support that mode as well.
operations with lower alignment are undefined unless emulation is enabled.