loongarch: sljit_emit_atomic_load/store implementation #172
base: master
Conversation
Force-pushed from 2393db3 to 073f7de
sljit_src/sljitConfigInternal.h (outdated)
@@ -786,6 +786,7 @@ SLJIT_API_FUNC_ATTRIBUTE sljit_sw sljit_exec_offset(void* ptr);
 #define SLJIT_NUMBER_OF_SAVED_FLOAT_REGISTERS 12
 #define SLJIT_MASKED_SHIFT 1
 #define SLJIT_MASKED_SHIFT32 1
+#define SLJIT_ATOMIC_MIN_WIDTH 32
There should be no such constraint. When a 16 bit value is loaded and you have a 64 bit atomic rmw, you need to do the following steps (sketched in code below):
- Align the address to 64 bit
- Load the 64 bit value
- Insert the 16 bit data at the correct bit position (xor can help)
- RMW the 64 bit value

You can do this with any data type.
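A minimal C sketch of those steps, using a C11 compare-exchange loop in place of raw LL/SC and assuming a little-endian 64 bit target (the function name is illustrative, not part of sljit):

#include <stdint.h>
#include <stdatomic.h>

/* Emulated 16 bit atomic store through a 64 bit CAS on the aligned
   containing word; an illustrative stand-in for an LL/SC sequence. */
static void atomic_store_u16_emulated(uint16_t *addr, uint16_t value)
{
	/* Align the address down to the containing 64 bit word. */
	_Atomic uint64_t *word = (_Atomic uint64_t *)((uintptr_t)addr & ~(uintptr_t)7);
	unsigned shift = ((uintptr_t)addr & 7) * 8;
	uint64_t mask = (uint64_t)0xffff << shift;
	uint64_t old = atomic_load(word);
	uint64_t repl;

	do {
		/* Insert the 16 bit data at its bit position; the xor form
		   old ^ ((old ^ inserted) & mask) preserves the other bits. */
		repl = old ^ ((old ^ ((uint64_t)value << shift)) & mask);
	} while (!atomic_compare_exchange_weak(word, &old, repl));
}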
You are correct if we were implementing an atomic rmw operation directly, but we are not.
The load has to return a valid value for the type used, so all non-relevant bits from the wider register (loaded from memory with the correct alignment and containing the smaller type) have to be discarded.
By the time the store comes, you can't write the value back to memory or you risk wiping several unrelated bytes, so a second load must be done to insert the smaller type in the right place before the atomic store can be completed.
All that is very expensive, especially when you have a very simple RISC CPU that doesn't have a cmpxchg-like CAS instruction either.
I suspect the "problem" is also present at least on MIPS and the "solution" was to put in the specification what I put in code here.
You can LL/SC any type you want as long as it is aligned to the minimum width supported by the CPU; I am not sure whether the data pulled from memory gets masked during transfer and the offset fixed, like ARM64 seems to be doing (especially considering the granularity of the reservation).
If our API implemented a cmpxchg-like interface instead, then it could be implemented just like you mention.
Note that loongarch does support CAS-like RMW operations in hardware with its LAM extension, so it would be even easier to implement atomic add, xor and the rest of the usual RMW operations with that than doing everything based on the LL/SC primitives.
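For comparison, a plain C11 RMW that a LoongArch compiler with LAM support can lower to a single amadd.w (an illustration only; actual codegen depends on the compiler and target flags):

#include <stdatomic.h>
#include <stdint.h>

/* With LAM available, this can become one amadd.w instead of an
   LL/SC retry loop. */
uint32_t fetch_add_u32(_Atomic uint32_t *p, uint32_t v)
{
	return atomic_fetch_add(p, v);
}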
You can add a flag to cpuinfo to tell that 8/16 bit updates are not efficient. Still, the emulation can use instructions which are not available otherwise, so it will still be better than what the user can do.
There is also the counterpart of that, which is to (like I am proposing here) make the limitation of the CPU visible at build time and optimize for the fast use case by default.
Of course, I will add a way to enable the emulation (I had some WIP code that does this using an SLJIT_ATOMIC_EMULATION define, which by now has been mostly implemented for SLJIT_MOV_U16 only) and I could share it for feedback once it actually works. The end result will probably look like the initial proposal from Mingtao, only uglier.
I frankly suspect that in real world scenarios all atomics will be aligned correctly and have enough padding regardless, so the added code from the emulation is just dead weight. I was thinking of splitting the change so SLJIT_ATOMIC_WIDTH could be defined independently, and then implementing SLJIT_ATOMIC_EMULATION also for s390x, which unlike loongarch is likely to be enabled by default, so it can be tested independently.
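A rough sketch of how that build-time selection might look (the macro names come from this thread; the structure around them is hypothetical):

#if (defined SLJIT_ATOMIC_EMULATION && SLJIT_ATOMIC_EMULATION)
	/* Emit the expensive align/mask/insert sequence for operand
	   types narrower than SLJIT_ATOMIC_WIDTH. */
#elif (defined SLJIT_ATOMIC_WIDTH && SLJIT_ATOMIC_WIDTH > 8)
	/* Fast path only: operands narrower than SLJIT_ATOMIC_WIDTH
	   are undefined and may SIGBUS on under-aligned addresses. */
#endif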
We have an infrastructure for checking cpu features, and only two options exist here: 8/16 bit is available or emulated. From a user perspective, it is hard to use a "bit width" define, because it can have any value.
While implementing the atomic store for SLJIT_MOV_U8, I found another scenario where the test will infloop. LL/SC is IMHO too fragile, and with most CPUs (even the traditionally RISC ones) providing CAS, it seems worth looking into whether an API more along those lines is worth exploring.
> There should be no such constraint. When a 16 bit value is loaded and you have a 64 bit atomic rmw, you need to do the following steps:
> - Align the address to 64 bit
> - Load the 64 bit value
> - Insert the 16 bit data at the correct bit position (xor can help)
> - RMW the 64 bit value
> You can do this with any data type.
LoongArch64 requires 16 instructions (maybe optimized to 12) when implementing RMW operations compatible with SLJIT_MOV_U8, and LL-SC should use fewer instructions.
64 bit byte reverse on systems without a byte reverse instruction takes more. The thing is, if code (such as WebAssembly) wants to do an 8 bit atomic update, it does not really matter how many instructions are needed; the job needs to be done somehow. And cpu specific instruction sequences are always shorter than using sljit instructions for emulation.
> 64 bit byte reverse on systems without a byte reverse instruction takes more

That is not a fair comparison, because there is no restriction on the number of instructions you can run on any CPU for that operation; but for LL/SC (which is conceptually one single instruction that can be used to implement "RMW") LoongArch apparently has such a restriction, and from my reading it would seem RISC-V does as well.
The end result is that in those cases the reservation always gets invalidated, and then you are left with a process that infloops.
For all practical purposes the compiler doesn't let you emit a load independently of a store; it deduces the type of operation used and opts for the CAS-like equivalent. For example, the following C11 code:

#include <stdatomic.h>

unsigned char s = 32;
volatile unsigned char m = 127;
atomic_uchar c;

void update(void) { c = (c > m) ? 0 : c + s; }

will most likely be compiled into something close to what you get now for loongarch with emulation disabled, ensuring that c is aligned to 32 bit; or, even better, the compiler will detect that LAM is available and can then use ammax.wu for the comparison and amadd.w for the increment.
An option could be to create a cmpxchg for all architectures, and create an optional LL/SC based variant, which has limited support.
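A hypothetical shape for such an entry point, mirroring the style of the existing emitters (the name and operands are illustrative, not part of the current sljit API):

/* Compare-and-swap: loads *mem_reg into dst_reg, and stores src_reg
   to *mem_reg only if the loaded value equals cmp_reg. */
SLJIT_API_FUNC_ATTRIBUTE sljit_s32 sljit_emit_atomic_cmpxchg(
	struct sljit_compiler *compiler, sljit_s32 op,
	sljit_s32 dst_reg,  /* receives the value observed in memory */
	sljit_s32 mem_reg,  /* register holding the address */
	sljit_s32 cmp_reg,  /* expected value */
	sljit_s32 src_reg); /* new value stored on success */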
Force-pushed from dea217c to 8797280
Assumes all atomics are at least 32 bit aligned by setting SLJIT_ATOMIC_WIDTH. In that mode, only the bits that correspond to the operand size are used and the rest are presumed to be 0. Operations with lower alignment than that are undefined and will result in a SIGBUS. If SLJIT_ATOMIC_EMULATION is defined, then additional code is enabled that allows manipulating those narrower types at a significant performance cost. Original code by Mingtao, all bugs mine.
WIP because atomic load/store not supported yet: zherczeg/sljit#172
assumes all atomics are at least 32 bit aligned through setting a per CPU type SLJIT_ATOMIC_WIDTH, and makes emulation optional and disabled by default with SLJIT_ATOMIC_EMULATION.
only the bits that correspond to the operand size are taken into consideration, with the rest presumed to be 0, so update tests to support that mode as well.
operations with lower alignment are undefined unless emulation is enabled.