
loongarch: sljit_emit_atomic_load/store implementation #172

Draft

carenas wants to merge 1 commit into master from the atomic-dragon branch
Conversation

Contributor

@carenas carenas commented Jun 9, 2023

Assumes all atomics are at least 32-bit aligned by setting a per-CPU-type SLJIT_ATOMIC_WIDTH, and makes emulation optional and disabled by default behind SLJIT_ATOMIC_EMULATION.

Only the bits that correspond to the operand size are taken into consideration, with the rest presumed to be 0, so the tests are updated to support that mode as well.

Operations with lower alignment are undefined unless emulation is enabled.

test_src/sljitTest.c (outdated review thread)
@carenas carenas force-pushed the atomic-dragon branch 2 times, most recently from 2393db3 to 073f7de on June 10, 2023 06:41
@@ -786,6 +786,7 @@ SLJIT_API_FUNC_ATTRIBUTE sljit_sw sljit_exec_offset(void* ptr);
#define SLJIT_NUMBER_OF_SAVED_FLOAT_REGISTERS 12
#define SLJIT_MASKED_SHIFT 1
#define SLJIT_MASKED_SHIFT32 1
#define SLJIT_ATOMIC_MIN_WIDTH 32
Owner

There should be no such constraint. When a 16 bit value is loaded and you have a 64 bit atomic rmw, you need to do the following steps:

  • Align address to 64 bit
  • Load 64 bit value
  • Insert the 16 bit data in the correct bit position (xor can help)
  • RMW the 64 bit value

You can do this with any data type.
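
For illustration only, here is a minimal C11 sketch of the mask-and-insert steps above, using a 32 bit compare-exchange in place of raw LL/SC (the helper is hypothetical and not part of sljit; a little-endian layout and a naturally aligned 16 bit operand are assumed):

#include <stdint.h>
#include <stdatomic.h>

/* hypothetical helper: 16 bit fetch-add done as an RMW on the aligned
   32 bit word that contains the value */
static uint16_t fetch_add_u16_via_u32(uint16_t *p, uint16_t delta)
{
    uintptr_t addr = (uintptr_t)p;
    /* align the address down to the containing 32 bit word */
    _Atomic uint32_t *word = (_Atomic uint32_t *)(addr & ~(uintptr_t)3);
    unsigned shift = (unsigned)(addr & 3) * 8;
    uint32_t mask = (uint32_t)0xffff << shift;

    uint32_t old_word = atomic_load(word);
    uint32_t new_word;
    uint16_t old_val;
    do {
        old_val = (uint16_t)((old_word & mask) >> shift);
        /* insert the updated 16 bit data at the correct bit position */
        new_word = (old_word & ~mask)
                 | ((uint32_t)(uint16_t)(old_val + delta) << shift);
        /* on failure old_word is refreshed and the loop retries */
    } while (!atomic_compare_exchange_weak(word, &old_word, new_word));
    return old_val;
}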

Contributor Author

@carenas carenas commented Jun 10, 2023

You are correct if we were implementing an atomic rmw operation directly, but we are not.

The load has to return a valid value for the type used, so all of the non-relevant bits from the wider register (which is loaded from memory with the correct alignment and includes the smaller type) have to be discarded.

By the time the store comes, you can't write the value back to memory or you will risk wiping several unrelated bytes, so a second load must be done to insert the smaller type in the right place before the atomic store can be completed.

All of that is very expensive, especially when you have a very simple RISC CPU that doesn't have a cmpxchg-like CAS instruction either.

I suspect the "problem" is also present at least on MIPS, and that the "solution" was to put into the specification what I put in code here.

You can LL/SC any type you want as long as it is aligned to the minimum width that is supported by the CPU; I am not sure if the data pulled from memory gets masked during transfer and the offset fixed up, like ARM64 seems to be doing (especially considering the granularity of the reservation).

If our API instead implemented a cmpxchg-like interface, then it could be implemented just like you mention.

Note that loongarch does support CAS-like RMW operations in hardware with its LAM extension, so it would be even easier to implement atomic add, xor and the rest of the usual RMW operations with that than by doing everything based on the LL/SC primitives.
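
As a rough illustration (not part of this PR): a plain, naturally aligned 32 bit fetch-add written with the GCC/Clang atomic builtins is the kind of operation a LoongArch64 compiler can lower to a single am* instruction such as amadd.w when those atomics are available, instead of emitting an ll.w/sc.w retry loop:

#include <stdint.h>

static inline int32_t fetch_add_i32(int32_t *p, int32_t v)
{
    /* with hardware atomic RMW support this can become a single instruction;
       otherwise the compiler falls back to an LL/SC loop */
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}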

Owner

You can add a flag to cpuinfo to tell that 8/16 bit updates are not efficient. Still, the emulation can use instructions which are not available otherwise, so it will still be better than what the user can do.

Contributor Author

@carenas carenas commented Jun 10, 2023

There is also the counterpart of that, which is (as I am proposing here) to make the limitation of the CPU visible at build time and optimize for the fast use case by default.

Of course, I will add a way to enable the emulation (I have some WIP code that does this using an SLJIT_ATOMIC_EMULATION define, which by now has been mostly implemented for SLJIT_MOV_U16 only) and I could share it for feedback once it actually works. The end result will probably look like the initial proposal from Mingtao, only uglier.

I frankly suspect that in real-world scenarios all atomics will be aligned correctly and have enough padding regardless, so the added code from the emulation is just dead weight. I was thinking of splitting the change so SLJIT_ATOMIC_WIDTH could be defined independently, and then implementing SLJIT_ATOMIC_EMULATION also for s390x, which unlike loongarch is likely to have it enabled by default, so it can be tested independently.
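
A rough sketch of the kind of build-time gating being discussed follows; the macro names come from this PR, but the structure and the error path are illustrative only, not the actual patch:

#if !(defined SLJIT_ATOMIC_EMULATION && SLJIT_ATOMIC_EMULATION)
	/* operands narrower than SLJIT_ATOMIC_MIN_WIDTH: refuse instead of emulating */
	if (op == SLJIT_MOV_U8 || op == SLJIT_MOV_U16)
		return SLJIT_ERR_UNSUPPORTED;
#endif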

Owner

We have an infrastructure for checking CPU features, and there are only two options here: 8/16 bit is either available or emulated. From the user's perspective, it is hard to use a "bit width" define, because it can have any value.

Contributor Author

While implementing the atomic store for SLJIT_MOV_U8, I found another scenario where the test will infloop. LL/SC is IMHO too fragile, and with most CPUs (even the traditionally RISC ones) providing CAS, it seems worth exploring whether an API more along those lines would be a better fit.

Contributor

> There should be no such constraint. When a 16 bit value is loaded and you have a 64 bit atomic rmw, you need to do the following steps:
>
>   • Align address to 64 bit
>   • Load 64 bit value
>   • Insert the 16 bit data in the correct bit position (xor can help)
>   • RMW the 64 bit value
>
> You can do this with any data type.

LoongArch64 requires 16 instructions (maybe optimized to 12) when implementing RMW operations compatible with SLJIT_MOV_U8, and LL-SC should use fewer instructions.

Owner

64 bit byte reverse on systems without a byte reverse instruction takes more. The thing is, if some code (such as WebAssembly) wants to do an 8 bit atomic update, it does not really matter how many instructions are needed; the job needs to be done somehow. And cpu specific instruction sequences are always shorter than using sljit instructions for emulation.

Contributor Author

@carenas carenas commented Jun 12, 2023

> 64 bit byte reverse on systems without a byte reverse instruction takes more

That is not a fair comparison, because there is no restriction on the number of instructions you can run on any CPU for that operation, but for LL/SC (which is conceptually one single instruction that can be used to implement "RMW") LoongArch apparently does have such restrictions, and from my reading it would seem RISCV does as well.

The end result is that in those cases the reservation always gets invalidated, and then you are left with a process that infloops.

For all practical purposes the compiler doesn't let you emit a load independently of a store; it deduces the type of operation used and opts for the CAS-like equivalent. For example, the following C11 code:

#include <stdatomic.h>

unsigned char s = 32;
volatile unsigned char m = 127;
atomic_uchar c = 0;
c = (c > m) ? 0 : c + s;

will most likely be compiled into something close to what you get now for loongarch with emulation disabled, ensuring that c is aligned to 32bit; or, even better, it will detect that LAM is available and then use ammax.wu for the comparison and amadd.w for the increment.

Owner

An option could be to create a cmpxchg for all architectures, and create an optional LL/SC based variant, which has limited support.
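
To make the trade-off concrete, here is a minimal C11 sketch (not a proposed sljit interface) of an 8 bit store built on a 32 bit compare-exchange; unlike a split LL/SC sequence, the extra load here cannot lose a reservation, the loop simply retries on contention:

#include <stdint.h>
#include <stdatomic.h>

/* hypothetical helper: store one byte through a CAS on the containing
   32 bit word (little-endian layout assumed) */
static void store_u8_via_u32(uint8_t *p, uint8_t value)
{
    uintptr_t addr = (uintptr_t)p;
    _Atomic uint32_t *word = (_Atomic uint32_t *)(addr & ~(uintptr_t)3);
    unsigned shift = (unsigned)(addr & 3) * 8;
    uint32_t mask = (uint32_t)0xff << shift;

    uint32_t old_word = atomic_load(word);
    uint32_t new_word;
    do {
        new_word = (old_word & ~mask) | ((uint32_t)value << shift);
    } while (!atomic_compare_exchange_weak(word, &old_word, new_word));
}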

@carenas carenas force-pushed the atomic-dragon branch 5 times, most recently from dea217c to 8797280 on June 11, 2023 18:19
Assumes all atomics are at least 32bit aligned by setting
SLJIT_ATOMIC_WIDTH.

In that mode, only the bits that correspond to the operand
size are used and the rest are presumed to be 0.

Operations with lower alignment than that are undefined and
will result in a SIGBUS.

If SLJIT_ATOMIC_EMULATION is defined, then additional code
is enabled that allows manipulating those narrower types at
a significant performance cost.

Original code by Mingtao, all bugs mine.
xry111 added a commit to xry111/areweloongyet that referenced this pull request Jul 16, 2023
WIP because atomic load/store not supported yet:
zherczeg/sljit#172