JIT compiler for RISC-V #275

Merged 1 commit into master on Oct 15, 2023
Conversation

@tevador (Owner) commented Oct 8, 2023

Currently it assumes RV64GC as the baseline, which matches what Linux expects in terms of ISA extensions.

Additionally, when configured with -DARCH=native, it will check for the presence of two extensions:

  • Zba - address generation (useful for the IADD_RS instruction)
  • Zbb - basic bit manipulation (useful mainly for rotations)

These two extensions give about a 3-5% speed-up compared to the base ISA and a 10% speed-up for cache initialization (Argon2 is heavy on rotations).
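
As a rough illustration of where these help (a hedged C++ sketch of the operation semantics, not the PR's emitter code; the helper names are made up), the shift-plus-add at the core of IADD_RS can lower to a single sh1add/sh2add/sh3add with Zba, and a 64-bit rotation can lower to a single ror/rori with Zbb instead of the shift/shift/or sequence the base ISA needs:

#include <cstdint>

// Illustrative helpers only (not code from this PR).

// Core of the RandomX IADD_RS instruction: dst += src << shift (shift is 0-3).
// With Zba, shifts 1-3 map to a single sh1add/sh2add/sh3add;
// on base RV64GC this needs a separate slli followed by an add.
inline uint64_t iadd_rs(uint64_t dst, uint64_t src, unsigned shift) {
    return dst + (src << shift);
}

// 64-bit rotate right, used heavily by Argon2 and by RandomX rotate instructions.
// With Zbb this is a single ror/rori; on base RV64GC it becomes
// srl + sll + or, with an extra temporary register.
inline uint64_t rotr64(uint64_t x, unsigned r) {
    return (x >> (r & 63)) | (x << ((64 - r) & 63));
}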

Two additional extensions will be useful in the future: V (for SIMD) and Zkn (for AES), but AFAIK there are currently no chips that support them.

Here are some benchmarks with my StarFive JH7110 based board:

> ./randomx-benchmark --auto --mine --largePages --threads 4
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 41.209 s
Initializing 4 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: 10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Reference result:  10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Performance: 74.8368 hashes per second

The mining performance is about 75% of the Raspberry Pi 4, which is not bad considering the lack of vector instructions.

> ./randomx-benchmark --auto --verify --largePages
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - light memory mode (256 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing ...
Memory initialized in 4.30378 s
Initializing 1 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: 10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Reference result:  10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Performance: 107.174 ms per hash

The verification performance is probably more useful as it's substantially faster than the interpreter, which takes over 1300 ms to verify a hash.
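
For context, verification only needs the 256 MiB cache and a single VM. With the public randomx.h C API, that path looks roughly like the sketch below (error handling omitted; the key and input strings are arbitrary examples):

#include <cstdio>
#include "randomx.h"

int main() {
    const char key[] = "RandomX example key";
    const char input[] = "RandomX example input";

    // Pick up whatever the current machine supports; with this PR that
    // should include RANDOMX_FLAG_JIT on RV64GC Linux.
    randomx_flags flags = randomx_get_flags();

    randomx_cache* cache = randomx_alloc_cache(flags);
    randomx_init_cache(cache, key, sizeof(key) - 1);

    // Passing no dataset selects light (verification) mode.
    randomx_vm* vm = randomx_create_vm(flags, cache, nullptr);

    char hash[RANDOMX_HASH_SIZE];
    randomx_calculate_hash(vm, input, sizeof(input) - 1, hash);

    for (int i = 0; i < RANDOMX_HASH_SIZE; ++i)
        printf("%02x", hash[i] & 0xff);
    printf("\n");

    randomx_destroy_vm(vm);
    randomx_release_cache(cache);
    return 0;
}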

@tevador requested review from hyc and SChernykh on October 8, 2023 at 21:05
@hyc (Collaborator) commented Oct 8, 2023

I've got my Lichee Pi 4a now but haven't set it up yet. Will try this in a few days.

Review comment on src/tests/riscv64_zkn.s (outdated, resolved)
@SChernykh (Collaborator) commented Oct 10, 2023

The code looks good. How many hashes did you run the benchmark for? I think it needs to be tested with at least 10M hashes and the result must be identical to what x64/aarch64 versions produce.

10M hashes is about 1.5 days with your board running at 75 h/s, so it's feasible.

@tevador (Owner, Author) commented Oct 10, 2023

I'm running 1M now (should take a couple hours) and then I'll try 10M. Then we'll have to repeat the runs with the native build. Do you have the correct hashes for 1M and 10M?

@tevador (Owner, Author) commented Oct 10, 2023

Some notes about hardware AES:

RISC-V actually has 2 different crypto extensions:

  1. Scalar crypto, which uses the integer registers. This extension was ratified in January 2022, and the first chips supporting it might appear next year.
  2. Vector crypto, which depends on the V extension and uses the vector registers. It was ratified yesterday, and it will probably take a couple of years before the first chips support it.

So it's possible that we'll have to support at least 2 different extensions in the future, with the scalar one likely coming first. In fact, the scalar crypto AES instructions are split into two separate extensions: Zkne (encryption only) and Zknd (decryption only). Hopefully, hardware designers will be sane and include both.
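
For illustration, a build-time gate would then have to require both halves. A minimal sketch, assuming the standard __riscv_* per-extension macros and a made-up constant name (not code from this PR):

// Hypothetical compile-time gate: enable hardware AES only when both
// scalar-crypto AES sub-extensions are present, since the RandomX AES
// generators use both encryption- and decryption-direction rounds.
// __riscv_zkne / __riscv_zknd are assumed to be the per-extension macros
// defined by GCC/Clang when the target ISA string includes them.
#if defined(__riscv_zkne) && defined(__riscv_zknd)
constexpr bool kHaveScalarAes = true;
#else
constexpr bool kHaveScalarAes = false;
#endif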

In order to limit the scope of this PR, I did not include hardware AES for now. It would not be very useful anyway, since no chips you can buy today support it.

@SChernykh (Collaborator) commented Oct 10, 2023

I'm running 1M now (should take a couple hours) and then I'll try 10M. Then we'll have to repeat the runs with the native build. Do you have the correct hashes for 1M and 10M?

1M b4b51690546a44a459ad3e043369e1237f61c5ec27046d2f5249cf0b3f57e00c
10M c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b

This is what master branch randomx-benchmark shows.

@tevador (Owner, Author) commented Oct 10, 2023

> ./randomx-benchmark --auto --mine --largePages --threads 4 --nonces 1000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 44.2732 s
Initializing 4 virtual machine(s) ...
Running benchmark (1000000 nonces) ...
Calculated result: b4b51690546a44a459ad3e043369e1237f61c5ec27046d2f5249cf0b3f57e00c
Performance: 72.6978 hashes per second

@tevador (Owner, Author) commented Oct 12, 2023

The 10M result also matches with the default build:

> ./randomx-benchmark --auto --mine --largePages --threads 4 --nonces 10000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 44.2651 s
Initializing 4 virtual machine(s) ...
Running benchmark (10000000 nonces) ...
Calculated result: c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b
Performance: 72.7311 hashes per second

@felixonmars commented

Tried on more boards, all built with default rv64gc, with comparison to x86_64:

[screenshot: benchmark results across boards, 2023-10-14-09-36-04]

@SChernykh (Collaborator) commented

@felixonmars How many threads did you run on SG2042? The optimal number of threads is 32 there.

@hyc (Collaborator) commented Oct 14, 2023

Was going to say the same. If that run was with 64 threads, need to try again with 32.

tho ... 1356 / 40 = 33. Probably won't make a huge difference.

@felixonmars commented

@felixonmars How many threads did you run on SG2042? The optimal number of threads is 32 there.

I did run with 64. The results vary quite a lot between runs, sometimes dropping to only 2 h/s. I'll retry with 32 later (it's fully loaded now).

@tevador (Owner, Author) commented Oct 14, 2023

@felixonmars I can see from your JH7110 hashrates that you are probably not using huge pages. Try --largePages when running the benchmark; it can boost the performance by 30%.
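
(For anyone setting this up: large-page backing on Linux typically means reserving 2 MiB huge pages first and then mapping with MAP_HUGETLB. A generic sketch, not RandomX's allocation code:)

#include <sys/mman.h>
#include <cstdio>

// Generic Linux example, not RandomX's allocator: request memory backed by
// 2 MiB huge pages. This fails unless huge pages were reserved beforehand,
// e.g. via the vm.nr_hugepages sysctl.
int main() {
    const size_t size = 32u << 20; // 32 MiB, a multiple of the 2 MiB page size
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    printf("got %zu bytes backed by huge pages\n", size);
    munmap(p, size);
    return 0;
}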

I ran the 10M hashes with the native build and the result also matches.

> ./randomx-benchmark --auto --mine --largePages --threads 4 --nonces 10000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 41.2095 s
Initializing 4 virtual machine(s) ...
Running benchmark (10000000 nonces) ...
Calculated result: c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b
Performance: 74.9915 hashes per second

I will wait for someone to independently verify my hashes before merging this PR.

@felixonmars commented

I can see from your JH7110 hashrates that you are probably not using hugepages. Try to use --largePages when running the benchmark. It can boost the performance by 30%.

Indeed. I tried --largePages, but the results are much lower on the SG2042, and my TH1520 kernel doesn't support huge pages, so I disabled it for all my benchmarks :(

@hyc (Collaborator) commented Oct 14, 2023

Got some results on the Lichee Pi 4a. Numbers were quite slow on the shipped firmware (dated July 2023); after updating to the September 20, 2023 image, my results look more reasonable. Speed with 1 thread and no large pages was 34.89 H/s, the same as @felixonmars got. With large pages:
1 thread = 43.45 H/s
2 threads = 78.71 H/s
4 threads = 132.54 H/s

@felixonmars commented

I will wait for someone to independently verify my hashes before merging this PR.

$ ./randomx-benchmark --auto --mine --largePages --threads 32 --nonces 10000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (64 threads) ...
Memory initialized in 7.06713 s
Initializing 32 virtual machine(s) ...
Running benchmark (10000000 nonces) ...
Calculated result: c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b
Performance: 424.954 hashes per second

Hashes match here. With large pages on and threads set to 32, the performance is much lower though.

@hyc (Collaborator) commented Oct 14, 2023

Hashes match here. With large pages on and threads set to 32, the performance is much lower though.

Since the C920 caches are arranged in clusters of 4 cores, you'd need to pin threads and memory allocations to particular cores to obtain an optimal memory layout, along the lines of what the numactl command does. The --affinity option would help here.
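
(For reference, pinning a worker thread to a given core on Linux looks roughly like this; a generic sketch, not the benchmark's --affinity implementation.)

#ifndef _GNU_SOURCE
#define _GNU_SOURCE // for cpu_set_t / pthread_setaffinity_np (g++ defines it already)
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Generic sketch: pin the calling thread to one CPU. On clustered designs
// like the C920, keeping the threads that share a cache cluster on the same
// group of cores avoids cross-cluster traffic.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    int err = pin_to_cpu(0);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return 1;
    }
    printf("pinned to CPU 0\n");
    return 0;
}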

@tevador merged commit 2777910 into master on Oct 15, 2023
42 checks passed