JIT compiler for RISC-V #275

Merged 1 commit into master on Oct 15, 2023
Conversation

@tevador (Owner) commented Oct 8, 2023

Currently it assumes RV64GC as the baseline, which matches what Linux expects in terms of ISA extensions.

Additionally, when configured with -DARCH=native, it will check for the presence of two extensions:

  • Zba - address generation (useful for the IADD_RS instruction)
  • Zbb - basic bit manipulation (useful mainly for rotations)

These two extensions give about a 3-5% speed-up compared to the base ISA and a 10% speed-up for cache initialization (Argon2 is heavy on rotations).
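
As a rough illustration of where these help (a hedged C++ sketch of the operation semantics, not the PR's emitter code; the helper names are made up), the shift-plus-add at the core of IADD_RS can lower to a single sh1add/sh2add/sh3add with Zba, and a 64-bit rotation can lower to a single ror/rori with Zbb instead of the shift/shift/or sequence the base ISA needs:

#include <cstdint>

// Illustrative helpers only (not code from this PR).

// Core of the RandomX IADD_RS instruction: dst += src << shift (shift is 0-3).
// With Zba, shifts 1-3 map to a single sh1add/sh2add/sh3add;
// on base RV64GC this needs a separate slli followed by an add.
inline uint64_t iadd_rs(uint64_t dst, uint64_t src, unsigned shift) {
    return dst + (src << shift);
}

// 64-bit rotate right, used heavily by Argon2 and by RandomX rotate instructions.
// With Zbb this is a single ror/rori; on base RV64GC it becomes
// srl + sll + or, with an extra temporary register.
inline uint64_t rotr64(uint64_t x, unsigned r) {
    return (x >> (r & 63)) | (x << ((64 - r) & 63));
}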

Two additional extensions will be useful in the future: V (for SIMD) and Zkn (for AES), but AFAIK there are currently no chips that support them.

Here are some benchmarks with my StarFive JH7110 based board:

> ./randomx-benchmark --auto --mine --largePages --threads 4
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 41.209 s
Initializing 4 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: 10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Reference result:  10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Performance: 74.8368 hashes per second

The mining performance is about 75% of the Raspberry Pi 4, which is not bad considering the lack of vector instructions.

> ./randomx-benchmark --auto --verify --largePages
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - light memory mode (256 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing ...
Memory initialized in 4.30378 s
Initializing 1 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: 10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Reference result:  10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Performance: 107.174 ms per hash

The verification performance is probably more useful as it's substantially faster than the interpreter, which takes over 1300 ms to verify a hash.
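
For context, verification only needs the 256 MiB cache and a single VM. With the public randomx.h C API, that path looks roughly like the sketch below (error handling omitted; the key and input strings are arbitrary examples):

#include <cstdio>
#include "randomx.h"

int main() {
    const char key[] = "RandomX example key";
    const char input[] = "RandomX example input";

    // Pick up whatever the current machine supports; with this PR that
    // should include RANDOMX_FLAG_JIT on RV64GC Linux.
    randomx_flags flags = randomx_get_flags();

    randomx_cache* cache = randomx_alloc_cache(flags);
    randomx_init_cache(cache, key, sizeof(key) - 1);

    // Passing no dataset selects light (verification) mode.
    randomx_vm* vm = randomx_create_vm(flags, cache, nullptr);

    char hash[RANDOMX_HASH_SIZE];
    randomx_calculate_hash(vm, input, sizeof(input) - 1, hash);

    for (int i = 0; i < RANDOMX_HASH_SIZE; ++i)
        printf("%02x", hash[i] & 0xff);
    printf("\n");

    randomx_destroy_vm(vm);
    randomx_release_cache(cache);
    return 0;
}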

@tevador requested review from hyc and SChernykh on October 8, 2023 at 21:05
@hyc (Collaborator) commented Oct 8, 2023

I've got my Lichee Pi 4a now but haven't set it up yet. Will try this in a few days.

Review comment on src/tests/riscv64_zkn.s (outdated, resolved)
@SChernykh (Collaborator) commented Oct 10, 2023

The code looks good. How many hashes did you run the benchmark for? I think it needs to be tested with at least 10M hashes and the result must be identical to what x64/aarch64 versions produce.

10M hashes is about 1.5 days with your board running at 75 h/s, so it's feasible.

@tevador (Owner, Author) commented Oct 10, 2023

I'm running 1M now (should take a couple hours) and then I'll try 10M. Then we'll have to repeat the runs with the native build. Do you have the correct hashes for 1M and 10M?

@tevador (Owner, Author) commented Oct 10, 2023

Some notes about hardware AES:

RISC-V actually has 2 different crypto extensions:

  1. Scalar crypto, which uses the integer registers. This extension was ratified in January 2022, and the first chips supporting it might appear next year.
  2. Vector crypto, which depends on the V extension and uses the vector registers. It was ratified yesterday, and it will probably take a couple of years before the first chips support it.

So it's possible that we'll have to support at least 2 different extensions in the future, with the scalar one likely coming first. In fact, the scalar crypto AES instructions are split into two separate extensions: Zkne (encryption only) and Zknd (decryption only). Hopefully, hardware designers will be sane and include both.
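
For illustration, a build-time gate would then have to require both halves. A minimal sketch, assuming the standard __riscv_* per-extension macros and a made-up constant name (not code from this PR):

// Hypothetical compile-time gate: enable hardware AES only when both
// scalar-crypto AES sub-extensions are present, since the RandomX AES
// generators use both encryption- and decryption-direction rounds.
// __riscv_zkne / __riscv_zknd are assumed to be the per-extension macros
// defined by GCC/Clang when the target ISA string includes them.
#if defined(__riscv_zkne) && defined(__riscv_zknd)
constexpr bool kHaveScalarAes = true;
#else
constexpr bool kHaveScalarAes = false;
#endif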

In order to limit the scope of this PR, I did not include hardware AES for now. It would not be very useful anyway, since no chips you can buy today support it.

@SChernykh (Collaborator) commented Oct 10, 2023

I'm running 1M now (should take a couple hours) and then I'll try 10M. Then we'll have to repeat the runs with the native build. Do you have the correct hashes for 1M and 10M?

1M b4b51690546a44a459ad3e043369e1237f61c5ec27046d2f5249cf0b3f57e00c
10M c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b

This is what master branch randomx-benchmark shows.

@tevador (Owner, Author) commented Oct 10, 2023

> ./randomx-benchmark --auto --mine --largePages --threads 4 --nonces 1000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 44.2732 s
Initializing 4 virtual machine(s) ...
Running benchmark (1000000 nonces) ...
Calculated result: b4b51690546a44a459ad3e043369e1237f61c5ec27046d2f5249cf0b3f57e00c
Performance: 72.6978 hashes per second

@tevador (Owner, Author) commented Oct 12, 2023

The 10M result also matches with the default build:

> ./randomx-benchmark --auto --mine --largePages --threads 4 --nonces 10000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 44.2651 s
Initializing 4 virtual machine(s) ...
Running benchmark (10000000 nonces) ...
Calculated result: c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b
Performance: 72.7311 hashes per second

@felixonmars commented

Tried on more boards, all built with default rv64gc, with comparison to x86_64:

[screenshot: benchmark results across boards, 2023-10-14-09-36-04]

@SChernykh (Collaborator) commented

@felixonmars How many threads did you run on SG2042? The optimal number of threads is 32 there.

@hyc (Collaborator) commented Oct 14, 2023

Was going to say the same. If that run was with 64 threads, need to try again with 32.

tho ... 1356 / 40 = 33. Probably won't make a huge difference.

@felixonmars commented

@felixonmars How many threads did you run on SG2042? The optimal number of threads is 32 there.

I did run with 64. The results vary quite a lot between runs, sometimes dropping to only 2 h/s. I'll retry with 32 later (it's fully loaded now).

@tevador (Owner, Author) commented Oct 14, 2023

@felixonmars I can see from your JH7110 hashrates that you are probably not using huge pages. Try --largePages when running the benchmark; it can boost the performance by 30%.
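
(For anyone setting this up: large-page backing on Linux typically means reserving 2 MiB huge pages first and then mapping with MAP_HUGETLB. A generic sketch, not RandomX's allocation code:)

#include <sys/mman.h>
#include <cstdio>

// Generic Linux example, not RandomX's allocator: request memory backed by
// 2 MiB huge pages. This fails unless huge pages were reserved beforehand,
// e.g. via the vm.nr_hugepages sysctl.
int main() {
    const size_t size = 32u << 20; // 32 MiB, a multiple of the 2 MiB page size
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    printf("got %zu bytes backed by huge pages\n", size);
    munmap(p, size);
    return 0;
}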

I ran the 10M hashes with the native build and the result also matches.

> ./randomx-benchmark --auto --mine --largePages --threads 4 --nonces 10000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (4 threads) ...
Memory initialized in 41.2095 s
Initializing 4 virtual machine(s) ...
Running benchmark (10000000 nonces) ...
Calculated result: c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b
Performance: 74.9915 hashes per second

I will wait for someone to independently verify my hashes before merging this PR.

@felixonmars commented

I can see from your JH7110 hashrates that you are probably not using hugepages. Try to use --largePages when running the benchmark. It can boost the performance by 30%.

Indeed. I tried --largePages, but the results are much lower on the SG2042, and my TH1520 kernel doesn't support huge pages, so I disabled it for all my benchmarks :(

@hyc (Collaborator) commented Oct 14, 2023

Got some results on the Lichee Pi 4a. Numbers were quite slow on the shipped firmware (dated July 2023); after updating to the September 20, 2023 image, my results look more reasonable. Speed with 1 thread and no large pages was 34.89 H/s, the same as @felixonmars got. With large pages:
1 thread = 43.45 H/s
2 threads = 78.71 H/s
4 threads = 132.54 H/s

@felixonmars commented

I will wait for someone to independently verify my hashes before merging this PR.

$ ./randomx-benchmark --auto --mine --largePages --threads 32 --nonces 10000000
RandomX benchmark v1.1.12
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - software AES mode
 - large pages mode
 - batch mode
Initializing (64 threads) ...
Memory initialized in 7.06713 s
Initializing 32 virtual machine(s) ...
Running benchmark (10000000 nonces) ...
Calculated result: c32b1910abef45bc659ffd8d5aeeb0384f7122e623f2fafe568a393a9c3be60b
Performance: 424.954 hashes per second

Hashes match here. With large pages on and threads set to 32, the performance is much lower though.

@hyc (Collaborator) commented Oct 14, 2023

Hashes match here. With large pages on and threads set to 32, the performance is much lower though.

Since the C920 caches are arranged in clusters of 4 cores, you'd need to pin threads and memory allocations to particular cores to obtain an optimal memory layout, along the lines of what the numactl command does. The --affinity option would help here.
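
(For reference, pinning a worker thread to a given core on Linux looks roughly like this; a generic sketch, not the benchmark's --affinity implementation.)

#ifndef _GNU_SOURCE
#define _GNU_SOURCE // for cpu_set_t / pthread_setaffinity_np (g++ defines it already)
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Generic sketch: pin the calling thread to one CPU. On clustered designs
// like the C920, keeping the threads that share a cache cluster on the same
// group of cores avoids cross-cluster traffic.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    int err = pin_to_cpu(0);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return 1;
    }
    printf("pinned to CPU 0\n");
    return 0;
}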

@tevador merged commit 2777910 into master on Oct 15, 2023
42 checks passed