Add AVX2 polyvec_{de,}compress
#410
Conversation
Together with #409, this PR achieves this performance on my machine:
That's outperforming the code from the official Kyber repo.
This commit adds the AVX2 intrinsic implementations of polyvec_compress and polyvec_decompress from the official Kyber repository. As part of #224 it was identified that the majority of the performance difference in keypair and decaps between our current implementation and the Kyber AVX2 implementation is due to the AVX2 polyvec_compress and polyvec_decompress. This commit adds these two functions to the native interface and adds the AVX2 intrinsic-based implementations from the Kyber repository. These are almost verbatim copies. The only two differences are:

1) The AVX2 implementation requires the uint8_t buffer to be slightly larger than MLKEM_POLYVECCOMPRESSEDBYTES, so that full vectors can be stored/loaded. The official implementation allocated those extra bytes at the top level of the function. That would be slightly messy in our implementation, so I instead allocate the larger buffer in polyvec_compress_avx2/polyvec_decompress_avx2 itself and copy the inputs/outputs.

2) The official AVX2 implementation extended the poly type to also be accessible as a __m256i*. I changed this to a cast, as we guarantee the alignment in another way.

Below are the performance results on my 13th Gen Intel i7-1360P (Raptor Lake) using gcc 14.2.1 from the Arch Linux repo.

| part     | Our code 6aa6118 | Kyber repo | Our code (+polyvec_{,de}compress) |
| -------- | ---------------- | ---------- | --------------------------------- |
| 512 kg   | 22353            | 22348      | 22252                             |
| 512 enc  | 27820            | 24868      | 26472                             |
| 512 dec  | 35663            | 34984      | 33107                             |
| 768 kg   | 39626            | 38070      | 41590                             |
| 768 enc  | 43605            | 39056      | 44049                             |
| 768 dec  | 54916            | 53726      | 53432                             |
| 1024 kg  | 58983            | 53532      | 57411                             |
| 1024 enc | 65402            | 56698      | 61613                             |
| 1024 dec | 80370            | 75874      | 74681                             |

Signed-off-by: Matthias J. Kannwischer <[email protected]>
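The padded-buffer approach from point 1) can be sketched as follows. This is a simplified illustration, not the actual mlkem-native code: the size constants, the kernel stub, and the wrapper name are all hypothetical stand-ins.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for illustration only; the real values come from
 * the parameter set and the AVX2 kernel's store width. */
#define POLYVECCOMPRESSEDBYTES 960
#define AVX2_PAD 32 /* slack so the kernel can issue full 32-byte stores */

/* Stand-in for the verbatim Kyber AVX2 kernel, which may write up to
 * AVX2_PAD bytes past POLYVECCOMPRESSEDBYTES into its output buffer. */
static void avx2_kernel_stub(uint8_t *out)
{
  memset(out, 0x42, POLYVECCOMPRESSEDBYTES + AVX2_PAD);
}

/* The wrapper allocates the padded scratch buffer locally and copies only
 * the valid prefix into the caller's exactly-sized output, so callers never
 * have to over-allocate. */
void polyvec_compress_avx2_sketch(uint8_t r[POLYVECCOMPRESSEDBYTES])
{
  uint8_t buf[POLYVECCOMPRESSEDBYTES + AVX2_PAD];
  avx2_kernel_stub(buf);
  memcpy(r, buf, POLYVECCOMPRESSEDBYTES);
}
```

The cost of the extra memcpy is small compared to the compression kernel itself, and it keeps the padding an internal detail of the native backend.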
After the recent merges (adding PMU support, adding LTO support), let's re-do the benchmarks with this PR rebased on top of 75f52dc. TL;DR: We do see performance gains for encaps and decaps of up to 8% with this PR. This is consistent across platforms (Raptor Lake, c7i) and compiler versions (gcc 13.2.0, gcc 14.2.1). Here are the results on my Raptor Lake (gcc 14.2.1)
this PR (0561120):
Here are the results on the EC2 c7i instance (gcc 13.2.0):
this PR (0561120):
Component benchmarks (trimmed to just polyvec_compress and polyvec_decompress)

Here are the results on my Raptor Lake (gcc 14.2.1)
this PR (0561120):
Here are the results on the EC2 c7i instance (gcc 13.2.0):
this PR (0561120):
BENCH("polyvec-compress",
      polyvec_compress((uint8_t *)data0, (polyvec *)data1));
BENCH("polyvec-decompress",
      polyvec_decompress((polyvec *)data0, (uint8_t *)data1));
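The BENCH invocations above rely on the project's benchmarking harness. A minimal sketch of what such a macro could look like is below; the shape is assumed, and the real harness reads PMU cycle counters rather than wall-clock time.

```c
#include <stdio.h>
#include <time.h>

#define BENCH_ITERATIONS 1000

/* Minimal BENCH-style macro (illustrative only): run `stmt` repeatedly and
 * report the average wall-clock time per call. */
#define BENCH(label, stmt)                                              \
  do                                                                    \
  {                                                                     \
    clock_t bench_start = clock();                                      \
    for (int bench_i = 0; bench_i < BENCH_ITERATIONS; bench_i++)        \
    {                                                                   \
      stmt;                                                             \
    }                                                                   \
    clock_t bench_end = clock();                                        \
    printf("%s: %.1f ns/call\n", (label),                               \
           1e9 * (double)(bench_end - bench_start) / CLOCKS_PER_SEC /   \
               BENCH_ITERATIONS);                                       \
  } while (0)
```

A cycle-accurate harness would replace clock() with a PMU read and typically subtract measurement overhead, but the macro structure stays the same.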
this needs to be removed.
@mkannwischer Could you also measure with
@mkannwischer Would you mind doing the PR in smaller steps, as follows:
- Hoist out the polynomial compression into a separate function first. This will require suffixing existing polynomial [de]compression routines with the `d`-value -- which is anyway a good idea and in line with our naming for `scalar_[de]compress`.
- Then, allow native replacement of those poly compress/decompress variants.
The reason I'd like to do it this way is: (a) It's cleaner. (b) Once we have the polynomial [de]compression hoisted out, it's easier to investigate how it could be rewritten for better auto-vectorization.
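To make the naming suggestion concrete, here is a sketch of what d-suffixed scalar [de]compression routines could look like for d = 4. The names and signatures are illustrative, built from the standard ML-KEM compression formula round(2^d * u / q) mod 2^d with q = 3329; they are not copied from the repository.

```c
#include <stdint.h>

#define MLKEM_Q 3329

/* Compress u in [0, q) to d = 4 bits: round(2^4 * u / q) mod 2^4.
 * The d-value is spelled out in the function name, per the suggestion. */
static uint32_t scalar_compress_d4(uint32_t u)
{
  return (((u << 4) + MLKEM_Q / 2) / MLKEM_Q) & 0xF;
}

/* Decompress 4 bits back to [0, q): round(q * t / 2^4). */
static uint32_t scalar_decompress_d4(uint32_t t)
{
  return (t * MLKEM_Q + 8) >> 4;
}
```

Suffixing with the d-value makes each variant's output range explicit at the call site and gives the native interface one well-defined hook per (poly, d) combination.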
Here are results for clang. TL;DR: clang is not much better at auto-vectorizing this. In these results it looks like clang overall performs much worse on Raptor Lake vs. gcc 14, and a bit better on c7i vs. gcc 13. I tried to re-run the gcc 14 benchmarks on Raptor Lake today and I cannot reproduce the numbers I got yesterday. Maybe I made a mistake yesterday - but hopefully that won't matter for this PR.

Raptor Lake (clang 18.1.8), main (75f52dc):
this PR (0561120):
c7i (clang 18.1.3)
this PR (0561120):
Okay, I'll do this as a separate PR first.
After #435 was merged, this would require a major rework.