[DRAFT] AArch64: Use transposed coefficient order in NTT domain #542

Closed
wants to merge 5 commits into from



@hanno-becker hanno-becker commented Dec 17, 2024

This PR experiments with a transposed coefficient order in the NTT domain for the AArch64 backend.

The motivation for this ordering is the reduced shuffling cost in the NTT and invNTT. The downside is that shuffling is introduced during key generation and polynomial serialization.

At this point, there are 'clean' AArch64 assembly implementations for (a) poly_tobytes, (b) poly_frombytes, (c) poly_transpose. Initial benchmarks suggest that at least the clean AArch64 assembly versions of poly_{to,from}bytes() have no performance benefit compared to C, so we may want to remove them. SLOTHY-optimizing them requires adding support for further instructions to SLOTHY first.
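To make the idea of a transposed order concrete, here is a minimal C sketch of one plausible permutation. It is not taken from this PR: the block size of 32 (4 vector registers of 8 int16 lanes), the index map, and the names `transposed_index`, `poly_to_transposed` and `poly_from_transposed` are all assumptions for illustration. The idea is that an interleaving `st4` would write lane `j` of register `i` to position `4*j + i` of a block, whereas plain stores write it to position `8*i + j`.

```c
#include <stddef.h>
#include <stdint.h>

#define N 256    /* coefficients per ML-KEM polynomial */
#define BLOCK 32 /* assumed: 4 vector registers x 8 int16 lanes */

/* Hypothetical index map: where standard-order coefficient k lives in the
 * transposed layout. Within each block of 32, the 8x4 arrangement that an
 * interleaving store would produce is replaced by its 4x8 transpose. */
size_t transposed_index(size_t k)
{
    size_t block = k / BLOCK;
    size_t r = k % BLOCK;
    size_t reg = r % 4;  /* register that would hold the coefficient */
    size_t lane = r / 4; /* lane within that register */
    return block * BLOCK + reg * 8 + lane;
}

/* Standard order -> transposed order (out of place). */
void poly_to_transposed(int16_t out[N], const int16_t in[N])
{
    for (size_t k = 0; k < N; k++)
        out[transposed_index(k)] = in[k];
}

/* Transposed order -> standard order (out of place). */
void poly_from_transposed(int16_t out[N], const int16_t in[N])
{
    for (size_t k = 0; k < N; k++)
        out[k] = in[transposed_index(k)];
}
```

Applying `poly_to_transposed` followed by `poly_from_transposed` returns the original polynomial. The real permutation in the backend may differ in block size and lane width.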

@hanno-becker hanno-becker marked this pull request as ready for review December 17, 2024 21:32
@hanno-becker hanno-becker requested a review from a team as a code owner December 17, 2024 21:32
@hanno-becker hanno-becker force-pushed the aarch64_poly_frombytes branch from 08da95f to 34bd687 Compare December 17, 2024 21:33
@hanno-becker hanno-becker added the benchmark label (this PR should be benchmarked in CI) Dec 17, 2024

@oqs-bot oqs-bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 29472 cycles 29180 cycles 1.01
ML-KEM-512 encaps 35726 cycles 35550 cycles 1.00
ML-KEM-512 decaps 46100 cycles 46096 cycles 1.00
ML-KEM-768 keypair 49729 cycles 49220 cycles 1.01
ML-KEM-768 encaps 55798 cycles 55380 cycles 1.01
ML-KEM-768 decaps 70297 cycles 70208 cycles 1.00
ML-KEM-1024 keypair 73281 cycles 72237 cycles 1.01
ML-KEM-1024 encaps 81874 cycles 81077 cycles 1.01
ML-KEM-1024 decaps 101362 cycles 100881 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 13504 cycles 13533 cycles 1.00
ML-KEM-512 encaps 17247 cycles 17316 cycles 1.00
ML-KEM-512 decaps 22712 cycles 22850 cycles 0.99
ML-KEM-768 keypair 22512 cycles 22537 cycles 1.00
ML-KEM-768 encaps 24512 cycles 24505 cycles 1.00
ML-KEM-768 decaps 32470 cycles 32559 cycles 1.00
ML-KEM-1024 keypair 31354 cycles 31461 cycles 1.00
ML-KEM-1024 encaps 34919 cycles 34950 cycles 1.00
ML-KEM-1024 decaps 45538 cycles 45838 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Intel Xeon 4th gen (c7i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 34bd687 Previous: 2dd10c1 Ratio
ML-KEM-512 keypair 13956 cycles 13531 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.
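As a worked example of the alert above: the ratio is the current cycle count divided by the previous one, and the alert fires when it exceeds the threshold. The sketch below assumes that rule (strictly greater than the threshold); the function names are hypothetical and not part of github-action-benchmark.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ratio of current to previous cycle counts; above 1.0 means slower. */
double cycle_ratio(uint64_t current, uint64_t previous)
{
    return (double)current / (double)previous;
}

/* Assumed alert rule: flag a regression when the ratio exceeds the
 * threshold (1.03 in the alert above). */
bool is_regression(uint64_t current, uint64_t previous, double threshold)
{
    return cycle_ratio(current, previous) > threshold;
}
```

For the flagged row, 13956 / 13531 is roughly 1.0314, which is displayed as 1.03 but still exceeds the 1.03 threshold. By the same rule, the Cortex-A76 keypair row (29472 vs 29180 cycles, about 1.01) would not be flagged.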

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 20334 cycles 20340 cycles 1.00
ML-KEM-512 encaps 27010 cycles 27297 cycles 0.99
ML-KEM-512 decaps 36103 cycles 35844 cycles 1.01
ML-KEM-768 keypair 34902 cycles 34886 cycles 1.00
ML-KEM-768 encaps 38135 cycles 38189 cycles 1.00
ML-KEM-768 decaps 50945 cycles 50924 cycles 1.00
ML-KEM-1024 keypair 47965 cycles 48027 cycles 1.00
ML-KEM-1024 encaps 54139 cycles 54197 cycles 1.00
ML-KEM-1024 decaps 71555 cycles 71804 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 18135 cycles 18137 cycles 1.00
ML-KEM-512 encaps 23188 cycles 23201 cycles 1.00
ML-KEM-512 decaps 30504 cycles 30511 cycles 1.00
ML-KEM-768 keypair 31069 cycles 31078 cycles 1.00
ML-KEM-768 encaps 34197 cycles 34162 cycles 1.00
ML-KEM-768 decaps 44765 cycles 44729 cycles 1.00
ML-KEM-1024 keypair 44632 cycles 44565 cycles 1.00
ML-KEM-1024 encaps 49913 cycles 49897 cycles 1.00
ML-KEM-1024 decaps 64436 cycles 64402 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 15148 cycles 15077 cycles 1.00
ML-KEM-512 encaps 19667 cycles 19660 cycles 1.00
ML-KEM-512 decaps 26296 cycles 26308 cycles 1.00
ML-KEM-768 keypair 25649 cycles 25578 cycles 1.00
ML-KEM-768 encaps 28150 cycles 28154 cycles 1.00
ML-KEM-768 decaps 37934 cycles 37846 cycles 1.00
ML-KEM-1024 keypair 35756 cycles 35641 cycles 1.00
ML-KEM-1024 encaps 41039 cycles 40966 cycles 1.00
ML-KEM-1024 decaps 54472 cycles 54548 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 34885 cycles 34897 cycles 1.00
ML-KEM-512 encaps 44988 cycles 44978 cycles 1.00
ML-KEM-512 decaps 58930 cycles 58945 cycles 1.00
ML-KEM-768 keypair 59118 cycles 59225 cycles 1.00
ML-KEM-768 encaps 71690 cycles 71828 cycles 1.00
ML-KEM-768 decaps 89296 cycles 89397 cycles 1.00
ML-KEM-1024 keypair 87543 cycles 87542 cycles 1.00
ML-KEM-1024 encaps 104592 cycles 104646 cycles 1.00
ML-KEM-1024 decaps 127574 cycles 127678 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 19218 cycles 19003 cycles 1.01
ML-KEM-512 encaps 23623 cycles 23602 cycles 1.00
ML-KEM-512 decaps 30590 cycles 30765 cycles 0.99
ML-KEM-768 keypair 32665 cycles 32272 cycles 1.01
ML-KEM-768 encaps 35910 cycles 35727 cycles 1.01
ML-KEM-768 decaps 45865 cycles 45872 cycles 1.00
ML-KEM-1024 keypair 47546 cycles 46840 cycles 1.02
ML-KEM-1024 encaps 53097 cycles 52612 cycles 1.01
ML-KEM-1024 decaps 66766 cycles 66495 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 56603 cycles 56640 cycles 1.00
ML-KEM-512 encaps 69452 cycles 69527 cycles 1.00
ML-KEM-512 decaps 91509 cycles 91505 cycles 1.00
ML-KEM-768 keypair 91842 cycles 91903 cycles 1.00
ML-KEM-768 encaps 107769 cycles 107822 cycles 1.00
ML-KEM-768 decaps 136356 cycles 136407 cycles 1.00
ML-KEM-1024 keypair 134654 cycles 134896 cycles 1.00
ML-KEM-1024 encaps 155204 cycles 155420 cycles 1.00
ML-KEM-1024 decaps 191555 cycles 191704 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 29481 cycles 29190 cycles 1.01
ML-KEM-512 encaps 35733 cycles 35560 cycles 1.00
ML-KEM-512 decaps 46114 cycles 46112 cycles 1.00
ML-KEM-768 keypair 49749 cycles 49228 cycles 1.01
ML-KEM-768 encaps 55801 cycles 55407 cycles 1.01
ML-KEM-768 decaps 70322 cycles 70217 cycles 1.00
ML-KEM-1024 keypair 73133 cycles 72353 cycles 1.01
ML-KEM-1024 encaps 81788 cycles 81163 cycles 1.01
ML-KEM-1024 decaps 101436 cycles 100914 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 18352 cycles 18199 cycles 1.01
ML-KEM-512 encaps 22243 cycles 22233 cycles 1.00
ML-KEM-512 decaps 28830 cycles 28985 cycles 0.99
ML-KEM-768 keypair 31033 cycles 30681 cycles 1.01
ML-KEM-768 encaps 33948 cycles 33725 cycles 1.01
ML-KEM-768 decaps 43387 cycles 43302 cycles 1.00
ML-KEM-1024 keypair 45020 cycles 44354 cycles 1.02
ML-KEM-1024 encaps 50309 cycles 49788 cycles 1.01
ML-KEM-1024 decaps 63131 cycles 62840 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 45740 cycles 45720 cycles 1.00
ML-KEM-512 encaps 56887 cycles 56867 cycles 1.00
ML-KEM-512 decaps 76250 cycles 76234 cycles 1.00
ML-KEM-768 keypair 74537 cycles 74544 cycles 1.00
ML-KEM-768 encaps 88575 cycles 88570 cycles 1.00
ML-KEM-768 decaps 114466 cycles 114433 cycles 1.00
ML-KEM-1024 keypair 109432 cycles 109465 cycles 1.00
ML-KEM-1024 encaps 127440 cycles 127494 cycles 1.00
ML-KEM-1024 decaps 160006 cycles 160139 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 52412 cycles 52178 cycles 1.00
ML-KEM-512 encaps 65433 cycles 65783 cycles 0.99
ML-KEM-512 decaps 88545 cycles 88428 cycles 1.00
ML-KEM-768 keypair 84398 cycles 84729 cycles 1.00
ML-KEM-768 encaps 102134 cycles 101502 cycles 1.01
ML-KEM-768 decaps 131336 cycles 132074 cycles 0.99
ML-KEM-1024 keypair 124791 cycles 124073 cycles 1.01
ML-KEM-1024 encaps 145257 cycles 145769 cycles 1.00
ML-KEM-1024 decaps 182747 cycles 183677 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 45389 cycles 45386 cycles 1.00
ML-KEM-512 encaps 54212 cycles 54214 cycles 1.00
ML-KEM-512 decaps 71145 cycles 71155 cycles 1.00
ML-KEM-768 keypair 74835 cycles 74823 cycles 1.00
ML-KEM-768 encaps 86077 cycles 86063 cycles 1.00
ML-KEM-768 decaps 108672 cycles 108802 cycles 1.00
ML-KEM-1024 keypair 111101 cycles 111111 cycles 1.00
ML-KEM-1024 encaps 125926 cycles 125936 cycles 1.00
ML-KEM-1024 decaps 154574 cycles 154635 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 42016 cycles 41978 cycles 1.00
ML-KEM-512 encaps 50090 cycles 50164 cycles 1.00
ML-KEM-512 decaps 66110 cycles 66049 cycles 1.00
ML-KEM-768 keypair 69111 cycles 69057 cycles 1.00
ML-KEM-768 encaps 79859 cycles 79763 cycles 1.00
ML-KEM-768 decaps 101129 cycles 101019 cycles 1.00
ML-KEM-1024 keypair 102212 cycles 102456 cycles 1.00
ML-KEM-1024 encaps 117206 cycles 117443 cycles 1.00
ML-KEM-1024 decaps 143653 cycles 143389 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 71292 cycles 71162 cycles 1.00
ML-KEM-512 encaps 85135 cycles 85065 cycles 1.00
ML-KEM-512 decaps 112641 cycles 112770 cycles 1.00
ML-KEM-768 keypair 117612 cycles 117261 cycles 1.00
ML-KEM-768 encaps 135290 cycles 135096 cycles 1.00
ML-KEM-768 decaps 172010 cycles 171735 cycles 1.00
ML-KEM-1024 keypair 175111 cycles 174233 cycles 1.01
ML-KEM-1024 encaps 197212 cycles 196442 cycles 1.00
ML-KEM-1024 decaps 243384 cycles 242511 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Bananapi bpi-f3 benchmarks

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 335027 cycles 335046 cycles 1.00
ML-KEM-512 encaps 445694 cycles 445607 cycles 1.00
ML-KEM-512 decaps 593806 cycles 593856 cycles 1.00
ML-KEM-768 keypair 556185 cycles 556062 cycles 1.00
ML-KEM-768 encaps 698052 cycles 697865 cycles 1.00
ML-KEM-768 decaps 890484 cycles 889403 cycles 1.00
ML-KEM-1024 keypair 821541 cycles 821286 cycles 1.00
ML-KEM-1024 encaps 998894 cycles 998065 cycles 1.00
ML-KEM-1024 decaps 1229586 cycles 1230119 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 58878 cycles 57964 cycles 1.02
ML-KEM-512 encaps 66398 cycles 65238 cycles 1.02
ML-KEM-512 decaps 84689 cycles 83987 cycles 1.01
ML-KEM-768 keypair 100048 cycles 97986 cycles 1.02
ML-KEM-768 encaps 111377 cycles 109131 cycles 1.02
ML-KEM-768 decaps 137465 cycles 135466 cycles 1.01
ML-KEM-1024 keypair 151970 cycles 148680 cycles 1.02
ML-KEM-1024 encaps 169161 cycles 164777 cycles 1.03
ML-KEM-1024 decaps 203998 cycles 200003 cycles 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

Benchmark suite Current: 29a1909 Previous: 271b362 Ratio
ML-KEM-512 keypair 52188 cycles 52303 cycles 1.00
ML-KEM-512 encaps 58976 cycles 59330 cycles 0.99
ML-KEM-512 decaps 75202 cycles 75340 cycles 1.00
ML-KEM-768 keypair 88727 cycles 88072 cycles 1.01
ML-KEM-768 encaps 97462 cycles 96577 cycles 1.01
ML-KEM-768 decaps 120596 cycles 119275 cycles 1.01
ML-KEM-1024 keypair 133786 cycles 132209 cycles 1.01
ML-KEM-1024 encaps 147253 cycles 144750 cycles 1.02
ML-KEM-1024 decaps 178307 cycles 175884 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@mkannwischer mkannwischer left a comment

This does not result in better performance on any uArch. Why would you like to add this?

@mkannwischer mkannwischer force-pushed the aarch64_poly_frombytes branch from 34bd687 to 112e6ef Compare December 17, 2024 23:06
@hanno-becker hanno-becker force-pushed the aarch64_poly_frombytes branch from 112e6ef to c1da0ac Compare December 18, 2024 03:34
@hanno-becker hanno-becker added and removed the benchmark label (this PR should be benchmarked in CI) Dec 18, 2024
@hanno-becker
Contributor Author

> This does not result in better performance on any uArch. Why would you like to add this?

@mkannwischer This is part of an exploration with @jargh on whether it's useful to use the transposed NTT order for AArch64 (as we do for x86_64). However, since poly_{to,from}bytes() work in NTT domain, they need adjusting, and therefore we currently require that native implementations be present.

@hanno-becker
Contributor Author

@mkannwischer Looks like removing the existing (clean, admittedly) poly_tobytes() also has no bearing.

@hanno-becker hanno-becker force-pushed the aarch64_poly_frombytes branch from c1da0ac to 4f0f2e8 Compare December 18, 2024 06:49
@hanno-becker hanno-becker changed the title from "AArch64: Add native poly_frombytes() implementation" to "[DRAFT] AArch64: Use transposed coefficient order in NTT domain" Dec 18, 2024
This commit adds an AArch64 implementation for `poly_frombytes()`.
Like the already existing `poly_tobytes()`, we do not yet optimize
it using SLOTHY, but work with the clean version in both the clean
and the optimized backend. Applying SLOTHY to both needs work
on the (micro)architecture models first.

Signed-off-by: Hanno Becker <[email protected]>
This commit modifies the AArch64 arithmetic backend to use a transposed
order of polynomial coefficients in NTT domain.

- In the forward NTT, this saves a st4
- In the inverse NTT, this saves a ld4
- No cost in the base multiplication: We merely need to shuffle
  the twiddles for the mulcache computation, which is done through
  a change to `autogenerate_files.py`.
- A temporary change is made to `polyvec.c`, adding the permutation
  before/after to/from bytes. This will be removed once those functions
  are adjusted to respect the custom order.
- For now, the coefficient permutation is written in simple-minded C.
  This will be removed in a subsequent commit.

Signed-off-by: Hanno Becker <[email protected]>
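The commit message above mentions adjusting the to/from bytes functions to respect the custom order. One way to picture that is a serializer that folds the permutation into its read index instead of permuting first. The C sketch below is only an illustration: the index map `custom_order_index` (blocks of 32, i.e. 4 registers of 8 int16 lanes) is an assumption, the function names are hypothetical, and coefficients are assumed to be already reduced to 12 bits. The byte packing itself is the standard ML-KEM 12-bit encoding.

```c
#include <stddef.h>
#include <stdint.h>

#define N 256
#define POLYBYTES 384 /* 256 coefficients x 12 bits */

/* Hypothetical map from standard index k to its position in the custom
 * order (assumed blocks of 32 = 4 registers x 8 lanes). */
size_t custom_order_index(size_t k)
{
    size_t r = k % 32;
    return (k / 32) * 32 + (r % 4) * 8 + (r / 4);
}

/* Pack two 12-bit coefficients into three bytes (ML-KEM ByteEncode_12). */
void pack_pair(uint8_t r[3], uint16_t t0, uint16_t t1)
{
    r[0] = (uint8_t)(t0 & 0xff);
    r[1] = (uint8_t)((t0 >> 8) | ((t1 & 0x0f) << 4));
    r[2] = (uint8_t)(t1 >> 4);
}

/* Serialize a polynomial stored in standard order. */
void tobytes_standard(uint8_t r[POLYBYTES], const int16_t a[N])
{
    for (size_t i = 0; i < N / 2; i++)
        pack_pair(r + 3 * i, (uint16_t)a[2 * i], (uint16_t)a[2 * i + 1]);
}

/* Serialize a polynomial stored in the assumed custom order: the
 * permutation is undone on the fly through the read index, so the byte
 * output matches the standard-order serialization. */
void tobytes_custom_order(uint8_t r[POLYBYTES], const int16_t a[N])
{
    for (size_t i = 0; i < N / 2; i++)
        pack_pair(r + 3 * i,
                  (uint16_t)a[custom_order_index(2 * i)],
                  (uint16_t)a[custom_order_index(2 * i + 1)]);
}
```

The property that matters is that both serializers produce identical bytes for the same polynomial, so the wire format is unaffected by the internal coefficient order.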
Also, add clean AArch64 assembly for custom order permutation.

Signed-off-by: Hanno Becker <[email protected]>
This commit reoptimizes NTT and invNTT with the custom order,
using SLOTHY.

We copy the clean versions of `poly_{tobytes,frombytes,transpose}`
for now.

This finishes a prototype of the optimized AArch64 backend using
the custom NTT order.

Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker hanno-becker force-pushed the aarch64_poly_frombytes branch from 4f0f2e8 to 29a1909 Compare December 18, 2024 06:53
@hanno-becker hanno-becker added and removed the benchmark label (this PR should be benchmarked in CI) Dec 18, 2024
@hanno-becker hanno-becker marked this pull request as draft December 18, 2024 07:18
@hanno-becker
Contributor Author

Agreed with @jargh that we are not pursuing this for now.

Labels
benchmark this PR should be benchmarked in CI

3 participants