Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX2: Document bounds in NTT and invNTT #469

Merged
merged 5 commits into from
Dec 2, 2024
Merged

Conversation

hanno-becker
Copy link
Contributor

@hanno-becker hanno-becker commented Dec 2, 2024

Resolves #222

This commit documents the AVX2 assembly for the [inv]NTT in the x86_64 native backend. In particular, it tracks the bounds of data through the various layers.

The invNTT implementation is highly aggressive in terms of minimizing the number of modular reductions, which makes a pen-and-paper analysis rather difficult.

At the last invNTT layer, the bounds reasoning applied is not enough to show absence of overflow. To remedy, an additional reduction is added.

The analysis for the forward NTT is straightforward.

Ultimately, we confirm the absolute bound of 8*MLKEM_Q for the output of forward and inverse NTT; this is the contractual bound that the higher-level C-code is working with.

@hanno-becker hanno-becker marked this pull request as ready for review December 2, 2024 07:26
@hanno-becker hanno-becker added the benchmark this PR should be benchmarked in CI label Dec 2, 2024
Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A76 (Raspberry Pi 5) benchmarks

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 29196 cycles 29175 cycles 1.00
ML-KEM-512 encaps 35819 cycles 35853 cycles 1.00
ML-KEM-512 decaps 46702 cycles 46654 cycles 1.00
ML-KEM-768 keypair 49191 cycles 49177 cycles 1.00
ML-KEM-768 encaps 55788 cycles 55840 cycles 1.00
ML-KEM-768 decaps 71078 cycles 71008 cycles 1.00
ML-KEM-1024 keypair 72097 cycles 72212 cycles 1.00
ML-KEM-1024 encaps 81628 cycles 81519 cycles 1.00
ML-KEM-1024 decaps 101987 cycles 102036 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 4th gen (c7i)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 13849 cycles 13826 cycles 1.00
ML-KEM-512 encaps 18185 cycles 18063 cycles 1.01
ML-KEM-512 decaps 24139 cycles 24004 cycles 1.01
ML-KEM-768 keypair 22439 cycles 22434 cycles 1.00
ML-KEM-768 encaps 24500 cycles 24441 cycles 1.00
ML-KEM-768 decaps 32535 cycles 32445 cycles 1.00
ML-KEM-1024 keypair 32133 cycles 32157 cycles 1.00
ML-KEM-1024 encaps 35693 cycles 35654 cycles 1.00
ML-KEM-1024 decaps 47298 cycles 47250 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 20319 cycles 20318 cycles 1.00
ML-KEM-512 encaps 27107 cycles 27066 cycles 1.00
ML-KEM-512 decaps 36193 cycles 36184 cycles 1.00
ML-KEM-768 keypair 34850 cycles 34841 cycles 1.00
ML-KEM-768 encaps 38246 cycles 38138 cycles 1.00
ML-KEM-768 decaps 51396 cycles 51357 cycles 1.00
ML-KEM-1024 keypair 48039 cycles 48087 cycles 1.00
ML-KEM-1024 encaps 55259 cycles 54213 cycles 1.02
ML-KEM-1024 decaps 72072 cycles 72098 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 18143 cycles 18000 cycles 1.01
ML-KEM-512 encaps 23115 cycles 23070 cycles 1.00
ML-KEM-512 decaps 30349 cycles 30404 cycles 1.00
ML-KEM-768 keypair 31005 cycles 31036 cycles 1.00
ML-KEM-768 encaps 33974 cycles 33948 cycles 1.00
ML-KEM-768 decaps 44745 cycles 44836 cycles 1.00
ML-KEM-1024 keypair 44672 cycles 44636 cycles 1.00
ML-KEM-1024 encaps 50109 cycles 49978 cycles 1.00
ML-KEM-1024 decaps 64630 cycles 64674 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 3rd gen (c6a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 6c644cb Previous: b2c6403 Ratio
ML-KEM-768 encaps 37303 cycles 33948 cycles 1.10

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 4th gen (c7a)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 15074 cycles 15081 cycles 1.00
ML-KEM-512 encaps 19795 cycles 19775 cycles 1.00
ML-KEM-512 decaps 26411 cycles 26426 cycles 1.00
ML-KEM-768 keypair 25676 cycles 25593 cycles 1.00
ML-KEM-768 encaps 28216 cycles 28200 cycles 1.00
ML-KEM-768 decaps 38455 cycles 38171 cycles 1.01
ML-KEM-1024 keypair 35732 cycles 35590 cycles 1.00
ML-KEM-1024 encaps 41267 cycles 40946 cycles 1.01
ML-KEM-1024 decaps 54682 cycles 54551 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 36223 cycles 36238 cycles 1.00
ML-KEM-512 encaps 46498 cycles 46211 cycles 1.01
ML-KEM-512 decaps 61867 cycles 61564 cycles 1.00
ML-KEM-768 keypair 58987 cycles 59051 cycles 1.00
ML-KEM-768 encaps 73020 cycles 73103 cycles 1.00
ML-KEM-768 decaps 91585 cycles 91700 cycles 1.00
ML-KEM-1024 keypair 88574 cycles 88798 cycles 1.00
ML-KEM-1024 encaps 109431 cycles 109136 cycles 1.00
ML-KEM-1024 decaps 133579 cycles 132893 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton4

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 18268 cycles 18266 cycles 1.00
ML-KEM-512 encaps 22490 cycles 22477 cycles 1.00
ML-KEM-512 decaps 29420 cycles 29400 cycles 1.00
ML-KEM-768 keypair 30756 cycles 30772 cycles 1.00
ML-KEM-768 encaps 34058 cycles 34057 cycles 1.00
ML-KEM-768 decaps 43869 cycles 43908 cycles 1.00
ML-KEM-1024 keypair 44475 cycles 44483 cycles 1.00
ML-KEM-1024 encaps 50220 cycles 50207 cycles 1.00
ML-KEM-1024 decaps 63655 cycles 63635 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A55 (Snapdragon 888) benchmarks

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 58024 cycles 58010 cycles 1.00
ML-KEM-512 encaps 65658 cycles 65637 cycles 1.00
ML-KEM-512 decaps 84333 cycles 84284 cycles 1.00
ML-KEM-768 keypair 98242 cycles 98320 cycles 1.00
ML-KEM-768 encaps 110011 cycles 110552 cycles 1.00
ML-KEM-768 decaps 136662 cycles 136538 cycles 1.00
ML-KEM-1024 keypair 148739 cycles 149031 cycles 1.00
ML-KEM-1024 encaps 166058 cycles 165978 cycles 1.00
ML-KEM-1024 decaps 202392 cycles 201821 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 56689 cycles 56690 cycles 1.00
ML-KEM-512 encaps 71736 cycles 71796 cycles 1.00
ML-KEM-512 decaps 96251 cycles 96315 cycles 1.00
ML-KEM-768 keypair 91793 cycles 91895 cycles 1.00
ML-KEM-768 encaps 111401 cycles 111512 cycles 1.00
ML-KEM-768 decaps 144404 cycles 144626 cycles 1.00
ML-KEM-1024 keypair 134664 cycles 134620 cycles 1.00
ML-KEM-1024 encaps 160240 cycles 160055 cycles 1.00
ML-KEM-1024 decaps 201254 cycles 201479 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 45768 cycles 45940 cycles 1.00
ML-KEM-512 encaps 58849 cycles 58757 cycles 1.00
ML-KEM-512 decaps 80021 cycles 79973 cycles 1.00
ML-KEM-768 keypair 74675 cycles 74621 cycles 1.00
ML-KEM-768 encaps 91845 cycles 91441 cycles 1.00
ML-KEM-768 decaps 120364 cycles 120123 cycles 1.00
ML-KEM-1024 keypair 109877 cycles 110170 cycles 1.00
ML-KEM-1024 encaps 131139 cycles 131077 cycles 1.00
ML-KEM-1024 decaps 167658 cycles 167564 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 52551 cycles 52297 cycles 1.00
ML-KEM-512 encaps 67427 cycles 67747 cycles 1.00
ML-KEM-512 decaps 92722 cycles 92585 cycles 1.00
ML-KEM-768 keypair 85050 cycles 84626 cycles 1.01
ML-KEM-768 encaps 104160 cycles 104787 cycles 0.99
ML-KEM-768 decaps 137437 cycles 137690 cycles 1.00
ML-KEM-1024 keypair 124904 cycles 125413 cycles 1.00
ML-KEM-1024 encaps 150166 cycles 149628 cycles 1.00
ML-KEM-1024 decaps 192020 cycles 192160 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton2

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 29201 cycles 29176 cycles 1.00
ML-KEM-512 encaps 35816 cycles 35859 cycles 1.00
ML-KEM-512 decaps 46727 cycles 46666 cycles 1.00
ML-KEM-768 keypair 49202 cycles 49247 cycles 1.00
ML-KEM-768 encaps 55812 cycles 55930 cycles 1.00
ML-KEM-768 decaps 71067 cycles 71065 cycles 1.00
ML-KEM-1024 keypair 72101 cycles 72173 cycles 1.00
ML-KEM-1024 encaps 81618 cycles 81620 cycles 1.00
ML-KEM-1024 decaps 102042 cycles 102069 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton4 (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 41957 cycles 41913 cycles 1.00
ML-KEM-512 encaps 51806 cycles 51758 cycles 1.00
ML-KEM-512 decaps 69322 cycles 69321 cycles 1.00
ML-KEM-768 keypair 69067 cycles 69069 cycles 1.00
ML-KEM-768 encaps 82699 cycles 82684 cycles 1.00
ML-KEM-768 decaps 106408 cycles 106475 cycles 1.00
ML-KEM-1024 keypair 102228 cycles 102446 cycles 1.00
ML-KEM-1024 encaps 120595 cycles 120643 cycles 1.00
ML-KEM-1024 decaps 150588 cycles 150588 cycles 1

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton3

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 19029 cycles 19031 cycles 1.00
ML-KEM-512 encaps 23867 cycles 23841 cycles 1.00
ML-KEM-512 decaps 31240 cycles 31230 cycles 1.00
ML-KEM-768 keypair 32332 cycles 32332 cycles 1
ML-KEM-768 encaps 36082 cycles 36083 cycles 1.00
ML-KEM-768 decaps 46487 cycles 46488 cycles 1.00
ML-KEM-1024 keypair 46956 cycles 46969 cycles 1.00
ML-KEM-1024 encaps 53092 cycles 53088 cycles 1.00
ML-KEM-1024 decaps 67380 cycles 67404 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton2 (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 71217 cycles 71083 cycles 1.00
ML-KEM-512 encaps 87802 cycles 87838 cycles 1.00
ML-KEM-512 decaps 118209 cycles 118147 cycles 1.00
ML-KEM-768 keypair 117578 cycles 117503 cycles 1.00
ML-KEM-768 encaps 139343 cycles 139289 cycles 1.00
ML-KEM-768 decaps 180000 cycles 180001 cycles 1.00
ML-KEM-1024 keypair 174961 cycles 174233 cycles 1.00
ML-KEM-1024 encaps 202851 cycles 201851 cycles 1.00
ML-KEM-1024 decaps 254466 cycles 253746 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton3 (no-opt)

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 45372 cycles 45361 cycles 1.00
ML-KEM-512 encaps 56181 cycles 56173 cycles 1.00
ML-KEM-512 decaps 75115 cycles 75100 cycles 1.00
ML-KEM-768 keypair 74831 cycles 74868 cycles 1.00
ML-KEM-768 encaps 89159 cycles 89191 cycles 1.00
ML-KEM-768 decaps 114542 cycles 114549 cycles 1.00
ML-KEM-1024 keypair 111071 cycles 111080 cycles 1.00
ML-KEM-1024 encaps 129959 cycles 129991 cycles 1.00
ML-KEM-1024 decaps 162524 cycles 162576 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

Benchmark suite Current: a2bf116 Previous: b2c6403 Ratio
ML-KEM-512 keypair 51368 cycles 51400 cycles 1.00
ML-KEM-512 encaps 58332 cycles 58666 cycles 0.99
ML-KEM-512 decaps 75140 cycles 74987 cycles 1.00
ML-KEM-768 keypair 87566 cycles 87491 cycles 1.00
ML-KEM-768 encaps 96622 cycles 97259 cycles 0.99
ML-KEM-768 decaps 120045 cycles 120726 cycles 0.99
ML-KEM-1024 keypair 131183 cycles 130977 cycles 1.00
ML-KEM-1024 encaps 145281 cycles 145186 cycles 1.00
ML-KEM-1024 decaps 177641 cycles 177592 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker hanno-becker added benchmark this PR should be benchmarked in CI and removed benchmark this PR should be benchmarked in CI labels Dec 2, 2024
This commit documents the AVX2 assembly for the [inv]NTT in the x86_64
native backend. In particular, it tracks the bounds of data through
the various layers.

The invNTT implementation is highly aggressive in terms of minimizing
the number of modular reductions, which makes a pen-and-paper analysis
rather difficult.

At the last invNTT layer, the bounds reasoning applied is not enough to
show absence of overflow. To remedy, an additional reduction is added.

The analysis for the forward NTT is straightforward.

Ultimately, we confirm the absolute bound of 8*MLKEM_Q for the output
of forward and inverse NTT; this is the contractual bound that the
higher-level C-code is working with.

Signed-off-by: Hanno Becker <[email protected]>
The AVX2 inverse NTT aggressively minimizes the number of modular
reductions required. This is correct to our knowledge, but it is
difficult to reason about: As the previous commit shows, one has
to take into account rather tight bounds for the absolute value
of a Montgomery multiplication.

This commit adds another modular reduction to the AVX2 inverse NTT.
While not strictly necessary, it simplifies the bounds reasoning
considerably, and does not come at a meaningful performance cost.

Signed-off-by: Hanno Becker <[email protected]>
This commit adds a few comments to the AVX2 assembly for the
base multiplication in NTT domain. In particular, it explains
that each coefficient in the output of the basemul is bound by
2q in absolute value, and that, thus, the native AVX2 implementation
of polyvec_basemul_acc_montgomery_cached_avx2() does not overflow.

Signed-off-by: Hanno Becker <[email protected]>
Copy link
Contributor

@mkannwischer mkannwischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked that you have not modified the code except for the added/moved reductions.
I read through the comments and checked that they make sense.

I do not know how to improve the bounds reasoning to avoid these extra reductions - this is something to check with the author.
However, we can go ahead and merge this now and remove the reductions later when we understand it.

mlkem/native/x86_64/shuffle.inc Outdated Show resolved Hide resolved
@mkannwischer
Copy link
Contributor

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 3rd gen (c6a)'. Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 6c644cb Previous: b2c6403 Ratio
ML-KEM-768 encaps 37303 cycles 33948 cycles 1.10
This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker - is this benchmark correct or was that some noise? 10% slowdown on single platforms sounds wrong.

@hanno-becker hanno-becker merged commit 3ae93fe into main Dec 2, 2024
41 checks passed
@hanno-becker hanno-becker deleted the avx2_documentation branch December 2, 2024 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmark this PR should be benchmarked in CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Doublecheck and explain bounds for AVX2 NTT and invNTT
3 participants