-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove redundant reduction from unpack_pk()
and unpack_sk()
#505
Conversation
84dbfaa
to
dab62af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
29195 cycles |
29175 cycles |
1.00 |
ML-KEM-512 encaps |
35496 cycles |
35852 cycles |
0.99 |
ML-KEM-512 decaps |
46064 cycles |
46653 cycles |
0.99 |
ML-KEM-768 keypair |
49187 cycles |
49176 cycles |
1.00 |
ML-KEM-768 encaps |
55325 cycles |
55842 cycles |
0.99 |
ML-KEM-768 decaps |
70046 cycles |
71009 cycles |
0.99 |
ML-KEM-1024 keypair |
72112 cycles |
72213 cycles |
1.00 |
ML-KEM-1024 encaps |
81038 cycles |
81522 cycles |
0.99 |
ML-KEM-1024 decaps |
100783 cycles |
102032 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intel Xeon 4th gen (c7i)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
13870 cycles |
13835 cycles |
1.00 |
ML-KEM-512 encaps |
18054 cycles |
18205 cycles |
0.99 |
ML-KEM-512 decaps |
23951 cycles |
24104 cycles |
0.99 |
ML-KEM-768 keypair |
22450 cycles |
22551 cycles |
1.00 |
ML-KEM-768 encaps |
24406 cycles |
24623 cycles |
0.99 |
ML-KEM-768 decaps |
32402 cycles |
32838 cycles |
0.99 |
ML-KEM-1024 keypair |
32043 cycles |
30856 cycles |
1.04 |
ML-KEM-1024 encaps |
35433 cycles |
34532 cycles |
1.03 |
ML-KEM-1024 decaps |
47063 cycles |
46135 cycles |
1.02 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Intel Xeon 4th gen (c7i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03
.
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-1024 keypair |
32043 cycles |
30856 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AMD EPYC 3rd gen (c6a)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
18139 cycles |
18122 cycles |
1.00 |
ML-KEM-512 encaps |
22898 cycles |
23102 cycles |
0.99 |
ML-KEM-512 decaps |
30131 cycles |
30423 cycles |
0.99 |
ML-KEM-768 keypair |
31053 cycles |
31078 cycles |
1.00 |
ML-KEM-768 encaps |
33981 cycles |
33973 cycles |
1.00 |
ML-KEM-768 decaps |
44635 cycles |
44794 cycles |
1.00 |
ML-KEM-1024 keypair |
44528 cycles |
44464 cycles |
1.00 |
ML-KEM-1024 encaps |
49885 cycles |
49908 cycles |
1.00 |
ML-KEM-1024 decaps |
64218 cycles |
64600 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intel Xeon 3rd gen (c6i)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
20311 cycles |
20339 cycles |
1.00 |
ML-KEM-512 encaps |
26992 cycles |
27100 cycles |
1.00 |
ML-KEM-512 decaps |
35985 cycles |
36228 cycles |
0.99 |
ML-KEM-768 keypair |
34794 cycles |
34857 cycles |
1.00 |
ML-KEM-768 encaps |
38042 cycles |
38334 cycles |
0.99 |
ML-KEM-768 decaps |
51015 cycles |
51398 cycles |
0.99 |
ML-KEM-1024 keypair |
47888 cycles |
47925 cycles |
1.00 |
ML-KEM-1024 encaps |
53897 cycles |
54096 cycles |
1.00 |
ML-KEM-1024 decaps |
71557 cycles |
72059 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intel Xeon 4th gen (c7i) (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
36013 cycles |
36184 cycles |
1.00 |
ML-KEM-512 encaps |
44779 cycles |
46464 cycles |
0.96 |
ML-KEM-512 decaps |
58465 cycles |
61924 cycles |
0.94 |
ML-KEM-768 keypair |
59472 cycles |
59104 cycles |
1.01 |
ML-KEM-768 encaps |
70549 cycles |
72980 cycles |
0.97 |
ML-KEM-768 decaps |
88420 cycles |
91716 cycles |
0.96 |
ML-KEM-1024 keypair |
88094 cycles |
88370 cycles |
1.00 |
ML-KEM-1024 encaps |
104847 cycles |
109377 cycles |
0.96 |
ML-KEM-1024 decaps |
127666 cycles |
133626 cycles |
0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arm Cortex-A55 (Snapdragon 888) benchmarks
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
58032 cycles |
58001 cycles |
1.00 |
ML-KEM-512 encaps |
65079 cycles |
65710 cycles |
0.99 |
ML-KEM-512 decaps |
83098 cycles |
84327 cycles |
0.99 |
ML-KEM-768 keypair |
98170 cycles |
98213 cycles |
1.00 |
ML-KEM-768 encaps |
109841 cycles |
110478 cycles |
0.99 |
ML-KEM-768 decaps |
134544 cycles |
136579 cycles |
0.99 |
ML-KEM-1024 keypair |
148719 cycles |
149118 cycles |
1.00 |
ML-KEM-1024 encaps |
164662 cycles |
166023 cycles |
0.99 |
ML-KEM-1024 decaps |
198996 cycles |
201670 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graviton3
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
19050 cycles |
19029 cycles |
1.00 |
ML-KEM-512 encaps |
23675 cycles |
23842 cycles |
0.99 |
ML-KEM-512 decaps |
30823 cycles |
31230 cycles |
0.99 |
ML-KEM-768 keypair |
32341 cycles |
32329 cycles |
1.00 |
ML-KEM-768 encaps |
35814 cycles |
36085 cycles |
0.99 |
ML-KEM-768 decaps |
45938 cycles |
46490 cycles |
0.99 |
ML-KEM-1024 keypair |
46947 cycles |
46967 cycles |
1.00 |
ML-KEM-1024 encaps |
52711 cycles |
53087 cycles |
0.99 |
ML-KEM-1024 decaps |
66631 cycles |
67402 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graviton2
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
29200 cycles |
29174 cycles |
1.00 |
ML-KEM-512 encaps |
35508 cycles |
35857 cycles |
0.99 |
ML-KEM-512 decaps |
46073 cycles |
46664 cycles |
0.99 |
ML-KEM-768 keypair |
49210 cycles |
49189 cycles |
1.00 |
ML-KEM-768 encaps |
55349 cycles |
55864 cycles |
0.99 |
ML-KEM-768 decaps |
70072 cycles |
70980 cycles |
0.99 |
ML-KEM-1024 keypair |
72184 cycles |
72212 cycles |
1.00 |
ML-KEM-1024 encaps |
81115 cycles |
81653 cycles |
0.99 |
ML-KEM-1024 decaps |
100910 cycles |
102097 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AMD EPYC 3rd gen (c6a) (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
52600 cycles |
52351 cycles |
1.00 |
ML-KEM-512 encaps |
65406 cycles |
67804 cycles |
0.96 |
ML-KEM-512 decaps |
88422 cycles |
92660 cycles |
0.95 |
ML-KEM-768 keypair |
85134 cycles |
84684 cycles |
1.01 |
ML-KEM-768 encaps |
100891 cycles |
104911 cycles |
0.96 |
ML-KEM-768 decaps |
131042 cycles |
137837 cycles |
0.95 |
ML-KEM-1024 keypair |
125029 cycles |
125494 cycles |
1.00 |
ML-KEM-1024 encaps |
145900 cycles |
149742 cycles |
0.97 |
ML-KEM-1024 decaps |
183516 cycles |
192272 cycles |
0.95 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AMD EPYC 4th gen (c7a)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
15077 cycles |
15195 cycles |
0.99 |
ML-KEM-512 encaps |
19786 cycles |
19753 cycles |
1.00 |
ML-KEM-512 decaps |
26383 cycles |
26409 cycles |
1.00 |
ML-KEM-768 keypair |
25601 cycles |
25693 cycles |
1.00 |
ML-KEM-768 encaps |
28205 cycles |
28282 cycles |
1.00 |
ML-KEM-768 decaps |
38068 cycles |
38236 cycles |
1.00 |
ML-KEM-1024 keypair |
35525 cycles |
35523 cycles |
1.00 |
ML-KEM-1024 encaps |
40582 cycles |
40677 cycles |
1.00 |
ML-KEM-1024 decaps |
54288 cycles |
54488 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intel Xeon 3rd gen (c6i) (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
56574 cycles |
56728 cycles |
1.00 |
ML-KEM-512 encaps |
69246 cycles |
71823 cycles |
0.96 |
ML-KEM-512 decaps |
91230 cycles |
96355 cycles |
0.95 |
ML-KEM-768 keypair |
91807 cycles |
91849 cycles |
1.00 |
ML-KEM-768 encaps |
107745 cycles |
111503 cycles |
0.97 |
ML-KEM-768 decaps |
136743 cycles |
144538 cycles |
0.95 |
ML-KEM-1024 keypair |
134418 cycles |
134562 cycles |
1.00 |
ML-KEM-1024 encaps |
155046 cycles |
159971 cycles |
0.97 |
ML-KEM-1024 decaps |
191432 cycles |
201413 cycles |
0.95 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graviton3 (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
45363 cycles |
45367 cycles |
1.00 |
ML-KEM-512 encaps |
54218 cycles |
56174 cycles |
0.97 |
ML-KEM-512 decaps |
71251 cycles |
75096 cycles |
0.95 |
ML-KEM-768 keypair |
74834 cycles |
74864 cycles |
1.00 |
ML-KEM-768 encaps |
86211 cycles |
89191 cycles |
0.97 |
ML-KEM-768 decaps |
108714 cycles |
114550 cycles |
0.95 |
ML-KEM-1024 keypair |
111061 cycles |
111084 cycles |
1.00 |
ML-KEM-1024 encaps |
125973 cycles |
130004 cycles |
0.97 |
ML-KEM-1024 decaps |
154732 cycles |
162583 cycles |
0.95 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AMD EPYC 4th gen (c7a) (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
45973 cycles |
45729 cycles |
1.01 |
ML-KEM-512 encaps |
56925 cycles |
58795 cycles |
0.97 |
ML-KEM-512 decaps |
76223 cycles |
80001 cycles |
0.95 |
ML-KEM-768 keypair |
74643 cycles |
74644 cycles |
1.00 |
ML-KEM-768 encaps |
88674 cycles |
91486 cycles |
0.97 |
ML-KEM-768 decaps |
114563 cycles |
120140 cycles |
0.95 |
ML-KEM-1024 keypair |
110155 cycles |
110191 cycles |
1.00 |
ML-KEM-1024 encaps |
127240 cycles |
131275 cycles |
0.97 |
ML-KEM-1024 decaps |
159951 cycles |
167645 cycles |
0.95 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graviton4
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
18268 cycles |
18264 cycles |
1.00 |
ML-KEM-512 encaps |
22311 cycles |
22478 cycles |
0.99 |
ML-KEM-512 decaps |
29090 cycles |
29397 cycles |
0.99 |
ML-KEM-768 keypair |
30783 cycles |
30770 cycles |
1.00 |
ML-KEM-768 encaps |
33800 cycles |
34058 cycles |
0.99 |
ML-KEM-768 decaps |
43358 cycles |
43907 cycles |
0.99 |
ML-KEM-1024 keypair |
44506 cycles |
44487 cycles |
1.00 |
ML-KEM-1024 encaps |
49856 cycles |
50207 cycles |
0.99 |
ML-KEM-1024 decaps |
63002 cycles |
63634 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graviton2 (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
71218 cycles |
71079 cycles |
1.00 |
ML-KEM-512 encaps |
85054 cycles |
87838 cycles |
0.97 |
ML-KEM-512 decaps |
112664 cycles |
118157 cycles |
0.95 |
ML-KEM-768 keypair |
117473 cycles |
117329 cycles |
1.00 |
ML-KEM-768 encaps |
135129 cycles |
139224 cycles |
0.97 |
ML-KEM-768 decaps |
171645 cycles |
179987 cycles |
0.95 |
ML-KEM-1024 keypair |
174987 cycles |
175295 cycles |
1.00 |
ML-KEM-1024 encaps |
197508 cycles |
202760 cycles |
0.97 |
ML-KEM-1024 decaps |
243514 cycles |
254085 cycles |
0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graviton4 (no-opt)
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
41965 cycles |
41917 cycles |
1.00 |
ML-KEM-512 encaps |
50130 cycles |
51762 cycles |
0.97 |
ML-KEM-512 decaps |
66009 cycles |
69326 cycles |
0.95 |
ML-KEM-768 keypair |
69094 cycles |
69060 cycles |
1.00 |
ML-KEM-768 encaps |
79800 cycles |
82680 cycles |
0.97 |
ML-KEM-768 decaps |
100993 cycles |
106478 cycles |
0.95 |
ML-KEM-1024 keypair |
102224 cycles |
102236 cycles |
1.00 |
ML-KEM-1024 encaps |
117184 cycles |
120656 cycles |
0.97 |
ML-KEM-1024 decaps |
143349 cycles |
150644 cycles |
0.95 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bananapi bpi-f3 benchmarks
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
368994 cycles |
369202 cycles |
1.00 |
ML-KEM-512 encaps |
508057 cycles |
521758 cycles |
0.97 |
ML-KEM-512 decaps |
682264 cycles |
709553 cycles |
0.96 |
ML-KEM-768 keypair |
619098 cycles |
625015 cycles |
0.99 |
ML-KEM-768 encaps |
798949 cycles |
826476 cycles |
0.97 |
ML-KEM-768 decaps |
1023862 cycles |
1071745 cycles |
0.96 |
ML-KEM-1024 keypair |
921776 cycles |
921796 cycles |
1.00 |
ML-KEM-1024 encaps |
1145354 cycles |
1172924 cycles |
0.98 |
ML-KEM-1024 decaps |
1420208 cycles |
1474585 cycles |
0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks
Benchmark suite | Current: dab62af | Previous: f981699 | Ratio |
---|---|---|---|
ML-KEM-512 keypair |
51310 cycles |
52197 cycles |
0.98 |
ML-KEM-512 encaps |
58091 cycles |
59431 cycles |
0.98 |
ML-KEM-512 decaps |
73729 cycles |
75460 cycles |
0.98 |
ML-KEM-768 keypair |
86975 cycles |
87682 cycles |
0.99 |
ML-KEM-768 encaps |
94905 cycles |
98211 cycles |
0.97 |
ML-KEM-768 decaps |
118337 cycles |
120510 cycles |
0.98 |
ML-KEM-1024 keypair |
131009 cycles |
131711 cycles |
0.99 |
ML-KEM-1024 encaps |
144294 cycles |
146371 cycles |
0.99 |
ML-KEM-1024 decaps |
175486 cycles |
178985 cycles |
0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
1ea6feb
to
be01930
Compare
requires(a > -(2 * 4096 * 32768)) | ||
requires(a < (2 * 4096 * 32768)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's a <
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hanno-becker - this is great and results in a surprisingly large performance gain on some platforms of up to 5%.
One nit.
The specifications for the polynomial base multiplication currently require one input to be bound by q in absolute value. To ensure this, the unpacking routines `unpack_pk()` and `unpack_sk()` currently explicitly reduce their output via `poly_reduce()`. Functionally, this reduction is not needed: - For the PK, the modulus-check already ensures that the unpacked PK will be unsigned canonical. - For the SK, there is no requirement in the standard to check that the SK is reduced, or reduce it after unpacking. - There is no risk of overflow in the base multiplication even if the inputs are bound by 4096 rather than q: - For the C base multiplication, the input bound before modular reduction is now 2 * 4096 * 2^15, where before it was 2 * q * 2^15 -- both is below 2^31 so we are fine. After reduction, the 3/2 * q bound is weakened to 2 * q (a much better bound would be possible, but is unneeded), which is still sufficient for up to K=4-fold accumulation since 8 * q < INT16_MAX. - For the AArch64 base multiplication, we are accumulating in 32-bit up to a single final reduction. As in the C case, each 32-bit polynomial multiplication is bound by 2 * 4096 * 2^15 = 2^28, so the accumulated value before reduction is < 2^30 < INT32_MAX. - For the x86_64 base multiplication, the reasoning is unchanged: We don't accumulate in 32-bit but conduct a Montgomery multiplication for every monomial in the product, giving an output bound by q per monomial. Up to 8 such are accumulated, giving a bound by 8 * q < INT16_MAX. This commit strengthens the specifications for all polynomial base multiplication routines - basemul_cached - poly_basemul_montgomery_cached - polyvec_basemul_acc_montgomery - polyvec_basemul_acc_montgomery_cached - matvec_mul The polynomial reductions in `check_pk` and `check_sk` can then be safely removed. Signed-off-by: Hanno Becker <[email protected]>
be01930
to
6fc7eb4
Compare
Thanks for the change! |
Resolves #480
The specifications for the polynomial base multiplication currently require one input to be bound by MLKEM_Q in absolute value. To ensure this, the unpacking routines
unpack_pk()
andunpack_sk()
currently explicitly reduce their output viapoly_reduce()
.Functionally, this reduction is not needed:
This commit strengthens the specifications for all polynomial base multiplication routines
The polynomial reductions in
check_pk
andcheck_sk
can then be safely removed.