-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVX2 implementation of mulcache #579
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @dkostic for working that out! Up to 5% performance uplift is worth taking, esp. since it simplifies the code and makes it more uniform across backends.
Left some small comments. The only non trivial question is whether AVX2 allows to accumulate in 32bit, in which case we can defer the reduction to after accumulation, in line with the C and AArch64 backends, also further improving performance.
cf2616a
to
b057693
Compare
0b1d420
to
379daa0
Compare
379daa0
to
59cf85b
Compare
@dkostic There is a segfault in the |
mlkem/native/x86_64/src/basemul.c
Outdated
|
||
void poly_mulcache_compute_avx2(poly_mulcache *x, const poly *y) | ||
{ | ||
int j; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we aim to use assembly-level verification using HOL-Light, could we implement this in ASM rather than intrinsics, please? ISTM that this should essentilly be the code followingTODO: This could be precomputed in the mulcache
in (the old) basemul.S
.
(I'm aware there's more intrinsics-based code, but it will all need to be rewritten, so let's not introduce more of it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, you want me to do it as part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferably, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
59cf85b
to
1136927
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI is still failing in a few places. Once pq-code-package/tsc#125 is merged, you will be able to open a PR from the main repo, and run the full CI yourself.
@dkostic Ok this is done now. You have developer rights, so should be able to push branches to the main repo and open PRs from there (closing this one). Can you do that for this PR? |
Slight performance improvement: |--------------------------------------| | MLKEM 512 | Before | After | Improv | | keypair | 15101 | 14835 | 1.02x | | encaps | 19664 | 18631 | 1.05x | | decaps | 25824 | 24420 | 1.05x | |--------------------------------------| | MLKEM 768 | Before | After | Improv | | keypair | 26187 | 25157 | 1.04x | | encaps | 28014 | 27248 | 1.03x | | decaps | 36989 | 35659 | 1.03x | |--------------------------------------| | MLKEM 1024 | Before | After | Improv | | keypair | 36014 | 35630 | 1.01x | | encaps | 39797 | 39347 | 1.00x | | decaps | 52139 | 51524 | 1.01x | |--------------------------------------| measured on Intel(R) Xeon(R) Platinum 8488C. Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
Signed-off-by: dkostic <[email protected]>
5f5ec06
to
9c00e77
Compare
Signed-off-by: dkostic <[email protected]>
I'd prefer if we can finish this PR as is. I can start pushing to the main repo with the next one. |
@dkostic I'm afraid we definitely need to move the PR because the full CI does not run otherwise. I would have done it in case you hadn't yet got the necessary rights -- but now you do! 🙂 But if it's too much bother for you, I can also do it - let me know. |
oh, so it was you who started the CI for my last update? nevertheless, is this the development process for the project:
|
No I think the last update was automatic; but because of the way we authenticate to EC2 (OICD), large parts of the CI can currently only run for PRs from the main repo.
Yes! |
closing in favor of #624 |
Summary:
AVX2 implementation of mulcache
Slight performance improvement:
measured on Intel(R) Xeon(R) Platinum 8488C.
Addresses #477 .
Steps:
If your pull request consists of multiple sequential changes, please describe them here:
Performed local tests:
lint
passingtests all
passingtests bench
passingtests cbmc
passingDo you expect this change to impact performance: Yes, see above.