Add AVX2 polyvec_{de,}compress
#410
Conversation
Together with #409, this PR achieves this performance on my machine:
That's outperforming the code from the official Kyber repo.
This commit adds the AVX2 intrinsic implementations of polyvec_compress and polyvec_decompress from the official Kyber repository. As part of #224 it was identified that the majority of the performance difference in keypair and decaps between our current implementation and the Kyber AVX2 implementation is due to the AVX2 polyvec_compress and polyvec_decompress. This commit adds these two functions to the native interface and adds the AVX2 intrinsic-based implementations from the Kyber repository. These are almost verbatim copies. The only two differences are:

1) The AVX2 implementation requires the uint8_t buffer to be slightly larger than MLKEM_POLYVECCOMPRESSEDBYTES, so that full vectors can be stored/loaded. The official implementation allocated those extra bytes at the top level of the function. That would be slightly messy in our implementation, so I instead allocate the larger buffer in polyvec_compress_avx2/polyvec_decompress_avx2 itself and copy the inputs/outputs.

2) The official AVX2 implementation extended the poly type to also be accessible as a __m256i*. I changed this to a cast, as we guarantee the alignment in another way.

Below are the performance results on my 13th Gen Intel i7-1360P (Raptor Lake) using gcc 14.2.1 from the Arch Linux repo.

| part     | Our code 6aa6118 | Kyber repo | Our code (+polyvec_{,de}compress) |
| -------- | ---------------- | ---------- | --------------------------------- |
| 512 kg   | 22353            | 22348      | 22252                             |
| 512 enc  | 27820            | 24868      | 26472                             |
| 512 dec  | 35663            | 34984      | 33107                             |
| 768 kg   | 39626            | 38070      | 41590                             |
| 768 enc  | 43605            | 39056      | 44049                             |
| 768 dec  | 54916            | 53726      | 53432                             |
| 1024 kg  | 58983            | 53532      | 57411                             |
| 1024 enc | 65402            | 56698      | 61613                             |
| 1024 dec | 80370            | 75874      | 74681                             |

Signed-off-by: Matthias J. Kannwischer <[email protected]>
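The padded-buffer approach from point 1) can be sketched as follows. This is a simplified illustration, not the actual mlkem-native code: the size constants, the kernel stub, and the wrapper name are all hypothetical stand-ins.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for illustration only; the real values come from
 * the parameter set and the AVX2 kernel's store width. */
#define POLYVECCOMPRESSEDBYTES 960
#define AVX2_PAD 32 /* slack so the kernel can issue full 32-byte stores */

/* Stand-in for the verbatim Kyber AVX2 kernel, which may write up to
 * AVX2_PAD bytes past POLYVECCOMPRESSEDBYTES into its output buffer. */
static void avx2_kernel_stub(uint8_t *out)
{
  memset(out, 0x42, POLYVECCOMPRESSEDBYTES + AVX2_PAD);
}

/* The wrapper allocates the padded scratch buffer locally and copies only
 * the valid prefix into the caller's exactly-sized output, so callers never
 * have to over-allocate. */
void polyvec_compress_avx2_sketch(uint8_t r[POLYVECCOMPRESSEDBYTES])
{
  uint8_t buf[POLYVECCOMPRESSEDBYTES + AVX2_PAD];
  avx2_kernel_stub(buf);
  memcpy(r, buf, POLYVECCOMPRESSEDBYTES);
}
```

The cost of the extra memcpy is small compared to the compression kernel itself, and it keeps the padding an internal detail of the native backend.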
After the recent merges (adding PMU support, adding LTO support), let's re-do the benchmarks with this PR rebased on top of 75f52dc. TL;DR: We do see performance gains for encaps and decaps of up to 8% with this PR. This is consistent across platforms (Raptor Lake, c7i) and compiler versions (gcc 13.2.0, gcc 14.2.1). Here are the results on my Raptor Lake (gcc 14.2.1)
this PR (0561120):
Here are the results on the EC2 c7i instance (gcc 13.2.0):
this PR (0561120):
Component benchmarks (trimmed to just polyvec_compress and polyvec_decompress)

Here are the results on my Raptor Lake (gcc 14.2.1)
this PR (0561120):
Here are the results on the EC2 c7i instance (gcc 13.2.0):
this PR (0561120):
BENCH("polyvec-compress",
      polyvec_compress((uint8_t *)data0, (polyvec *)data1));
BENCH("polyvec-decompress",
      polyvec_decompress((polyvec *)data0, (uint8_t *)data1));
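The BENCH invocations above rely on the project's benchmarking harness. A minimal sketch of what such a macro could look like is below; the shape is assumed, and the real harness reads PMU cycle counters rather than wall-clock time.

```c
#include <stdio.h>
#include <time.h>

#define BENCH_ITERATIONS 1000

/* Minimal BENCH-style macro (illustrative only): run `stmt` repeatedly and
 * report the average wall-clock time per call. */
#define BENCH(label, stmt)                                              \
  do                                                                    \
  {                                                                     \
    clock_t bench_start = clock();                                      \
    for (int bench_i = 0; bench_i < BENCH_ITERATIONS; bench_i++)        \
    {                                                                   \
      stmt;                                                             \
    }                                                                   \
    clock_t bench_end = clock();                                        \
    printf("%s: %.1f ns/call\n", (label),                               \
           1e9 * (double)(bench_end - bench_start) / CLOCKS_PER_SEC /   \
               BENCH_ITERATIONS);                                       \
  } while (0)
```

A cycle-accurate harness would replace clock() with a PMU read and typically subtract measurement overhead, but the macro structure stays the same.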
this needs to be removed.
@mkannwischer Could you also measure with
@mkannwischer Would you mind doing the PR in smaller steps, as follows:
- Hoist out the polynomial compression into a separate function first. This will require suffixing existing polynomial [de]compression routines with the `d`-value -- which is anyway a good idea and in line with our naming for `scalar_[de]compress`.
- Then, allow native replacement of those poly compress/decompress variants.
The reason I'd like to do it this way is: (a) It's cleaner. (b) Once we have the polynomial [de]compression hoisted out, it's easier to investigate how it could be rewritten for better auto-vectorization.
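To make the naming suggestion concrete, here is a sketch of what d-suffixed scalar [de]compression routines could look like for d = 4. The names and signatures are illustrative, built from the standard ML-KEM compression formula round(2^d * u / q) mod 2^d with q = 3329; they are not copied from the repository.

```c
#include <stdint.h>

#define MLKEM_Q 3329

/* Compress u in [0, q) to d = 4 bits: round(2^4 * u / q) mod 2^4.
 * The d-value is spelled out in the function name, per the suggestion. */
static uint32_t scalar_compress_d4(uint32_t u)
{
  return (((u << 4) + MLKEM_Q / 2) / MLKEM_Q) & 0xF;
}

/* Decompress 4 bits back to [0, q): round(q * t / 2^4). */
static uint32_t scalar_decompress_d4(uint32_t t)
{
  return (t * MLKEM_Q + 8) >> 4;
}
```

Suffixing with the d-value makes each variant's output range explicit at the call site and gives the native interface one well-defined hook per (poly, d) combination.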
Here are results for clang. TL;DR: clang is not much better at auto-vectorizing this. In these results it looks like clang overall performs much worse on Raptor Lake vs. gcc 14, and a bit better on c7i vs. gcc 13. I tried to re-run the gcc 14 benchmarks on Raptor Lake today and I cannot reproduce the numbers I got yesterday. Maybe I made a mistake yesterday - but hopefully that won't matter for this PR.

Raptor Lake (clang 18.1.8), main (75f52dc):
this PR (0561120):
c7i (clang 18.1.3)
this PR (0561120):
Okay, I'll do this as a separate PR first.
After #435 was merged, this would require a major rework.